This is the third posting in the series on heterogeneous computing. The first two were:

1. Heterogeneous Computing using GPGPUs and FPGAs

2. Heterogeneous Computing using GPGPUs: NVidia GT200

This post looks more deeply at the AMD/ATI RV770.

The latest GPU from AMD/ATI is the RV770 architecture. The processor contains 10 SIMD cores, each with 16 streaming processor (SP) units. The SIMD cores are similar to NVidia’s Texture Processor Cluster (TPC) units (the NVidia GT200 also has 10 of these), and the 10*16 = 160 SPs are the unit of “execution thread granularity”, similar to NVidia’s SP units (GT200 has 240 of these). Unlike NVidia’s design, which executes 1 instruction per thread, each SP on the RV770 executes packed 5-wide VLIW-style instructions. For graphics and visualization workloads, floating point intensity is high enough to average about 4.2 useful operations per cycle out of the 5 available. On dense data parallel operations (e.g. dense matrix multiply), all 5 ALUs can easily be used.

The ALUs in each SP are named x, y, z, w and t. x, y, z and w are symmetric, and capable of retiring a single precision floating point multiply-add per cycle. The t unit is a Special Function Unit (SFU) capable of everything an xyzw ALU can do, plus transcendental functions like sin, cos, etc. There is also a branch unit in each SP to deal with shader program branches.

From this information, we can see that when people are talking about 800 “shader cores” or “threads” or “streaming processors”, they are actually referring to the 10*16*5 = 800 xyzwt ALUs. This can be confusing, because there are really only 160 simultaneous instruction pipelines. Also, both NVidia and AMD use symmetric single issue streaming multiprocessor architectures, so branches are handled very differently from CPUs.
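The counting above can be sketched in a few lines of Python. The constant and function names are illustrative choices rather than AMD terminology; the numbers are the ones quoted in this post.

```python
# Unit counts for the RV770 as quoted in this post; names are illustrative.
SIMD_CORES = 10
SPS_PER_SIMD_CORE = 16
ALUS_PER_SP = 5  # the x, y, z, w and t units


def instruction_pipelines():
    """Number of SPs, each issuing one 5-wide VLIW instruction at a time."""
    return SIMD_CORES * SPS_PER_SIMD_CORE


def advertised_shader_count():
    """Number of individual ALUs, the figure usually quoted as 'shader cores'."""
    return instruction_pipelines() * ALUS_PER_SP


print(instruction_pipelines(), advertised_shader_count())
```

The gap between the two numbers is the VLIW width: the larger figure is only reached when the compiler can fill all five slots of every instruction.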

The RV770 is used in the desktop Radeon 4850 and 4870 video cards, and evidently the “workstation” FireStream 9250 and FirePro V8700. The Radeon 48x0 X2 “enthusiast desktop” cards have two RV770s on the same card. Like NVidia Quadro cards, the typical difference between the “desktop” and “workstation” cards is that the workstation card has anti-aliased (AA) line capability enabled (primarily for the CAD market) and it costs 5-10 times as much.

[The computing cores always have AA line capability, so it’s probably more accurate to say that the desktop cards have this capability disabled. Theoretically, foundry binning could sort processors with hard faults in the “anti-aliased line hardware” as “desktop” processors. However, this probably never really happens since this is just a tiny bit of instruction decode logic or microcode that sends “lines” to shared setup logic that triangles are computed on. Likewise, the NVidia Tesla boards are just GT200 processors with potentially some extra compliance testing and more (non-ECC) board memory. Arguably, these artificially maintained high margin product lines are what keep these companies profitable; industrial design subsidizes gamers!]

Double precision floating point is accomplished by fusing the xyzw ALUs within an SP into two pairs. These two double units can perform either multiply or add (but not both) each cycle. The t unit is unaffected by this fused mode, and ALU/transcendental operations can be co-scheduled alongside the doubles just like with single precision-only VLIW issue.

Local card memory is 512MB of GDDR3 for the 4850 and 1GB of GDDR5 for the 4870. Both use a 256 bit wide bus, but GDDR3 is 2 channel while GDDR5 is 4 channel.

Let’s look at peak performance numbers for the Radeon 4870, clocked at the reference 750 MHz. Keep in mind that all of the ALUs are capable of multiply-add instructions (2 flop/cycle):

= 750 MHz * 10 SIMD cores * 16 SP/SIMD core * 5 ALU/SP * 2 flop/cycle per ALU

= 1,200,000 MFlop/s = 1.2 TFlop/s
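As a quick check, the same arithmetic can be written as a small Python function. This is only a sketch of the calculation above; the function and parameter names are made up for illustration, and the defaults are the RV770 figures quoted in this post.

```python
def peak_single_gflops(clock_mhz, simd_cores=10, sps_per_core=16,
                       alus_per_sp=5, flops_per_alu_per_cycle=2):
    """Peak single precision rate in GFlop/s for an RV770-style part."""
    flops_per_cycle = (simd_cores * sps_per_core * alus_per_sp
                       * flops_per_alu_per_cycle)
    # MHz times flop/cycle gives MFlop/s; divide by 1000 for GFlop/s.
    return clock_mhz * flops_per_cycle / 1000.0


print(peak_single_gflops(750))  # 1200.0 GFlop/s, i.e. 1.2 TFlop/s
```

Changing `clock_mhz` shows how the peak scales linearly with core clock when the unit counts stay the same.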

For double precision:

= 750 MHz * 10 SIMD cores * 16 SP/SIMD core * 2 “double FPU”/SP * 1 flop/cycle per “double FPU”

= 240 GFlop/s double precision + 240 GFlop/s single precision on the 160 t SFUs
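The double precision figure and the leftover single precision capacity on the t units follow the same pattern. The sketch below assumes, as described above, two fused double units per SP at 1 flop/cycle each and one t unit per SP at 2 flop/cycle; the function names are illustrative.

```python
def peak_double_gflops(clock_mhz, sps=160, double_units_per_sp=2,
                       flops_per_unit_per_cycle=1):
    """Peak double precision rate in GFlop/s with the xyzw ALUs fused in pairs."""
    return clock_mhz * sps * double_units_per_sp * flops_per_unit_per_cycle / 1000.0


def t_unit_single_gflops(clock_mhz, sps=160, flops_per_t_unit_per_cycle=2):
    """Single precision rate still available on the t units in fused mode."""
    return clock_mhz * sps * flops_per_t_unit_per_cycle / 1000.0


print(peak_double_gflops(750), t_unit_single_gflops(750))  # 240.0 240.0
```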

Reference memory frequency is 900 MHz:

= 900 MHz * 4 (GDDR5 “channels”) * 256 bit bus / 8 bits per byte = 115 GB/s
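The bandwidth calculation, including the bits-to-bytes conversion, can be sketched as follows. The parameter name `channels` follows this post's wording for the per-clock multiplier (2 for GDDR3, 4 for GDDR5); the function name is an illustrative choice.

```python
def peak_bandwidth_gb_per_s(mem_clock_mhz, channels, bus_bits):
    """Peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bits_per_second = mem_clock_mhz * 1_000_000 * channels * bus_bits
    return bits_per_second / 8 / 1e9


# Radeon 4870: 900 MHz GDDR5 on a 256 bit bus
print(round(peak_bandwidth_gb_per_s(900, 4, 256), 1))  # 115.2
```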

Here are peak performance numbers for some RV770 cards:

Card              Single (GFlop/s)   Double (GFlop/s)   Bandwidth (GB/s)   TDP (W)   Cost ($)
Radeon 4850       1000               200                64                 180       130
Radeon 4870       1200               240                115                200       180
Radeon 4850 X2    2000               400                127                230       255
Radeon 4870 X2    2400               480                230                285       420
FireStream 9250   1000               200                64                 180       790    (same as 4850)
FirePro V8700     1200               240                115                200       1130   (same as 4870)

The Radeon 4850 X2 is the cheapest compute capability per retail dollar available outside of DSPs and fixed function ASICs. However, its bandwidth is very low compared to its floating point horsepower: if it executes fewer than 63 floating point operations for every F32 value that must be fetched from memory, then memory bandwidth will be the bottleneck! The 4870 is better balanced, with a computational intensity breakpoint of 42. However, NVidia’s cards are applicable to a wider range of workloads; the GTX 285 has a breakpoint of 27 (less compute power, more bandwidth). For reference, a Core i7 is about 16, and CPU caches are much bigger than GPU “caches”, so there is more opportunity to reuse data before fetching off-chip.
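The breakpoint is just peak flops divided by the number of F32 values that can be fetched per second. A minimal sketch, using the Radeon figures from the table above and assuming 4 bytes per F32 value (the dictionary and function names are illustrative):

```python
BYTES_PER_F32 = 4

# (single precision GFlop/s, bandwidth in GB/s, cost in dollars), from the table
CARDS = {
    "Radeon 4850": (1000, 64, 130),
    "Radeon 4870": (1200, 115, 180),
    "Radeon 4850 X2": (2000, 127, 255),
    "Radeon 4870 X2": (2400, 230, 420),
}


def intensity_breakpoint(gflops, gb_per_s):
    """Flops per F32 value fetched below which memory bandwidth is the limit."""
    giga_values_per_s = gb_per_s / BYTES_PER_F32
    return gflops / giga_values_per_s


def gflops_per_dollar(gflops, dollars):
    """Peak single precision GFlop/s per retail dollar."""
    return gflops / dollars


for name, (gflops, bandwidth, cost) in CARDS.items():
    print(f"{name}: breakpoint {intensity_breakpoint(gflops, bandwidth):.0f}, "
          f"{gflops_per_dollar(gflops, cost):.2f} GFlop/s per dollar")
```

With these inputs the 4850 X2 comes out highest on GFlop/s per dollar, and the breakpoints round to the 63 and 42 quoted above.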

Thanks to Mike Marr for the research and the detailed write-up above. Errors or omissions are mine.


James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859