Wednesday, March 18, 2009

This the third posting in the series on heterogeneous computing. The first two were:

1.       Heterogeneous Computing using GPGPUs and FPGAs

2.       Heterogeneous Computing using GPGPUs:  NVidia GT200

 

This post looks more deeply at the AMD/ATI RV770.

 

The latest GPU from AMD/ATI is the RV770 architecture.  The processor contains 10 SIMD cores, each with 16 streaming processor (SP) units.   The SIMD cores are similar to NVidia’s Texture Processor Cluster (TPC) units (the NVidia GT200 also has 10 of these), and the 10*16 = 160 SPs are “execution thread granularity” similar to NVidia’s SP units (GT200 has 240 of these).  Unlike NVidia’s design which executes 1 instruction per thread, each SP on the RV770 executes packed 5-wide VLIW-style instructions.  For graphics and visualization workloads, floating point intensity is high enough to average about 4.2 useful operations  per cycle.  On dense data parallel operations (ex. dense matrix multiply), all 5 ALUs can easily be used.

 

The ALUs in each SP are named x, y, z, w and t.  x, y, z and w are symmetric, and capable of retiring a single precision floating point multiply-add per cycle.  The t unit is a Special Function Unit (SFU) capable of everything an xyzw ALU can do, plus transcendental functions like sin, cos, etc.  There is also a branch unit in each SP to deal with shader program branches.

 

From this information, we can see that when people are talking about 800 “shader cores” or “threads” or “streaming processors”, they are actually referring to the 10*16*5 = 800 xyzwt ALUs.  This can be confusing, because there are really only 160 simultaneous instruction pipelines.  Also, both NVidia and AMD use symmetric single issue streaming multiprocessor architectures, so branches are handled very differently from CPUs. 

 

The RV770 is used in the desktop Radeon 4850 and 4870 video cards, and evidently the “workstation” FireStream 9250 and FirePro V8700.  The Radeon 48x0 X2 “enthusiast desktop” cards have two RV770s on the same card. Like NVidia Quadro cards, the typical difference between the “desktop” and “workstation” cards is that the workstation card has anti-aliased (AA) line capability enabled (primarily for the CAD market) and it costs 5-10 times as much.    

 

[The computing cores always have AA line capability, so it’s probably more accurate to say that the desktop cards have this capability disabled.  Theoretically, foundry binning could sort processors with hard faults in the “anti-aliased line hardware” as “desktop” processors.  However, this probably never really happens since this is just a tiny bit of instruction decode logic or microcode that sends “lines” to shared setup logic that triangles are computed on.  Likewise, the NVidia Tesla boards are just GT200 processors with potentially some extra compliance testing and more (non-ECC) board memory.  Arguably, these artificially maintained high margin product lines are what keep these companies profitable; industrial design subsidizes gamers!]

 

Double precision floating point is accomplished by fusing the xyzw ALUs within an SP into two pairs.  These two double units can perform either multiply or add (but not both) each cycle.  The t unit is unaffected by this fused mode, and ALU/transcendental operations can be co-scheduled alongside the doubles just like with single precision-only VLIW issue.

 

Local card memory is 512MB of GDDR3 for the 4850 and 1GB of GDDR5 for the 4870.  Both use a 256 bit wide bus, but GDDR3 is 2 channel while GDDR5 is 4 channel.

 

Let’s look at peak performance numbers for the Radeon 4870, clocked at reference 750MHz.  Keep in mind that all of the ALUs are capable of multiply-add instructions (2 flop/cycle):

= 750MHz/s * 10 SPMD * 16 SIMD/SPMD * 5 ALU/SIMD * 2 flop/cycle per ALU

= 1200000M flop/s = 1.2 TFlop/s

For double precision:

= 750MHz/s * 10 * 16 * 2 “double FPU” * 1 Flop/cycle per “double FPU”

= 240 GFlop/s double precision + 240 GFlop/s single precision on the 160 t SFUs

 

Reference memory frequency is 900 MHz:

= 900MHz/s * 4 channels * 256 bits/channel = 115 GB/s

 

Here are peak performance numbers for some RV770 cards:

                                                Single                    Double                 Bandwidth          TDP Power          Cost

·         Radeon 4850      1000 GFlop/s      200 GFlop/s        64 GB/s                180W                     $130

·         Radeon 4870      1200                       240                         115                         200                         $180

·         4850 X2                 2000                       400                         127                         230                         $255

·         4870 X2                 2400                       480                         230                         285                         $420

·         FireStrm 9250    1000                       200                         64                           180                         $790       (same as 4850)

·         FirePro V8700    1200                       240                         115                         200                         $1130    (same as 4870)

 

The Radeon 4850 X2 is the cheapest compute capability per retail dollar available outside of DSPs and fixed function ASICs.  However, it’s bandwidth is very low compared to floating point horsepower – if it executes less than 63 floating point instructions for every F32 piece of data that must be fetched from memory, then memory bandwidth will be the bottleneck!  The 4870 is better balance at a computational intensity breakpoint of 42.  However, NVidia’s cards are applicable to a wider range of workloads; the GTX 285 has a breakpoint of 27 instructions (less compute power, more bandwidth).  For reference a Core i7 is about 16, and CPU caches are much bigger than GPU “caches” so there is a more opportunity to reuse data before fetching off-chip.

 

Thanks to Mike Marr for the research and the detailed write-up above. Errors or omissions are mine.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Wednesday, March 18, 2009 4:09:07 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
Comments are closed.

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

Archive
<March 2009>
SunMonTueWedThuFriSat
22232425262728
1234567
891011121314
15161718192021
22232425262728
2930311234

Categories
This Blog
Member Login
All Content © 2014, James Hamilton
Theme created by Christoph De Baene / Modified 2007.10.28 by James Hamilton