In Heterogeneous Computing using GPGPUs and FPGAs I looked at heterogeneous computing: the application of multiple instruction set architectures within a single application program under direct programmer control. Heterogeneous computing has been around for years, but usage has been restricted to fairly small niches. I’m predicting that we’re going to see abrupt and steep growth over the next couple of years. Delivering results for many workloads more cheaply, faster, and more power efficiently, coupled with improved programming tools, is going to vault GPGPU programming into being a much more common technique available to everyone.
Following on from the previous posting, Heterogeneous Computing using GPGPUs and FPGAs, in this one we’ll take a detailed look at the NVidia GT200 GPU architecture and, in the next, the AMD/ATI RV770.
The latest NVidia GPU is called the GT200 (“GT” stands for: Graphics Tesla). The processor contains 10 Texture/Processor Clusters (TPC), each with 3 Single Program Multiple Data (SPMD) computing cores, which NVidia calls Streaming Multiprocessors (SM). Each SM has two instruction issue ports (I’ll call them Port 0 and Port 1):
· Port 0 can issue instructions to 1 of 3 groupings of functional units on any given cycle:
o “SIMT” (Single Instruction Multiple Thread) instructions to 8 single precision floating point units, marketed as “Stream Processors” (SP), a.k.a. thread processors or shader cores
o a double precision floating point unit
o an 8-way branch unit that manages state for the SIMT execution (basically, it deals with branch instructions in shader programs)
· Port 1 can issue instructions to two Special Function Units (SFU) each of which can process packed 4-wide vectors. The SFUs perform transcendental operations like sin, cos, etc. or single precision multiplies (like the Intel SSE instruction: MULPS)
From this information, you can derive some common marketing numbers for this hardware:
· “240 stream processors” are the 10*3*8 = 240 single precision FPUs on Port 0.
· “30 double precision pipelines” are the 10*3* 1 = 30 double precision FPUs on Port 0.
· “dual-issue” is the fact that you can (essentially) co-issue instructions to both Port 0 and Port 1.
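The arithmetic behind those marketing numbers can be sketched in a few lines of Python. The variable names are mine, chosen for illustration; the unit counts are the ones given in the description above.

```python
# Unit counts taken from the GT200 description in the text.
TPC_PER_GPU = 10   # Texture/Processor Clusters per chip
SM_PER_TPC = 3     # Streaming Multiprocessors per TPC
SP_PER_SM = 8      # single precision units per SM (Port 0)
DP_PER_SM = 1      # double precision units per SM (Port 0)

sm_count = TPC_PER_GPU * SM_PER_TPC          # SMs on the whole chip
stream_processors = sm_count * SP_PER_SM     # the "240 stream processors"
dp_pipelines = sm_count * DP_PER_SM          # the "30 double precision pipelines"

print(sm_count, stream_processors, dp_pipelines)
```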
The GT200 is used in the line of “GeForce GTX 2xx” commodity video cards (e.g., GeForce GTX 280) and the Tesla C1060 [there will also be a Quadro NVS part]. The Tesla S1070 is a PCI bridge that packages four Tesla C1060s into a 1U rack unit. Since it is just a bridge, it still requires a host rack unit to drive the GPUs. The GeForce GTX 295 packages two GT200 processors on the same card (similar to AMD Radeon 48xx X2 cards).
Total transistor count is 1.4B, about twice that of a quad-core Intel Core i7 or an AMD RV770. The GeForce GTX 2x5 parts (e.g., GeForce GTX 285) are die-shrunk versions of the original core: 55nm vs. 65nm. On the original 65nm process, the GT200 was 583.2 mm², or about 6 times the surface area of a dual-core Penryn. A 300mm wafer produced only 94 processors (where 45nm Atom processors would yield about 2500).
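The 94-per-wafer figure is consistent with a commonly used dies-per-wafer approximation that subtracts an edge-loss term from the simple area ratio. The sketch below is my own sanity check, not the source of the number above. It ignores scribe lines, edge exclusion, and defects, so it estimates candidate dies rather than good ones, and the function name is hypothetical.

```python
import math

def dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> float:
    """Approximate gross dies per wafer: area ratio minus an edge-loss term."""
    d, s = wafer_diameter_mm, die_area_mm2
    area_ratio = math.pi * d * d / (4.0 * s)       # wafer area / die area
    edge_loss = math.pi * d / math.sqrt(2.0 * s)   # partial dies lost at the rim
    return area_ratio - edge_loss

GT200_AREA_MM2 = 583.2   # 65nm die area quoted in the text
WAFER_MM = 300

upper_bound = math.pi * WAFER_MM**2 / (4 * GT200_AREA_MM2)
estimate = dies_per_wafer(WAFER_MM, GT200_AREA_MM2)
print(f"area ratio: {upper_bound:.1f}, with edge loss: {estimate:.1f}")
```

The plain area ratio comes out at about 121 dies, and the edge-loss term brings the estimate down to just under 94, which shows how much of a wafer a die this large wastes at the rim.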
Local card memory is GDDR3 configured as 2 channels with a bus width of 512 bits – typically 1GB.
The original GTX 260 was a GT200 with 2 of the 10 TPC units disabled (for a total of 24 SMs or 192 SPs), presumably to deal with manufacturing hard faults in some of the cores. It also disabled part of the memory bus: 448 bits instead of 512, and consequently local memory is only 896MB. [Disabling parts of a chip is now a common manufacturing strategy to more fully monetize die yields on modular circuit designs. Intel has been doing this for years with L2 caches.] As the fab process improved, NVidia started shipping the GTX 260-216, which disables only 1 of the TPCs, and is apparently the only GTX 260 part that is actually being manufactured nowadays (216 = 9*3*8, so it refers to the number of shader cores).
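The GTX 260 unit counts follow from the same per-TPC arithmetic as the full chip. Here is a minimal sketch. The helper names are hypothetical, and the memory calculation assumes capacity scales linearly with active bus width, which matches the 896MB figure but is my simplification.

```python
SM_PER_TPC = 3   # Streaming Multiprocessors per TPC
SP_PER_SM = 8    # single precision shader cores per SM

def shader_cores(active_tpcs: int) -> int:
    """Shader cores available when only `active_tpcs` TPCs are enabled."""
    return active_tpcs * SM_PER_TPC * SP_PER_SM

def local_memory_mb(bus_bits: int, full_bus_bits: int = 512,
                    full_memory_mb: int = 1024) -> float:
    """Assumes memory capacity scales linearly with active bus width."""
    return full_memory_mb * bus_bits / full_bus_bits

for name, tpcs in [("GTX 260 (original)", 8), ("GTX 260-216", 9), ("GTX 280", 10)]:
    print(f"{name}: {shader_cores(tpcs)} shader cores")

print(f"448-bit bus: {local_memory_mb(448):.0f} MB")
```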
Let’s look at peak performance numbers for the GTX 280, reference clocked at 1296 MHz. Notice that Port 0 instructions can be multiply-adds (2 flop/cycle) and Port 1 instructions are just multiplies (1 flop/cycle):
1296 MHz * 30 SM * (8 SP/SM * 2 flop/cycle per SP + 2 SFU/SM * 4 FPU/SFU * 1 flop/cycle per FPU)
= Port 0 throughput + Port 1 throughput = 622,080 Mflop/s + 311,040 Mflop/s = 933,120 Mflop/s, or about 933 GFlop/s single precision
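As a quick check of the single precision arithmetic, the per-port contributions can be computed separately. This is only a sketch of the calculation in the text; the constant names are mine.

```python
CLOCK_MHZ = 1296      # GTX 280 reference clock from the text
SM_COUNT = 30         # 10 TPCs * 3 SMs

# Port 0: 8 SPs per SM, each doing a multiply-add (2 flop) per cycle.
port0_mflops = CLOCK_MHZ * SM_COUNT * 8 * 2

# Port 1: 2 SFUs per SM, each 4 wide, each lane doing a multiply (1 flop) per cycle.
port1_mflops = CLOCK_MHZ * SM_COUNT * 2 * 4 * 1

total_mflops = port0_mflops + port1_mflops
print(port0_mflops, port1_mflops, total_mflops)
print(f"{total_mflops / 1000:.0f} GFlop/s single precision")
```

Note that a third of the headline number comes from Port 1 multiplies, so code that cannot keep both ports busy tops out at about 622 GFlop/s.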
For double precision:
1296 MHz * 30 SM * 1 double precision FPU/SM * 2 flop/cycle = 77,760 Mflop/s, or about 78 GFlop/s
The Port 1 units can be co-issued with double precision instructions, so the GPU can also process 311 GFlop/s of single precision multiplies while doing double precision multiply-adds. [That’s probably not terribly useful without single precision adds though.]
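The double precision figure uses the same formula with a single unit per SM, which makes the gap between the two precisions easy to quantify. A small sketch follows; `peak_mflops` is a hypothetical helper, not anything from NVidia’s tooling.

```python
def peak_mflops(clock_mhz: int, sm_count: int,
                units_per_sm: int, flop_per_cycle: int) -> int:
    """Peak throughput in Mflop/s for one class of functional unit."""
    return clock_mhz * sm_count * units_per_sm * flop_per_cycle

# GTX 280 values from the text.
dp = peak_mflops(1296, 30, units_per_sm=1, flop_per_cycle=2)         # DP multiply-add
sp_port0 = peak_mflops(1296, 30, units_per_sm=8, flop_per_cycle=2)   # SP multiply-add
sp_port1 = peak_mflops(1296, 30, units_per_sm=8, flop_per_cycle=1)   # 2 SFUs * 4 lanes

print(f"double precision: {dp / 1000:.0f} GFlop/s")
print(f"single/double ratio: {(sp_port0 + sp_port1) / dp:.0f}x")
```

On these numbers, peak single precision is 12 times peak double precision, which is why double precision workloads see much less benefit from this generation of hardware.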
Reference memory frequency is 1107 MHz:
1107 MHz * 2 channels * 512 bits/channel / 8 bits per byte = 141,696 MB/s, or about 142 GB/s
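The bandwidth calculation includes an implicit bits-to-bytes conversion, which the sketch below makes explicit. The function name and parameter names are mine, and the factor of 2 is taken from the formula in the text.

```python
def peak_bandwidth_gb_s(mem_clock_mhz: float, multiplier: int, bus_bits: int) -> float:
    """Peak memory bandwidth in GB/s (decimal units: 1 GB = 1000 MB)."""
    mbits_per_s = mem_clock_mhz * multiplier * bus_bits   # megabits per second
    mbytes_per_s = mbits_per_s / 8                        # 8 bits per byte
    return mbytes_per_s / 1000

gtx280_bw = peak_bandwidth_gb_s(1107, 2, 512)
print(f"{gtx280_bw:.1f} GB/s")
```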
Here are the peak performance numbers for various parts:
Part            Single Precision (GFlop/s)    Double Precision (GFlop/s)    Bandwidth (GB/s)
GTX 260-216     805                           67                            112
GTX 280         933                           78                            142
GTX 285         1062                          89                            159
GTX 295         1789                          149                           224
Tesla C1060     933                           78                            102
Notice the GTX 285 breaks the single core 1 Teraflop/s barrier. The Tesla card has the lowest bandwidth; this is presumably because it carries 4GB of local memory instead of just 1GB as on the GTX 285 (more memory typically requires a lower bus clock rate). Finally, notice that even the GTX 285 still gets less than twice the double precision throughput of an AMD Phenom II 940 or Intel Core i7, both of which get about 50 GFlop/s for double precision and don’t require sophisticated latency-hiding data transfers or a complex programming model.
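To put the double precision comparison in numbers, the sketch below divides each part’s peak by the roughly 50 GFlop/s quoted for the CPUs. The figures are the rounded peaks from the table above, so the ratios are approximate, and they compare theoretical peaks rather than measured performance.

```python
CPU_DP_GFLOPS = 50.0   # approximate figure quoted for Phenom II 940 / Core i7

# Peak double precision GFlop/s per part, from the table in the text.
GPU_DP_GFLOPS = {
    "GTX 260-216": 67,
    "GTX 280": 78,
    "GTX 285": 89,
    "GTX 295": 149,      # two GT200 processors on one card
    "Tesla C1060": 78,
}

ratios = {part: dp / CPU_DP_GFLOPS for part, dp in GPU_DP_GFLOPS.items()}
for part, ratio in ratios.items():
    print(f"{part:12s} {ratio:.2f}x")
```

Only the dual-GPU GTX 295 exceeds twice the CPU figure; every single-GPU part stays below it.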
Thanks to Mike Marr for the research and the detailed write-up above. Errors or omissions are mine.
James Hamilton, Amazon Web Services
H:mvdirona.com | W:mvdirona.com/jrh/work | blog:http://perspectives.mvdirona.com
Disclaimer: The opinions expressed here are my own and do not
necessarily represent those of current or past employers.