Heterogeneous Computing using GPGPUs and FPGAs

It’s not at all uncommon to have several different instruction sets employed in a single computer. Decades ago, IBM mainframes had I/O processing systems (channel processors). Most client systems have dedicated graphics processors. Many networking cards offload the transport stack (TCP/IP offload). These are all examples of special purpose processors used to support general computation; the application programmer doesn’t directly write code for them.

I define heterogeneous computing as the application of processors with different instruction set architectures (ISAs) under direct application programmer control. Even heterogeneous processing by this definition has been around for years, in that application programs have long had access to dedicated floating point coprocessors with instructions not found on the main CPU. FPUs were first shipped as coprocessors but have since been integrated on-chip with the general purpose CPU. FPU complexity has usually been hidden behind compilers that generated FPU instructions when needed or by math libraries that could be called directly by the application program.

It’s difficult enough to program symmetric multi-processors (SMPs) where the application program runs over many identical processors in parallel. Heterogeneous processing typically also employs more than one processor but these different processors don’t all share the same ISA. Why would anyone want to accept this complexity? Speed and efficiency. General purpose processors are, well, general. And as a rule, general purpose processors are easy to program but considerably less efficient than specialized processors at some operations. Graphics can be several orders of magnitude more efficient in silicon than in software and, as a consequence, almost all graphics is done on graphics processors. Network processing is another example of a very repetitive task where in-silicon implementations are at least an order of magnitude faster. As a consequence, it’s not unusual to see network switches where the control plane is implemented on a general purpose processor but the data plane is all done on an Application Specific Integrated Circuit (ASIC).

Looking at still more general systems that employ heterogeneous processing, newer supercomputers like RoadRunner, which took the top spot in the supercomputer Top500 list last June, are good examples. RoadRunner is a massive cluster of 6,562 X86 dual core processors and 12,241 IBM Cell processors. The Cell processor was originally designed by Sony, Toshiba, and IBM and was first commercially used in the Sony PlayStation 3. The Cell processors themselves are heterogeneous components made up of 9 cores: 1 control processor called the Power Processing Element (PPE) and 8 Synergistic Processing Elements (SPEs). The bulk of the application performance comes from the SPEs but they can’t run without the PPE, which hosts the operating system and manages the SPEs. Although RoadRunner consumes a prodigious 2.35MW, more than a small power plant, it is actually much more efficient than comparably performing systems not using heterogeneous processing.

Hardware specialization can be cheaper, faster, and far more power efficient, traits that are hard to ignore. Heterogeneous systems are beginning to look pretty interesting for some very important commercial workloads. Over the last 9 months I’ve been interested in two classes of heterogeneous systems and their application to commercial workloads:

· GPGPU: General Purpose computation on Graphics Processing Units (GPUs)

· FPGA: Field Programmable Gate Array (FPGA) Coprocessors

I’ve seen both techniques used experimentally in petroleum exploration (seismic analysis) and in hedge fund analysis clusters (financial calculations). GPGPUs are being used commercially in rendering farms. Research work is active across the board. Programming tools are emerging to make these systems easier to program.
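To give a sense of what GPGPU programming looks like with one of these emerging toolchains, Nvidia’s CUDA, here is a minimal sketch of a SAXPY computation (y = a*x + y). It isn’t drawn from any of the production systems mentioned above; the kernel and buffer names are illustrative only. The pattern is the interesting part: the host CPU owns the data, copies it to the GPU, launches the kernel across thousands of lightweight threads, and copies the result back.

#include <cstdio>
#include <cuda_runtime.h>

// Element-wise kernel: each GPU thread updates one element, y[i] = a * x[i] + y[i]
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;                 // 1M elements
    const size_t bytes = n * sizeof(float);

    // Host buffers
    float *hx = (float *)malloc(bytes);
    float *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    // Device buffers and host-to-device copies
    float *dx, *dy;
    cudaMalloc((void **)&dx, bytes);
    cudaMalloc((void **)&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    // Launch: one thread per element, 256 threads per block
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, 2.0f, dx, dy);

    // Copy the result back and spot-check one element
    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expect 4.0)\n", hy[0]);

    cudaFree(dx); cudaFree(dy);
    free(hx); free(hy);
    return 0;
}

Compiled with nvcc (e.g., nvcc saxpy.cu -o saxpy), the same source file carries both the host and device code, which is a big part of why tools like CUDA have lowered the barrier to GPGPU programming relative to the earlier shader-based approaches.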

Heterogeneous computing is being used commercially and usage is spreading rapidly. In the next two articles I’ll post guest blog entries from Mike Marr describing the hardware architecture for two GPUs, the Nvidia GT200 and the AMD RV770. In a subsequent article I’ll look more closely at a couple of FPGA options available for mainstream heterogeneous programming.

–jrh

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com

H:mvdirona.com | W:mvdirona.com/jrh/work | blog:http://perspectives.mvdirona.com

2 comments on “Heterogeneous Computing using GPGPUs and FPGAs”
  1. Larrabee is indeed going to be an interesting system.

    You are correct that FPGAs have worse perf/watt than ASICs. However, FPGAs are more malleable and cheaper to evolve, so a common technique is to use design tools that can target both FPGAs and ASICs. Early prototypes and beta units are FPGA-hosted and, once the design is working well, it is moved to ASIC to get the advantage of smaller, faster, and lower power operation. For very low volume applications, an ASIC may never make sense and sticking with FPGAs might be the more viable approach.

    It depends upon expected volume and the current stability of your design. I see room for both ASICs and FPGAs.

    –jrh

  2. Yang Luo says:

    It will be interesting to see how Intel’s Larrabee x86 multicore performs as a heterogeneous computing platform. Programmability does matter.

    Some research also shows that FPGAs have much worse perf/watt than ASICs:

    Kuon, I. and Rose, J. 2006. Measuring the gap between FPGAs and ASICs. In Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays (Monterey, California, USA, February 22–24, 2006). FPGA ’06. ACM, New York, NY, 21–30.
