A year and a half ago, I did a blog post titled Heterogeneous Computing using GPGPUs and FPGA. In that note I defined heterogeneous processing as the application of processors with different instruction set architectures (ISAs) under direct application programmer control and pointed out that this really isn't all that new a concept. We have had multiple ISAs in systems for years. IBM mainframes had I/O processors (Channel I/O Processors) with a different ISA than the general-purpose CPUs, many client systems have dedicated graphics coprocessors, and floating point units used to be separate from the CPU before that functionality was pulled onto the chip. The concept isn't new.
What is fairly new is 1) the practicality of implementing high-use software kernels in hardware and 2) the availability of commodity-priced parts capable of vector processing. Looking first at moving custom software kernels into hardware, Field Programmable Gate Arrays (FPGAs) are now targeted by some specialized high-level programming languages. You can now write in a subset of C++ and directly implement commonly used software kernels in hardware. These are still the very early days of this technology, but some financial institutions have been experimenting with moving computationally expensive financial calculations into FPGAs to save power and cost. See Platform-Based Electronic System-Level (ESL) Synthesis for more on this.
The second major category of heterogeneous computing is much further evolved and is now beginning to hit the mainstream. Graphics Processing Units (GPUs) are essentially vector processing units in disguise. They originally evolved as graphics accelerators, but it turns out a general purpose graphics processor can form the basis of an incredible Single Instruction Multiple Data (SIMD) processing engine. Commodity GPUs have most of the capabilities of the vector units in early supercomputers. What has been missing is ease of programming: the pace of innovation is high, and each model of GPU has differences in architecture and programming model, so it's almost impossible to write code that will run unchanged across a wide variety of different devices. And the large memories in these graphics processors typically are not ECC protected. An occasional pixel or two wrong doesn't really matter in graphical output, but you really do need ECC memory for server-side computing.
Essentially we have commodity vector processing units that are hard to program and lack ECC. What to do? Add ECC memory and an abstraction layer that hides many of the device-to-device differences. With those two changes, we have amazingly powerful vector units available at commodity prices. One abstraction layer that is getting fairly broad pickup is the Compute Unified Device Architecture, or CUDA, developed by NVIDIA. There are now CUDA runtime support libraries for many programming languages including C, FORTRAN, Python, Java, Ruby, and Perl.
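To give a feel for what programming against CUDA looks like, here is a minimal sketch of the classic SAXPY operation (y = a*x + y) in CUDA C. The kernel name, array sizes, and values are illustrative only, not from any particular application; the point is that each GPU thread processes one element, which is exactly the vector-unit-in-disguise behavior described above, while the host-side runtime calls (cudaMalloc, cudaMemcpy, the <<<...>>> launch) are the abstraction layer that stays stable as the underlying GPU changes.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Each thread computes one element: y[i] = a * x[i] + y[i].
// Every thread runs the same instructions on different data -- the SIMD model.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's "vector lane"
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;                        // 1M elements (illustrative)
    size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes);
    float *h_y = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);                      // allocate GPU memory
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, d_x, d_y);    // launch on the GPU

    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);      // copy back (synchronizes)
    printf("y[0] = %f (expect 4.0)\n", h_y[0]);

    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}

Assuming the CUDA toolkit is installed, something like nvcc saxpy.cu -o saxpy builds it, and the same source recompiles unchanged for newer GPU generations.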
Current generation GPUs are amazingly capable devices. I've covered the speeds and feeds of a couple in past postings: the ATI RV770 and the NVIDIA GT200.
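Different GPU generations like these differ in capability, and part of what makes the CUDA abstraction workable is that a program can query the device it lands on at run time rather than being written for one specific part. The short sketch below is illustrative only (standard CUDA runtime API; the fields printed are my choice); the ECCEnabled flag is also how ECC-protected memory on server-class parts shows up to software.

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);                   // how many GPUs are present?

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);      // capabilities of this GPU
        printf("GPU %d: %s, compute capability %d.%d, %zu MB, ECC %s\n",
               dev, prop.name, prop.major, prop.minor,
               (size_t)(prop.totalGlobalMem >> 20),
               prop.ECCEnabled ? "on" : "off");
    }
    return 0;
}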
Bringing it all together, we have commodity vector units with ECC and an abstraction layer that makes them easier to program and allows programs to run unchanged as devices are upgraded. Using GPUs to host general compute kernels is generically referred to as General Purpose Computation on Graphics Processing Units (GPGPU). So what is missing at this point? The pay-as-you-go economics of cloud computing.
You may recall I was excited last July when Amazon Web Services announced the Cluster Compute Instance: High Performance Computing Hits the Cloud. The EC2 Cluster Compute Instance is capable but lacks a GPU:
· 23GB memory with ECC
· 64GB/s main memory bandwidth
· 2 x Intel Xeon X5570 (quad-core Nehalem)
· 2 x 845GB HDDs
· 10Gbps Ethernet Network Interface
What Amazon Web Services just announced is a new instance type with the same core instance specifications as the Cluster Compute instance above and the same high-performance network, but with the addition of two NVIDIA Tesla M2050 GPUs in each server. See Supercomputing at 1/10th the Cost. Each of these GPGPUs is capable of over 512 GFLOPs, so with two of these units per server, there is a booming teraFLOP per node.
Each server in the cluster is equipped with a 10Gbps network interface card connected to a constant bisection bandwidth networking fabric. Any node can communicate with any other node at full line rate. It’s a thing of beauty and a forerunner of what every network should look like.
There are two full GPUs in each Cluster GPU instance, each of which dissipates 225W TDP. This felt high to me when I first saw it, but looking at work done per watt, it's actually incredibly good for workloads amenable to vector processing. The key to the power efficiency is the performance: at over 10x the performance of a quad-core x86, the package is both power efficient and cost efficient.
The new cg1.4xlarge EC2 instance type:
· 2 x NVIDIA Tesla M2050 GPUs
· 22GB memory with ECC
· 2 x Intel Xeon X5570 (quad-core Nehalem)
· 2 x 845GB HDDs
· 10Gbps Ethernet Network Interface
With this most recent announcement, AWS now has dual quad-core servers, each with dual GPGPUs, connected by a 10Gbps full-bisection bandwidth network for $2.10 per node hour. That's $2.10 per teraFLOP-hour. Wow.
· The Amazon Cluster GPU Instance type announcement: Announcing Cluster GPU Instances for EC2
· More information on the EC2 Cluster GPU instances: http://aws.amazon.com/ec2/hpc-applications/
–jrh
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com