Years ago I incorrectly believed special purpose hardware was a bad idea. What was actually a bad idea was high-markup, special purpose devices sold at low volume through expensive channels. Hardware implementations are often the best value when measured in work done per dollar and work done per joule. The newest breed of commodity networking parts from Broadcom, Fulcrum, Dune (now Broadcom), and others is a beautiful example of Application Specific Integrated Circuits being the right answer for extremely hot code kernels that change rarely.
I’ve long been interested in highly parallel systems and in heterogeneous processing. General Purpose Graphics Processors are firmly hitting the mainstream, with 17 of the Top 500 now using GPGPUs (Top 500: Chinese Supercomputer Reigns). You can now rent GPGPU clusters from EC2 for $2.10/server/hour, where each server has dual NVIDIA Tesla M2050 GPUs delivering a TeraFLOP per node. For more on GPGPUs, see HPC in the Cloud with GPGPUs and GPU Clusters in 10 minutes.
Some time ago Zach Hill sent me a paper on radix sort using GPGPUs. The paper shows how to achieve a better than 3x speedup on NVIDIA GT200-hosted systems. For most of us, sort isn’t the most important software kernel we run, but I did find the detail behind the GPGPU-specific optimizations interesting. The paper is at http://www.mvdirona.com/jrh/TalksAndPapers/RadixSortTRv2.pdf and the abstract is below.
This paper presents efficient strategies for sorting large sequences of fixed-length keys (and values) using GPGPU stream processors. Compared to the state-of-the-art, our radix sorting methods exhibit speedup of at least 2x for all generations of NVIDIA GPGPUs, and up to 3.7x for current GT200-based models. Our implementations demonstrate sorting rates of 482 million key-value pairs per second, and 550 million keys per second (32-bit). For this domain of sorting problems, we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture. These results motivate a different breed of parallel primitives for GPGPU stream architectures that can better exploit the memory and computational resources while maintaining the flexibility of a reusable component. Our sorting performance is derived from a parallel scan stream primitive that has been generalized in two ways: (1) with local interfaces for producer/consumer operations (visiting logic), and (2) with interfaces for performing multiple related, concurrent prefix scans (multi-scan).
As part of this work, we demonstrate a method for encoding multiple compaction problems into a single, composite parallel scan. This technique yields a 2.5x speedup over bitonic sorting networks for small problem instances, i.e., sequences that can be entirely sorted within the shared memory local to a single GPU core.
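To make the abstract's idea of folding several compaction problems into one scan a little more concrete, here is a minimal sequential sketch in Python. It is not the paper's GPU implementation. The 2-bit digit size, the 16-bit packed counter fields, and the names `exclusive_scan`, `radix_pass`, and `radix_sort` are all assumptions chosen for illustration. Each key contributes a one-hot packed word, so a single prefix scan yields every key's rank within its digit bucket for all four buckets at once.

```python
RADIX_BITS = 2                      # illustrative: 4 buckets per pass
BUCKETS = 1 << RADIX_BITS
FIELD_BITS = 16                     # illustrative: width of each packed counter
FIELD_MASK = (1 << FIELD_BITS) - 1


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] is the sum of values[0..i-1]."""
    out = [0] * len(values)
    running = 0
    for i, v in enumerate(values):
        out[i] = running
        running += v
    return out


def radix_pass(keys, shift):
    """One stable pass over RADIX_BITS bits, driven by one composite scan."""
    digits = [(k >> shift) & (BUCKETS - 1) for k in keys]

    # One-hot packed word per key: a 1 in the counter field of its digit.
    flags = [1 << (d * FIELD_BITS) for d in digits]

    # A single scan over the packed words computes, for every key, how many
    # earlier keys fell into each of the BUCKETS digits.
    ranks = exclusive_scan(flags)
    total = ranks[-1] + flags[-1] if keys else 0

    # Unpack per-digit totals and scan them to get each bucket's base offset.
    counts = [(total >> (b * FIELD_BITS)) & FIELD_MASK for b in range(BUCKETS)]
    bases = exclusive_scan(counts)

    # Scatter: bucket base plus the key's rank within its own bucket.
    out = [0] * len(keys)
    for k, d, r in zip(keys, digits, ranks):
        rank_in_bucket = (r >> (d * FIELD_BITS)) & FIELD_MASK
        out[bases[d] + rank_in_bucket] = k
    return out


def radix_sort(keys, key_bits=32):
    """Least-significant-digit radix sort of non-negative integers."""
    if len(keys) > FIELD_MASK:
        raise ValueError("too many keys for the packed counter width")
    for shift in range(0, key_bits, RADIX_BITS):
        keys = radix_pass(keys, shift)
    return keys
```

For example, `radix_sort([170, 45, 75, 90, 802, 24, 2, 66])` returns the keys in ascending order. In this sketch the scan is a plain loop; on a GPU it would be the parallel step, and the counter width would have to be chosen so that no field can overflow for the tile size being sorted.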