In the Rules of Thumb post, I argued that many of the standard engineering rules of thumb are changing. On a closely related point, Nishant Dani and Vlad Sadovsky both pointed me towards The Landscape of Parallel Computing Research: A View from Berkeley by David Patterson et al. Dave Patterson is best known for foundational work on RISC and for co-inventing RAID. He has an amazing ability to spot problems that are both worth solving and near a practical solution, and then to deliver that solution. This paper has many co-authors but shows some of that same style. It focuses on parallel systems and on conventional wisdom that has driven systems design for some time but is no longer correct. The Berkeley web site with more detail is at: http://view.eecs.berkeley.edu/wiki/Main_Page.
In the paper they argue that 13 computational kernels can be used to characterize most workloads. They then observe that over half of these kernels are memory bound today, with more expected to be in the future. In effect, the problem is getting data up the storage and memory hierarchy to the processors, not the speed of the processors themselves. This has been true for years and the problem worsens each year, and yet it still seems to get less focus than scaling processor speeds, even though the latter won't help without the former.
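A rough sketch of what "memory bound" means in practice: a loop doing dependent loads scattered across a large array spends its time waiting on the memory hierarchy, while a loop doing arithmetic on values that stay in cache does not. The sketch below is in Python, so interpreter overhead mutes the effect considerably compared to native code, and the 2^22-element array size is my assumption, chosen to exceed typical cache sizes:

```python
import random
import time

N = 1 << 22  # ~4M elements; assumed large enough to spill out of cache

# Compute-bound loop: repeated multiplies on one value that stays in a register/cache.
start = time.perf_counter()
x = 1.0
for _ in range(N):
    x = x * 1.0000001
compute_s = time.perf_counter() - start

# Memory-bound loop: chase a randomly shuffled permutation of indices.
# Each load depends on the previous one, which defeats prefetching.
order = list(range(N))
random.shuffle(order)
start = time.perf_counter()
total = 0
i = 0
for _ in range(N):
    i = order[i]
    total += i

memory_s = time.perf_counter() - start

print(f"compute-bound loop: {compute_s:.3f}s, memory-bound loop: {memory_s:.3f}s")
```

The same two loops in C, with the array sized well beyond last-level cache, show the gap the paper describes: hundreds of cycles per dependent load versus a handful per multiply.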
If you are interested in parallel systems, it’s worth reading the paper. I’ve included the key changes in conventional wisdom below:
1. Old CW: Power is free, but transistors are expensive.
· New CW is the "Power wall": Power is expensive, but transistors are "free". That is, we can put more transistors on a chip than we have the power to turn on.
2. Old CW: If you worry about power, the only concern is dynamic power.
· New CW: For desktops and servers, static power due to leakage can be 40% of total power. (See Section 4.1.)
3. Old CW: Monolithic uniprocessors in silicon are reliable internally, with errors occurring only at the pins.
· New CW: As chips drop below 65 nm feature sizes, they will have high soft and hard error rates. [Borkar 2005] [Mukherjee et al 2005]
4. Old CW: By building upon prior successes, we can continue to raise the level of abstraction and hence the size of hardware designs.
· New CW: Wire delay, noise, cross coupling (capacitive and inductive), manufacturing variability, reliability (see above), clock jitter, design validation, and so on conspire to stretch the development time and cost of large designs at 65 nm or smaller feature sizes. (See Section 4.1.)
5. Old CW: Researchers demonstrate new architecture ideas by building chips.
· New CW: The cost of masks at 65 nm feature size, the cost of Electronic Computer Aided Design software to design such chips, and the cost of design for GHz clock rates means researchers can no longer build believable prototypes. Thus, an alternative approach to evaluating architectures must be developed. (See Section 7.3.)
6. Old CW: Performance improvements yield both lower latency and higher bandwidth.
· New CW: Across many technologies, bandwidth improves by at least the square of the improvement in latency. [Patterson 2004]
7. Old CW: Multiply is slow, but load and store is fast.
· New CW is the "Memory wall" [Wulf and McKee 1995]: Load and store is slow, but multiply is fast. Modern microprocessors can take 200 clocks to access Dynamic Random Access Memory (DRAM), but even floating-point multiplies may take only four clock cycles.
8. Old CW: We can reveal more instruction-level parallelism (ILP) via compilers and architecture innovation. Examples from the past include branch prediction, out-of-order execution, speculation, and Very Long Instruction Word systems.
· New CW is the "ILP wall": There are diminishing returns on finding more ILP. [Hennessy and Patterson 2007]
9. Old CW: Uniprocessor performance doubles every 18 months.
· New CW is Power Wall + Memory Wall + ILP Wall = Brick Wall. Figure 2 plots processor performance for almost 30 years. In 2006, performance is a factor of three below the traditional doubling every 18 months that we enjoyed between 1986 and 2002. The doubling of uniprocessor performance may now take 5 years.
10. Old CW: Don’t bother parallelizing your application, as you can just wait a little while and run it on a much faster sequential computer.
· New CW: It will be a very long wait for a faster sequential computer (see above).
11. Old CW: Increasing clock frequency is the primary method of improving processor performance.
· New CW: Increasing parallelism is the primary method of improving processor performance. (See Section 4.1.)
12. Old CW: Less than linear scaling for a multiprocessor application is failure.
· New CW: Given the switch to parallel computing, any speedup via parallelism is a success.
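The last three points can be made concrete with Amdahl's law, which the paper's authors also lean on: if any fraction of a program stays serial, that fraction caps the speedup no matter how many processors you add, so less-than-linear scaling is the expected outcome, not a failure. A quick sketch (the 10% serial fraction is my illustrative assumption, not a figure from the paper):

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Amdahl's law: overall speedup when only the parallel portion scales."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

# With 10% of the work serial, speedup saturates well below processor count:
for n in (2, 4, 16, 64, 1024):
    print(f"{n:5d} processors -> {amdahl_speedup(0.10, n):.2f}x")
```

With a 10% serial fraction, 16 processors deliver only about 6.4x, and even unbounded parallelism can never exceed 10x. By the new conventional wisdom, that 6.4x is still a win.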
James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | JamesRH@microsoft.com
H:mvdirona.com | W:research.microsoft.com/~jamesrh | blog:http://perspectives.mvdirona.com