ISCA 2009 Keynote I: How to Waste a Parallel Computer — Kathy Yelick

Title: Ten Ways to Waste a Parallel Computer

Speaker: Katherine Yelick

An excellent keynote talk at ISCA 2009 in Austin this morning. My rough notes follow:

· Moore’s law continues

o Frequency growth replaced by core count growth

· HPC has been working on parallelism for more than a decade, but the HPC community is concerned about this shift as well

· New World Order

o Performance through parallelism

o Power is the overriding h/w concern

o Performance is now a software concern

· What follows are Yelick’s top 10 ways to waste a parallel computer

· #1: Build a system with insufficient memory bandwidth

o Multicore puts us on the wrong side of the memory wall

o Key metrics to look at:

§ Memory size/bandwidth (time to fill memory)

§ Memory size * arithmetic intensity / ops-per-sec (time to process memory); a worked example of both metrics follows below
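
A quick worked example of those two balance metrics. The memory size, bandwidth, flop rate, and arithmetic intensity below are illustrative assumptions, not numbers from the talk:

/* Sketch of the two memory-balance metrics with made-up machine parameters. */
#include <stdio.h>

int main(void) {
    double mem_bytes  = 16e9;   /* 16 GB of DRAM (assumed)            */
    double bw_bytes_s = 25e9;   /* 25 GB/s memory bandwidth (assumed) */
    double flops_s    = 100e9;  /* 100 Gflop/s peak (assumed)         */
    double intensity  = 0.5;    /* flops per byte for the kernel      */

    /* Time to fill memory: memory size / bandwidth */
    double t_fill = mem_bytes / bw_bytes_s;

    /* Time to process memory: memory size * arithmetic intensity / flop rate */
    double t_proc = mem_bytes * intensity / flops_s;

    printf("time to fill memory:    %.2f s\n", t_fill);
    printf("time to process memory: %.2f s\n", t_proc);
    /* If t_proc is much smaller than t_fill, the kernel is bandwidth-bound
     * on this machine: the cores can process data faster than DRAM delivers it. */
    return 0;
}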

· #2: Don’t Take Advantage of hardware performance features

o Showed an example of the speedup from tuning a nearest-neighbor 7-point stencil on a 3D array

o Huge gains, but hard to do by hand. Need to do it automatically at code-generation time. (A naive baseline version of the stencil is sketched below.)
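
For reference, an untuned 7-point stencil sweep over a 3D array looks roughly like the sketch below; the tuned variants in the talk add blocking, prefetching, SIMD, and similar transformations on top of this baseline. The grid size and coefficients here are placeholders:

/* Naive 7-point stencil on a 3D array: each output point combines the point
 * itself and its six nearest neighbors. N is an assumed grid dimension. */
#include <stddef.h>

#define N 128
#define IDX(i,j,k) ((size_t)(i)*N*N + (size_t)(j)*N + (size_t)(k))

void stencil_7pt(const double *in, double *out, double alpha, double beta) {
    for (int i = 1; i < N-1; i++)
        for (int j = 1; j < N-1; j++)
            for (int k = 1; k < N-1; k++)
                out[IDX(i,j,k)] =
                    alpha * in[IDX(i,j,k)] +
                    beta  * (in[IDX(i-1,j,k)] + in[IDX(i+1,j,k)] +
                             in[IDX(i,j-1,k)] + in[IDX(i,j+1,k)] +
                             in[IDX(i,j,k-1)] + in[IDX(i,j,k+1)]);
}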

· #3: Ignore Little’s Law

o Required concurrency = bandwidth * latency (worked through below)

o Observation is that most apps are running at WAY less than full memory bandwidth [jrh: this isn’t because these apps aren’t memory bound. They are waiting on memory with small requests; essentially they are memory-request latency bound rather than bandwidth bound. They need larger requests or more outstanding requests]

o To make effective use of the machine, you need:

§ S/W prefetch

§ Pass memory around caches in some cases
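
A back-of-the-envelope Little’s Law calculation makes the concurrency requirement concrete. The bandwidth and latency figures below are assumed for illustration, not taken from the talk:

/* Little's Law: data that must be in flight to saturate memory bandwidth. */
#include <stdio.h>

int main(void) {
    double bw_bytes_s = 20e9;    /* 20 GB/s sustained bandwidth (assumed) */
    double latency_s  = 100e-9;  /* 100 ns memory latency (assumed)       */
    double line_bytes = 64.0;    /* cache line size                       */

    double bytes_in_flight = bw_bytes_s * latency_s;   /* bandwidth * latency */
    double lines_in_flight = bytes_in_flight / line_bytes;

    printf("bytes in flight needed: %.0f\n", bytes_in_flight);
    printf("cache lines in flight:  %.0f\n", lines_in_flight);
    /* Roughly 31 outstanding 64B lines here -- far more than a core will
     * issue on its own without s/w prefetch or many independent streams. */
    return 0;
}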

· #4: Turn functional problems into performance problems

o Fault resilience introduces inhomogeneity in execution rates

o Showed a graph of ECC recovery rates (recoveries are very common) and noted that recovery times are substantial, so the increased latency of correction substantially slows the computation. [jrh: more evidence that non-ECC designs such as the current Intel Atom are not workable in server applications. Given ECC correction rates, I’m increasingly becoming convinced that non-ECC client systems don’t make sense.]

· #5: Over-Synchronize Applications

o View parallel executions as directed acyclic graphs of the computation

o Hiding parallelism in a library tends to over-serialize (too many barriers)

o Showed work from Jack Dongarra on PLASMA as an example (a simplified barrier-vs-task-DAG sketch follows below)
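
To make the over-serialization point concrete, here is a simplified sketch (not the PLASMA code) contrasting barrier-style loops with a task DAG, using OpenMP task dependences as one possible mechanism. step_a and step_b are hypothetical per-block kernels:

#include <omp.h>

void step_a(double *blk);   /* hypothetical kernels operating on one block */
void step_b(double *blk);

/* Barrier style: the implicit barrier after each parallel loop forces ALL
 * blocks to finish step_a before ANY block may start step_b. */
void barrier_version(double **blk, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) step_a(blk[i]);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) step_b(blk[i]);
}

/* DAG style: step_b(blk[i]) depends only on step_a(blk[i]), so the runtime
 * can overlap work across blocks instead of serializing at global barriers. */
void dag_version(double **blk, int n) {
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < n; i++) {
        #pragma omp task depend(out: blk[i][0])
        step_a(blk[i]);
        #pragma omp task depend(in: blk[i][0])
        step_b(blk[i]);
    }
}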

· #6: Over-synchronize Communications

o Use a programming model in which you can’t utilize the available bandwidth or achieve low latency

o As an example, compared GASNet and MPI, with GASNet delivering far higher bandwidth (a one-sided communication sketch follows below)
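
GASNet is a one-sided communication layer. As a rough illustration of why one-sided transfers can use bandwidth better, the sketch below uses MPI’s one-sided (RMA) interface rather than GASNet itself: the initiator writes remote memory without the target posting a matching receive. Window creation and overall application synchronization are assumed to be handled elsewhere:

/* One-sided put: no matching receive or rendezvous on the target side. */
#include <mpi.h>

void put_example(double *local, MPI_Win win, int target, int n) {
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Put(local, n, MPI_DOUBLE,        /* origin buffer                 */
            target, 0, n, MPI_DOUBLE,    /* target rank, displacement, count */
            win);
    MPI_Win_unlock(target, win);         /* completes the transfer        */
}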

· #7: Run Bad Algorithms

o Algorithmic gains have far outstripped Moore’s law over the last decade

o Examples: 1) adaptive meshes rather than uniform ones, 2) sparse matrices rather than dense, and 3) going back to basics and reformulating the problem.

· #8: Don’t rethink your algorithms

o Showed examples of sparse iterative methods and the optimizations possible (the core SpMV kernel is sketched below)
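
The kernel at the heart of most sparse iterative methods is sparse matrix-vector multiply; a minimal CSR version is sketched below as a baseline. The tuned versions alluded to in the talk go well beyond this:

/* y = A*x for an n-row matrix in compressed sparse row (CSR) format. */
#include <stddef.h>

void spmv_csr(int n, const int *rowptr, const int *colidx,
              const double *val, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        /* rowptr[i]..rowptr[i+1] bound the nonzeros of row i */
        for (int j = rowptr[i]; j < rowptr[i+1]; j++)
            sum += val[j] * x[colidx[j]];
        y[i] = sum;
    }
}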

· #9: Choose “hard” applications

o Examples of such problem classes:

§ Elliptic: steady state, global space dependence

§ Hyperbolic: time dependent, local space dependence

§ Parabolic: time dependent, global space dependence

o There is often no choice – we can’t just ignore hard problems

· #10: Use heavy-weight cores optimized for serial performance

o Used Power5 as an example of a poor design by this measure and showed a stack of “better” performance/power designs (a crude MHz-per-watt comparison of the listed parts follows below)

§ Power5:

· 389 mm^2

· 120W @ 1900 MHz

§ Intel Core2 (single core)

· 130 mm^2

· 15W @ 1000 MHz

§ PowerPC450 (BlueGene/P)

· 8mm^2

· 3W @ 850 MHz

§ Tensilica (cell phone processor)

· 0.8mm^2

· 0.09W @ 650 MHz

o [jrh: This last point is not nearly well enough understood. Far too many systems are purchased on performance when they should be purchased on work done per $ and work done per joule.]
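
Using just the clock and power figures listed above, a crude MHz-per-watt comparison shows the ordering the talk was making. Clock rate is only a rough proxy for work done, so treat the absolute numbers as illustrative:

/* MHz per watt from the figures quoted in the talk notes above. */
#include <stdio.h>

int main(void) {
    struct { const char *name; double mhz, watts; } chips[] = {
        { "Power5",               1900.0, 120.0  },
        { "Intel Core2 (1 core)", 1000.0,  15.0  },
        { "PowerPC 450 (BG/P)",    850.0,   3.0  },
        { "Tensilica",             650.0,   0.09 },
    };
    for (int i = 0; i < 4; i++)
        printf("%-22s %8.0f MHz/W\n",
               chips[i].name, chips[i].mhz / chips[i].watts);
    return 0;   /* ordering: ~16, ~67, ~283, ~7200 MHz/W */
}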

· Note: Large scale machines have 1 unrecoverable memory error (UME) per day [jrh: again more evidence that no-ECC server designs such as current Intel Atom boards simply won’t be acceptable in server applications, nor embedded, and with memory sizes growing evidence continues to mount that we need to move to ECC on client machines as well]

· The HPC community’s experience shows that parallelism is key, but serial performance can’t be ignored.

· Each factor-of-10 increase in performance tends to require an algorithmic rethink

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com

H:mvdirona.com | W:mvdirona.com/jrh/work | blog:http://perspectives.mvdirona.com
