Web Search Using Small Cores

I recently came across an interesting paper that is currently under review for ASPLOS. I liked it for two unrelated reasons: 1) the paper covers the Microsoft Bing search engine architecture in more detail than I’ve seen previously released, and 2) it clearly covers the problems of scaling workloads down to low-power commodity cores. I particularly like that it uses important, real production workloads rather than workload models or simulations, and uses that base to investigate an important problem: when can we scale workloads down to low-power processors, and what are the limiting factors?

The paper: Web Search Using Small Cores: Quantifying the Price of Efficiency.

Low Power Project Team Site: Gargoyle: Software & Hardware for Energy Efficient Computing

I’ve been very interested in the application of commodity, low-power processors to service workloads for years and wrote up some of the work done during 2008 for the Conference on Innovative Data Systems Research in the paper CEMS: Low-Cost, Low-Power Servers for Internet-Scale Services and the accompanying presentation, as well as in several blog entries since that time.

This paper uses an Intel Atom as the low-powered, commodity processor under investigation and compares it with an Intel Harpertown. It would have been better to use Intel Nehalem as the server processor for comparison; Nehalem is a dramatic step forward in power/performance over Harpertown. But using Harpertown didn’t change any of the findings reported in the paper, so it’s not a problem.

On the commodity, low-power end, Atom is a wonderful processor, but current memory controllers on Atom boards support neither ECC nor more than 4 gigabytes of memory. I would love to use Atom in server designs, but all the data I’ve gathered argues strongly that no server workload should be run without ECC. Intel clearly has memory management units with the appropriate capabilities, so it’s obviously not a technical problem that leaves Atom without ECC. The low-powered AMD part used in CEMS does include ECC, as does the ARM part I mentioned in the recent blog entry ARM Cortex-A9 SMP Design Announced.

Most “CPU bound” workloads are actually not CPU bound but limited by memory. The CPU will report busy, but it is actually spending most of its time in memory wait states. How can you tell whether your workload is memory bound or CPU bound? Look at cycles per instruction (CPI), the number of cycles each instruction takes. Superscalar processors should be dispatching many instructions per cycle (CPI << 1.0), but memory wait states on most workloads tend to limit CPI to over 1. Branch-intensive workloads that touch large amounts of memory tend to have high CPIs, whereas cache-resident workloads will have very low CPIs, potentially less than 1. I’ve seen operating system code with CPIs of more than 7 and database systems in the 2.4 range. More optimistic folks than I tend to look at the reciprocal of CPI, instructions per cycle (IPC), but it’s the same data. See my Rules of Thumb post for more discussion of CPI.
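As a rough illustration of the arithmetic, here is a minimal sketch of the CPI test. The counter values are hypothetical; in practice they would come from hardware performance counters (on Linux, typically via a tool such as perf stat):

    # Minimal sketch: deciding whether a workload looks memory bound from
    # hardware counter totals. The numbers below are hypothetical.

    cycles = 12_000_000_000        # unhalted CPU cycles over the sample window
    instructions = 5_000_000_000   # instructions retired over the same window

    cpi = cycles / instructions    # cycles per instruction
    ipc = 1.0 / cpi                # instructions per cycle (the reciprocal)

    print(f"CPI = {cpi:.2f}, IPC = {ipc:.2f}")

    # Rule of thumb from the discussion above: a superscalar core that is
    # truly compute bound should be well under CPI 1.0; commercial server
    # workloads stuck in memory wait states commonly sit at CPI 2.0 or higher.
    if cpi >= 2.0:
        print("Likely memory bound: the CPU reports busy but mostly waits on memory.")
    elif cpi <= 1.0:
        print("Compute intensive: the core retires one or more instructions per cycle.")
    else:
        print("Mixed: some memory stalls, but meaningful instruction-level parallelism.")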

In Figure 1, the paper shows the instructions per cycle (IPC, which is 1/CPI) of Apache, MySQL, JRockit, DBench, and Bing. As I mentioned above, if you give server workloads sufficient disk and network resources, they typically become memory bound. A CPI of 2.0 or greater is typical of commercial server workloads, and well over 3.0 is common. As we expected, all the public server workloads in Figure 1 are right around a CPI of 2.0 (IPC roughly equal to 0.5). Bing is the exception with a CPI of nearly 1.0. This means that Bing is almost twice as computationally intensive as typical server workloads. This is an impressively good CPI and makes this workload particularly hard to run on low-power, low-cost, commodity processors. The authors’ choice of this very difficult workload draws out the challenges of scaling down server workloads and makes the paper better. We need to keep in mind that most server workloads, in fact nearly all of them, are a factor of 2 less computationally intensive and therefore easier to host on low-powered servers.
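To make the factor-of-two point concrete, here is a back-of-the-envelope sketch. The clock rate is a made-up illustrative value, not a number from the paper; only the CPI ratio comes from the discussion above:

    # Back-of-the-envelope sketch of why a CPI near 1.0 is harder to scale down.
    # The clock rate is illustrative, not taken from the paper.

    clock_hz = 2.0e9               # hypothetical core clock
    typical_cpi = 2.0              # common commercial server workload
    search_like_cpi = 1.0          # the search workload discussed above

    # Instructions retired per second on the same core:
    typical_ips = clock_hz / typical_cpi       # 1.0e9 instructions/sec
    search_ips = clock_hz / search_like_cpi    # 2.0e9 instructions/sec

    print(f"Typical workload:     {typical_ips / 1e9:.1f} B instructions/sec")
    print(f"Search-like workload: {search_ips / 1e9:.1f} B instructions/sec")
    print(f"Relative compute demand: {search_ips / typical_ips:.1f}x")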

The lessons I took from the paper: unsurprisingly, Atom showed much better power/performance than Harpertown but offered considerably less performance headroom. Conventional server processors are capable of very high-powered bursts of performance but typically operate in lower performance states; when you need to run a short, computationally intensive segment of code, the performance is there. Low-power processors operate in steady state much nearer to the limits of their capabilities. The good news is they operate nearly an order of magnitude more efficiently than the high-powered server processors, but they don’t have the ability to deliver computational bursts at the same throughput.

Given that low-powered processors are cheap, over-provisioning is the obvious first solution: add more processors and run them at lower average utilization so there is headroom to process computationally intensive code segments without slowdown. Over-provisioning helps with throughput and provides the headroom to handle computationally intensive code segments, but it doesn’t help with the latency problem. More cores will help most server workloads but, on those with both very high computational intensity (CPI near 1 or lower) and very low latency requirements, only fast cores can fully address the problem. Fortunately, these workloads are not the common case.
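A toy model makes the throughput-versus-latency distinction clear. All of the per-core rates below are hypothetical, chosen only to show the shape of the tradeoff:

    # Toy model: over-provisioning with low-power cores raises aggregate
    # throughput but does nothing for the serial latency of a single
    # computationally intensive code segment. All numbers are hypothetical.

    segment_instructions = 2.0e9     # work in one latency-critical code segment
    fast_core_ips = 4.0e9            # instructions/sec a big server core can burst to
    slow_core_ips = 1.0e9            # instructions/sec a low-power core sustains

    def segment_latency(ips):
        """Latency of one serial segment: core count does not appear at all."""
        return segment_instructions / ips

    def aggregate_throughput(ips, cores):
        """Throughput scales with core count (assuming perfectly parallel load)."""
        return ips * cores

    print(f"Latency on one fast core: {segment_latency(fast_core_ips):.2f} s")
    print(f"Latency on one slow core: {segment_latency(slow_core_ips):.2f} s")

    # Four slow cores match one fast core on throughput...
    print(f"Throughput, 1 fast core:  {aggregate_throughput(fast_core_ips, 1) / 1e9:.0f} B inst/s")
    print(f"Throughput, 4 slow cores: {aggregate_throughput(slow_core_ips, 4) / 1e9:.0f} B inst/s")
    # ...but the slow-core latency above is still 4x worse for that serial segment.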

Another thing to keep in mind: if you improve the price/performance and power/performance of processors greatly, other server components begin to dominate. I like to look at extremes to understand these factors. What if the processor were free and consumed zero power? The power consumption of memory and glue chips would dominate, and the cost of all the other components would put a floor on the server cost. This argues for at least 4 server design principles: 1) memory is on track to be the biggest problem, so we need low-cost, power-efficient memories, 2) very large core counts help amortize the cost of all the other server components and help manage the peak performance problem, 3) as the cost of the server is scaled down, it makes sense to share some components such as power supplies, and 4) servers will never be fully balanced (all resources consumed equally) for all workloads, so we’ll need the ability to take resources to low-power states or even to depower them. Intel Nehalem does some of this latter point, and mobile phone processors like ARM are masters of it.
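Here is the free-processor thought experiment worked through as a sketch. Every component cost and power number is made up for illustration, not taken from any real server bill of materials:

    # Thought experiment from the paragraph above: drive CPU cost and power to
    # zero and see what floor the rest of the server imposes. All component
    # numbers are made up for illustration.

    components = {
        #                 (cost_usd, power_watts)
        "cpu":            (0.0,      0.0),    # the "free, zero-power" processor
        "memory":         (200.0,    30.0),
        "motherboard":    (150.0,    25.0),   # glue chips, NIC, voltage regulators
        "disk":           (100.0,    10.0),
        "power_supply":   (60.0,     20.0),   # conversion losses, fans
    }

    total_cost = sum(cost for cost, _ in components.values())
    total_power = sum(power for _, power in components.values())

    print(f"Server cost floor with a free CPU: ${total_cost:.0f}")
    print(f"Server power floor with a 0W CPU:  {total_power:.0f} W")

    for name, (cost, power) in components.items():
        print(f"  {name:13s} {power / total_power:5.1%} of power, ${cost:.0f}")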

If you are interested in high-scale search, the application of low-power commodity processors to service workloads, or both, this paper is a good read.

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

4 comments on “Web Search Using Small Cores”
  1. FAWN is GREAT work! I’ve been watching and liking it for the last year. I should, and probably will, blog it in the near future.

    Thanks,

    –jrh
    jrh@mvdirona.com

  2. Anil Edakkunni says:

    FAWN won a Best Paper award at SOSP 09.

  3. I wrote IPC of nearly 1 when I meant CPI in the comment about Bing. Ironically, when near 1, both CPI and IPC are nearly the same but I made the update in the text above. Thanks Mihai.

    James Hamilton
    jrh@mvdirona.com

  4. Mihai Budiu says:

    Figure 1 shows the normalized IPC; the caption claims that "our version of search experiences more instruction level parallelism with an unnormalized IPC noticeably greater than one".
