Rules of thumb help us understand complex systems at a high level. Examples are that high-performance server disks will do roughly 180 IOPS, or that an enterprise system administrator can manage roughly 100 systems. These numbers ignore important differences between workloads and therefore can’t be precise, but they serve as a quick check. They ignore, for example, that web servers are MUCH easier to administer than database servers. Whenever I’m looking at a new technique, algorithm, or approach, I always start with the relevant rules of thumb and note the matches and differences. Where it differs, I look deeper. Sometimes I find great innovation and learn that the rules need to be updated to take into account the efficiencies of the new approach. But, more frequently, I find an error in the data, an incorrect measurement technique, an algorithm that only works over a narrow set of workloads, or some other restriction. Basically, a good repertoire of rules of thumb is useful in helping to find innovation and quickly spot mistakes.
Everyone does this at some level, although often they aren’t using formal rules but more of an informal gut feel. This gut feel helps people avoid mistakes and move more quickly without having to stare at each new idea and understand it from first principles. But there is a danger. Every so often, the rules change and if you don’t update your “gut feel” you’ll miss opportunities and new innovations.
Over the years, I’ve noticed that the duration from the first breakthrough idea on a topic to it actually making sense and having broad applicability is 7 to 10 years. The earliest research work is usually looking beyond current conditions, and when the ideas are first published, we usually don’t know how to employ them, don’t yet have efficient algorithms, or find that the set of problems solved by the new ideas is not yet sufficiently broad. It usually takes 7 to 10 years to refine and generalize an idea to the point where it is correct and broadly usable.
Now you would think that once an idea “makes sense” broadly, once it’s been through its 7 to 10 years of exile, it would be ready for broad deployment. Ironically, there is yet one more delay, and this one has been the death of many startups and a great many development projects. Once an idea is clearly “correct”, applicable, and broadly generalized, enterprise customers still typically won’t buy it for 5 to 7 years. What happens is that the new idea, product, or approach violates their rules of thumb and they simply won’t buy it until the evidence builds over time and they begin to understand that the rules have changed. Some examples:
· Large memories: In the early ’90s it became trivially true that very large memories were the right answer for server-side workloads, especially database workloads. The combination of large SMPs and rapidly increasing processor performance coupled with lagging I/O performance and falling memory prices made memory a bargain. You couldn’t afford not to buy large memories, and yet many customers I was working with at the time were much more comfortable buying more disk and more CPU even though it was MORE expensive and less effective than adding memory. They were trapped in their old rules of thumb on memory cost vs. value.
· Large SMPs: In the late ’80s customers were still spending huge sums of money buying very high-end, water-cooled ECL mainframes when they should have been buying the emerging commodity UNIX SMPs and saving a fortune. It takes a while for customers and the market as a whole to move between technologies.
· Large clusters: In the late ’90s and, to a lesser extent, even to this day, customers often buy very large SMPs when they should be buying large clusters of commodity servers. Sure, they have to rewrite software systems to run in this environment but, even with those costs, mammoth savings are possible in large deployments. It’ll take time before they are comfortable and it’ll take time before they have software that’ll run in the new environment.
Basically, once an idea becomes “true” it still has 5 to 7 more years before it’s actually in broad use. What do we learn from this? First, it’s very easy to be early and to jump on an idea before its time. The dot-com era was full of failures built on ideas that will actually succeed in the hands of new startups over the next couple of years (the 5 to 7 year delay). It’s hard to have the right amount of patience, and it’s very hard to sell new ideas when they violate the current, commonly held rules of thumb. The second thing we learn is to check our rules of thumb more frequently and to get good at challenging the rules of thumb, or gut feel, of others when trying to get new ideas adopted. Understand that some of our rules might no longer be true and that some of the rules of thumb used by the person you are speaking with may be outdated.
Here are four rules of thumb, all of which were inarguably true at one point in time, and each is either absolutely not true today or on the way to being broken:
· Compression is a lose in an OLTP system: This is a good place to start since compression is a clear and obvious win today. Back in 1990 or thereabouts I argued strongly against adding compression to DB2 UDB (where I was lead architect at the time). At the time, a well-tuned OLTP system was CPU bound and sufficient I/O devices had been added to max out the CPU. The valuable resource was CPU in that you could always add more disk (at a cost) but you couldn’t just add more CPUs. At the time, 4-way to 8-way systems were BIG and CPUs were 100x slower than they are today. Under those conditions, it would have been absolutely nuts to trade off CPU for a reduction in I/O costs for the vast majority of workloads. Effectively we would be getting help with a solvable problem and, in return, accepting more of an unsolvable problem. I was dead against it at the time and we didn’t do it then. Today, compression is so obviously the right answer it would be nuts not to do it. Most very large-scale services are running their OLTP systems over clusters of many database servers. They have CPU cycles to burn but I/O is what they are short of and where the costs are. Any trick that can reduce I/O consumption is worth considering, and compression is an obvious win (a back-of-envelope model of this trade-off appears after these examples). In fact, compression now makes sense higher up the memory hierarchy, and there are times when it makes sense to leave data compressed in memory, decompressing it only when needed rather than when it is first brought in from disk. Compression is obviously a win in high-end OLTP systems and beyond and, as more customers move to clusters and multi-core systems, this will just keep getting clearer.
· Bottlenecks are either I/O, CPU, or inter-dispatch-unit contention: I love performance problems and have long used the magic three as a way of getting a high-level understanding of what’s wrong with a complex system that is performing poorly. The first thing I try to understand when looking at a poorly performing system is whether the system is CPU bound, I/O bound (network, disk, or UI if there is one), or contention bound (processes or threads blocked on other processes or threads – basically, excess serialization). Looking at these three has always been a gross simplification, but it’s been a workable and useful simplification and has many times guided me to the source of the problem quickly. Increasingly, though, these three are inadequate to describe what’s wrong with a slow system, because memory bandwidth/contention is becoming the blocker. Now, in truth, memory bandwidth has always been a factor but it typically wasn’t the biggest problem. Cache-conscious algorithms attempt to solve the memory bandwidth/contention problem, but how do you measure whether you have a problem or not? How do you know if the memory and/or cache hierarchy is the primary blocking factor?
The simple answer is actually very near to the truth but not that helpful: it probably IS the biggest problem. Memory bandwidth/contention is becoming at least the number two problem for many apps, behind only I/O contention. And, for many, it’s the number one performance issue. How do we measure it quickly and easily and understand the magnitude of the problem? I use cycles per instruction as an approximation. CPUs are mostly superscalar, which means they can retire (complete) more than one instruction per clock cycle. What I like to look at is the rate at which instructions are retired. This gives a rough view of how much time is being wasted in pipeline stalls, most of which are caused by waiting on memory. As an example to give a view of the magnitude of this problem, note that most CPUs I’ve worked with over the last 15 years can execute more than one instruction per cycle and yet I’ve NEVER seen it happen running a data-intensive, server-side workload. Good database systems will run in the 2.0 to 2.5 cycles per instruction (CPI) range, and I’ve seen operating system traces that were as bad as 7.5 cycles per instruction. Instead of executing multiple instructions per cycle, we are typically only executing a fraction of an instruction each cycle. So, another rule of thumb is changing: you now need to look at the CPI of your application in addition to the big three if you care deeply about performance (a small sketch of this check appears after these examples).
· CPU cycles are precious: This has always been true and is still true today, but it’s becoming less true fast. 10 years ago, most systems I used, whether clients or servers, were CPU bound. It’s rare to find a CPU-bound client machine these days. Just about every laptop in the world is now I/O bound. Servers are quickly going the same route and many are already there. CPU performance increases much faster than I/O performance, so this imbalance will continue to worsen. As the number of cores per socket climbs, this imbalance will grow at an accelerated pace. We talked above about compression making sense today. That’s because CPU cycles are no longer the precious resource. Instead of worrying about path length, we should be spending most of our time reducing I/O and memory references. CPU cycles are typically not the precious resource, and multi-core will accelerate the devaluation of CPU resources relative to memory and I/O resources.
· Virtual Machines can’t be used in services: I chose this as the last example as it is something that I’ve said in the last 12 months, and yet it’s becoming increasingly clear that this won’t stay true for long. It’s worth looking more closely at this one. First, the argument behind virtual machines not making sense in the services world is this: when you are running 10s to 1000s of systems of a given personality, why mix personalities on the same box? Just write the appropriate image to as many systems as you need and run only a single application personality per server. To get the advantages of dynamic resource management often associated with using virtual machines, take a system running one server personality and re-image it to another. It’s easy to move the resources between roles. In effect you get all the advantages without the performance penalty of virtualization. And, by the way, it’s the virtualization penalty that really drives this point home. I/O-intensive workloads often run as badly as 1/5 the speed (20% of native) when running in a virtual machine, and 1/2 the throughput is a common outcome. The simple truth is that the benefits of using VMs in high-scale services are too easy to obtain using other, not quite as elegant techniques, and the VM overhead at scale is currently unaffordable. Who could afford to take a 20,000-system cluster and, post-virtualization, have to double the server count to 40,000? It’s simply too large a bill to pay today. What would cause this to ever become interesting? Why would we ever run virtual machines in a service environment?
At a factor-of-two performance penalty, it’s unlikely to make sense in large services. Hardware is the number one cost in services, well ahead of all others (I’ll post the detailed breakdown sometime), so doubling the server count comes close to adding 25% to the overall service cost (the arithmetic is sketched after these examples). Unacceptable. Where they do make sense:
1. When running non-trusted code, some form of isolation layer is needed. ASP and the .Net Common Language Runtime are viable isolation boundaries but, for running arbitrary code, VMs are an excellent choice. That’s what Amazon EC2 uses for isolation. Another scenario used by many customers is to run lightweight clients and centrally host the client-side systems in the data center. For these types of scenarios, VMs are a great answer.
2. When running very small services. Large services need 100s to 10s of thousands of systems, so sharing servers isn’t very interesting. However, for every megaservice, there are many very small services, some of which could profit from server resource sharing.
3. Hardware independence. One nice way to get real hardware independence, and to enforce it, is to go with a VM. I wouldn’t pay a factor-of-two overhead to get that benefit but, at 10%, it probably would make sense. You could pay for that overhead with the reduction in operational complexity, which lowers costs, brings increased reliability, and allows more flexible and effective use of hardware resources.
As a final observation in support of VMs in the longer term, I note that the resource they waste in the largest quantity today is the resource that is about to be the most plentiful looking forward. Multi-core, many-core, and the continuing divergence of compute performance and I/O performance will make VMs a great decision and a big part of the future service landscape.
This last one is perhaps the most interesting example of a changing rule of thumb in that it’s still true today. In mega-deployments, VMs aren’t worth the cost. However, it looks very unlikely to stay true. As VM overhead is reduced and the value of the squandered CPU resources continues to fall, VMs will look increasingly practical in the service world. It’s a great example of a rule of thumb that is about to be repealed.
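Before wrapping up, here are quick sketches of the arithmetic behind a few of the examples above. First, the compression trade-off. The little Python model below is only a sketch: every utilization and cost number in it is an assumption chosen for illustration, not a measurement. It simply asks whether trading some CPU for reduced I/O leaves the system less constrained on its scarcest resource, which is why the answer was no on a CPU-bound 1990 box and is yes on today’s I/O-bound cluster nodes.

```python
# Back-of-envelope model: does compression help this system?
# All numbers are illustrative assumptions, not measurements.

def compression_wins(io_utilization, cpu_utilization,
                     io_saved_fraction, cpu_added_fraction):
    """Compression trades CPU cycles for reduced I/O.

    io_utilization, cpu_utilization: current load on each resource (0.0-1.0+)
    io_saved_fraction: fraction of I/O eliminated by compressing the data
    cpu_added_fraction: extra CPU load from compressing/decompressing
    Returns True if the scarcest resource is less loaded after compression.
    """
    io_after = io_utilization * (1.0 - io_saved_fraction)
    cpu_after = cpu_utilization + cpu_added_fraction
    return max(io_after, cpu_after) < max(io_utilization, cpu_utilization)

# Circa 1990: a CPU-bound box with enough disks added to max out the CPU.
print(compression_wins(io_utilization=0.60, cpu_utilization=0.95,
                       io_saved_fraction=0.50, cpu_added_fraction=0.15))  # False

# Today: I/O-bound cluster nodes with CPU cycles to burn.
print(compression_wins(io_utilization=0.90, cpu_utilization=0.40,
                       io_saved_fraction=0.50, cpu_added_fraction=0.15))  # True
```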
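Second, the CPI check. This sketch assumes you can read cycle and retired-instruction counts for the interval of interest from your hardware performance counters (via whatever profiler you use); the counter values below are made up for illustration.

```python
# Minimal CPI (cycles per instruction) check.
# The counter values are made up; real ones come from hardware
# performance counters sampled over the measurement interval.

def cycles_per_instruction(cycles, instructions_retired):
    return cycles / instructions_retired

cycles = 45_000_000_000                # unhalted core cycles over the interval
instructions_retired = 18_000_000_000  # instructions retired over the same interval

cpi = cycles_per_instruction(cycles, instructions_retired)
print(f"CPI = {cpi:.2f}")              # 2.50 -- typical of a well-tuned database workload

# Rough interpretation, per the rule of thumb above:
#   CPI below 1.0       -> near the superscalar ideal (rarely seen on server workloads)
#   CPI around 2.0-2.5  -> what good database systems achieve
#   CPI much higher     -> heavy pipeline stalls, usually waits on memory
```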
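Finally, the virtualization cost arithmetic. The hardware share of total service cost below is an assumed figure chosen to line up with the roughly 25% number above; substitute your own cost breakdown.

```python
# If hardware is a fraction `h` of total service cost and virtualization
# overhead forces you to double the server count, total cost grows by
# (1 - h) + 2h = 1 + h, i.e. an increase of h.

def cost_increase_from_doubling_hardware(hardware_fraction):
    return (1.0 - hardware_fraction) + 2.0 * hardware_fraction - 1.0

h = 0.25   # assumption: hardware is ~25% of total service cost
print(f"total cost increase: {cost_increase_from_doubling_hardware(h):.0%}")  # 25%

# On a 20,000-server cluster running at half of native throughput, that is
# 20,000 extra servers and roughly a 25% higher total service cost.
```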
Rules of thumb help us quickly get a read on a new idea, algorithm, or approach. They are a great way to check a new idea for reasonableness. But they don’t stay true forever. Ratios change over time. Make sure you are re-checking your rules every year or so and, when selling new ideas, be aware that the person you are talking to may be operating under an outdated set of rules. Bring the data to show which rules no longer apply, or you’ll be working harder than you need to.
–jrh
James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | JamesRH@microsoft.com