Thursday, October 01, 2009

Microsoft’s Chicago data center was just reported to be online as of July 20th. Data Center Knowledge published an interesting and fairly detailed report in: Microsoft Unveils Its Container-Powered Cloud. 

 

Early industry rumors were that Rackable Systems (now SGI, but mark me down as confused on how that brand change is ever going to help the company) had won the container contract for the lower floor of Chicago. It appears that the Dell Data Center Solutions team now has the business and 10 of the containers are from DCS.

 

The facility is reported to have cost ½ billion dollars and covers 700,000 square feet. The upper floor is a standard data center whereas the lower floor is the world’s largest containerized deployment. Each container holds 2,000 servers and ½MW of critical load. The entire lower floor, when fully populated, will house 112 containers and 224,000 servers.

 

Data Center Knowledge reports:

 

The raised-floor area is fed by a cooling loop filled with 47-degree chilled water, while the container area is supported by a separate chilled water loop running at 65 degrees. Of the facility’s total 30-megawatt power capacity, about 20 megawatts is dedicated to the container area, with about 10 megawatts for the raised floor pods. The power infrastructure also includes 11 power rooms and 11 diesel generators, each providing 2.8 megawatts of potential backup power that can be called upon in the event of a utility outage.

 

Unlike Dublin, which uses a very nice air-side economization design, Chicago is all water cooled with water-side economization but no free air cooling at all.

 

One of the challenges of container systems is container handling. These units can weigh upwards of 50,000 lbs and are difficult to move; the risk of a small mistake by a crane operator is substantial, not to mention the cost of the gantry cranes needed to move them around. The Chicago facility takes a page from advanced material handling and slides the containers on air skates over the polished concrete floor. Just 4 people can move a 2-container stack into place. It’s a very nice approach.

 

The entire facility is reported to be 30MW total load but 112 containers would draw 56MW critical load. So we know the 30MW number is an incremental build-out point rather than the facility's fully built size. Once completed, I would estimate it will be closer to 80MW of critical load and around 110MW of total power (assuming 1.35 PUE).
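
Working through the arithmetic behind that estimate (a minimal sketch; the 80MW critical load and the 1.35 PUE are my assumptions from above, not reported numbers):

containers = 112
mw_per_container = 0.5
container_critical_mw = containers * mw_per_container   # 56 MW, nearly double the 30MW figure

estimated_full_build_critical_mw = 80.0                 # my estimate for the completed facility
assumed_pue = 1.35                                      # assumed, not reported
estimated_total_mw = estimated_full_build_critical_mw * assumed_pue

print(f"Container critical load: {container_critical_mw:.0f} MW")        # 56 MW
print(f"Estimated total facility power: {estimated_total_mw:.0f} MW")    # ~108 MW, roughly 110 MW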

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Thursday, October 01, 2009 6:01:09 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Hardware
 Sunday, September 27, 2009

I recently came across an interesting paper that is currently under review for ASPLOS. I liked it for two unrelated reasons: 1) the paper covers the Microsoft Bing Search engine architecture in more detail than I’ve seen previously released, and 2) it covers the problems with scaling workloads down to low-powered commodity cores clearly. I particularly like the combination of using important, real production workloads rather than workload models or simulations and using that base to investigate an important problem: when can we scale workloads down to low power processors and what are the limiting factors?

 

The paper: Web Search Using Small Cores: Quantifying the Price of Efficiency.

Low Power Project Team Site: Gargoyle: Software & Hardware for Energy Efficient Computing

 

I’ve been very interested in the application of commodity, low-power processors to service workloads for years. I wrote up some of the work done during 2008 for the Conference on Innovative Data Systems Research in the paper CEMS: Low-Cost, Low-Power Servers for Internet-Scale Services and the accompanying presentation, and in several blog entries since that time.

This paper uses an Intel Atom as the low-powered, commodity processor under investigation and compares it with Intel Harpertown. It would have been better to use Intel Nehalem as the server processor for comparison. Nehalem is a dramatic step forward in power/performance over Harpertown. But using Harpertown didn’t change any of the findings reported in the paper, so it’s not a problem.

 

On the commodity, low-power end, Atom is a wonderful processor, but current memory managers on Atom boards support neither ECC nor more than 4 gigabytes of memory. I would love to use Atom in server designs, but all the data I’ve gathered argues strongly that no server workload should be run without ECC. Intel clearly has memory management units with the appropriate capabilities, so it’s obviously not technical problems that leave Atom without ECC. The low-powered AMD part used in CEMS does include ECC, as does the ARM part I mentioned in the recent blog entry ARM Cortex-A9 SMP Design Announced.

 

Most “CPU bound” workloads are actually not CPU bound but limited by memory. The CPU will report busy but is actually spending most of its time in memory wait states. How can you tell if your workload is memory bound or CPU bound? Look at Cycles Per Instruction (CPI), the number of cycles each instruction takes. Superscalar processors should be dispatching multiple instructions per cycle (CPI << 1.0), but memory wait states on most workloads tend to limit CPI to over 1. Branch-intensive workloads that touch large amounts of memory tend to have high CPI, whereas cache-resident workloads will have very low CPI, potentially less than 1. I’ve seen operating system code with a CPI of more than 7 and I’ve seen database systems in the 2.4 range. More optimistic folks than I tend to look at the reciprocal of CPI, instructions per cycle (IPC), but it’s the same data. See my Rules of Thumb post for more discussion of CPI.
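
As a concrete illustration of the metric, here’s a minimal sketch. The cycle and instruction counts would normally come from hardware performance counters; the numbers below are invented purely to show the classification:

def cpi(cycles, instructions):
    """Cycles per instruction; the reciprocal is IPC (instructions per cycle)."""
    return cycles / instructions

# Hypothetical counter readings, for illustration only.
workloads = {
    "cache-resident inner loop": (1.2e9, 2.0e9),    # CPI 0.6 -- mostly CPU bound
    "pointer-chasing server app": (6.0e9, 2.5e9),   # CPI 2.4 -- mostly memory stalls
}

for name, (cycles, instructions) in workloads.items():
    c = cpi(cycles, instructions)
    kind = "likely memory bound" if c > 1.0 else "likely CPU bound"
    print(f"{name}: CPI = {c:.2f} (IPC = {1.0 / c:.2f}) -> {kind}")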

 

In Figure 1, the paper shows the instructions per cycle (IPC, which is 1/CPI) of Apache, MySQL, JRockit, DBench, and Bing. As I mentioned above, if you give server workloads sufficient disk and network resources, they typically become memory bound. A CPI of 2.0 or greater is typical of commercial server workloads and well over 3.0 is common. As we would expect, all the public server workloads in Figure 1 are right around a CPI of 2.0 (IPC roughly equal to 0.5). Bing is the exception with a CPI of nearly 1.0. This means that Bing is almost twice as computationally intensive as typical server workloads. This is an impressively good CPI and makes this workload particularly hard to run on low-power, low-cost, commodity processors. The authors’ choice of this very difficult workload allows them to clearly see the problems of scaling down server workloads and makes the paper better. Effectively, using a difficult workload draws out and makes more obvious the challenges of scaling down workloads to low-power processors. We need to keep in mind that most workloads, in fact nearly all server workloads, are a factor of 2 less computationally intensive and therefore easier to host on low-powered servers.

 

The lessons I got from the paper: unsurprisingly, Atom showed much better power/performance than Harpertown but offered considerably less performance headroom. Conventional server processors are capable of very high-powered bursts of performance but typically operate in lower performance states; when you need to run a short, computationally intensive segment of code, the performance is there. Low-power processors operate in steady state much nearer to their capability limits. The good news is they operate nearly an order of magnitude more efficiently than the high-powered server processors, but they don’t have the ability to deliver the computational bursts at the same throughput.

 

Given that low-powered processors are cheap, over-provisioning is the obvious first solution: add more processors and run them at lower average utilization in order to have the headroom to process computationally intensive code segments without slowdown. Over-provisioning helps with throughput and provides the headroom to handle computationally intensive code segments but doesn’t help with the latency problem. More cores will help most server workloads but, on those with both very high computational intensity (CPI near 1 or lower) and a need for very low latency, only fast cores can fully address the problem. Fortunately, these workloads are not the common case.
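
A small sketch of the over-provisioning trade-off; every number here is hypothetical and only meant to illustrate the shape of the problem:

import math

peak_demand = 1000.0                  # work units/sec during a computationally intense burst
big_core_throughput = 200.0           # hypothetical peak per conventional core
small_core_throughput = 50.0          # hypothetical peak per low-power core
small_core_target_utilization = 0.5   # over-provision: run the small cores half loaded

big_cores_needed = math.ceil(peak_demand / big_core_throughput)
small_cores_needed = math.ceil(peak_demand / (small_core_throughput * small_core_target_utilization))

print(f"conventional cores needed: {big_cores_needed}")     # 5
print(f"low-power cores needed:    {small_cores_needed}")   # 40
# Throughput is covered either way, but the latency of a single computationally
# intensive request is still bounded by the speed of one small core, which
# over-provisioning does nothing to change.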

 

Another thing to keep in mind is that, if you improve the price/performance and power/performance of processors greatly, other server components begin to dominate. I like to look at extremes to understand these factors. What if the processor were free and consumed zero power? The power consumption of memory and glue chips would dominate and the cost of all the other components would put a floor on the server cost. This argues for at least 4 server design principles: 1) memory is on track to be the biggest problem, so we need low-cost, power-efficient memories, 2) very large core counts help amortize the cost of all the other server components and help manage the peak performance problem, 3) as the cost of the server is scaled down, it makes sense to share some components such as power supplies, and 4) servers will never be fully balanced (all resources consumed equally) for all workloads, so we’ll need the ability to take resources to low-power states or even to depower them. Intel Nehalem does some of this latter point and mobile phone processors like ARM are masters of it.

 

If you are interested in high scale search, the application of low-power commodity processors to service workloads, or both, this paper is a good read.

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, September 27, 2009 7:21:58 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Hardware
 Thursday, September 24, 2009

This is 100% the right answer: Microsoft’s Chiller-less Data Center. The Microsoft Dublin data center has three design features I love: 1) they are running evaporative cooling, 2) they are using free-air cooling (air-side economization), and 3) they run up to 95F and avoid the use of chillers entirely. All three of these techniques were covered in the best practices talk I gave at the Google Data Center Efficiency Conference  (presentation, video).

 

Other blog entries on high temperature data center operation:

·  Next Point of Server Differentiation: Efficiency at Very High Temperature

·  Costs of Higher Temperature Data Centers?

·  32C (90F) in the Data Center

 

Microsoft General Manager of Infrastructure Services Arne Josefsberg’s blog entry on the Dublin facility: http://blogs.technet.com/msdatacenters/archive/2009/09/24/dublin-data-center-celebrates-grand-opening.aspx.

 

In a secretive industry like ours, it’s good to see a public example of a high-scale data center running hot and without chillers. Good work Microsoft.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Thursday, September 24, 2009 10:37:27 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Hardware
 Monday, September 21, 2009

Here’s another innovative application of commodity hardware and innovative software to the high-scale storage problem. MaxiScale focuses on 1) scalable storage, 2) distributed namespace, and 3) commodity hardware.

 

Today's announcement: http://www.maxiscale.com/news/newsrelease/092109.

 

They sell software designed to run on commodity servers with direct attached storage. They run N-way redundancy, with a default of 3-way, across storage servers to be able to survive disk and server failure. The storage can be accessed via HTTP or via Linux or Windows (2003 and XP) file system calls. The latter approach requires a kernel-installed device driver and uses a proprietary protocol to communicate back with the filer cluster, but has the advantage of directly supporting local O/S read/write operations. MaxiScale architectural block diagram:

Overall I like the approach of using commodity systems with direct attached storage as the building block for very high scale storage clusters, but that is hardly unique. Many companies have headed down this path and, generally, it’s the right approach. What caught my interest when I spoke to the MaxiScale team last week was: 1) distributed metadata, 2) MapReduce support, and 3) small file support. Let’s look at each of these major features:

 

Distributed Metadata

File systems need to maintain a namespace. We need to maintain the directory hierarchy and we need to know where to find the storage blocks that make up each file. In addition, other attributes and security information may need to be stored depending upon the implemented file system semantics. This metadata is often stored in a large key/value store. The metadata requires at least some synchronization since, for example, you don’t want to create two different objects of the same name at roughly the same time. At high scale, storage servers will be joining and leaving the cluster all the time, so having a central metadata service is an easy approach to the problem. But, as easy as it is to implement a central metadata system, it brings scaling limits. In fairness, it’s amazing how far central metadata can be scaled, but eventually hot spots develop and it needs to be partitioned. For example, Google GFS just went down this path: GFS: Evolution on Fast-forward. Partitioning metadata is a fairly well understood problem. What makes it a bit of a challenge is making the metadata system adaptive and able to re-partition when hot spots develop.

 

MaxiScale took an interesting approach to scaling the metadata. They distributed the metadata servers over the same servers that store the data rather than implement a cluster of dedicated metadata servers. They do this by hashing on the parent directory to find what they call a Peer Set and then, in that particular Peer Set, they look up the object name in the metadata store, find the file block location, and then apply the operation to the blocks in that same Peer Set.
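
A minimal sketch of that lookup pattern (my reconstruction from the description above, not MaxiScale code; the peer set count is a made-up parameter):

import hashlib

NUM_PEER_SETS = 64   # hypothetical cluster configuration

def peer_set_for(path):
    """Hash the parent directory to pick the peer set that owns this object."""
    parent = path.rsplit("/", 1)[0] or "/"
    digest = hashlib.sha1(parent.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PEER_SETS

# The client hashes the parent directory, then sends both the metadata lookup and
# the block reads and writes to that same peer set -- no central metadata service.
print(peer_set_for("/home/jrh/photos/boat.jpg"))
print(peer_set_for("/home/jrh/photos/engine.jpg"))   # same parent directory, same peer set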

 

Distributing the metadata over the same Peer Set as the stored data means that each peer set is independent and self-describing. The downside of having a fixed hash over the peer sets is that it’s difficult to cool down an excessively hot peer set by moving objects since the hash is known by all clients.

 

MapReduce Support

I love the MaxiScale approach to multi-server administration. They need to provide customers the ability to easily maintain multi-server clusters. They could have implemented a separate control plane to manage all the servers that make up the cluster but, instead, they just use Hadoop and run MapReduce jobs.

 

All administrative operations are written as simple MapReduce jobs, which certainly made the implementation task easier, but it’s also a nice, extensible interface that allows customers to write custom administrative operations. And, since MapReduce is available over the cluster, it’s super easy to write data mining and data analysis jobs. Supporting MapReduce over the storage system is a nice extension of normal filer semantics.

 

Small File Support

The standard file system access model is to probe the metadata to get the block list and then access the blocks to perform the file operation. In the common case, this requires two I/Os, which is fine for large files but expensive for small file access. And, since most filers use fixed-size blocks and, for efficiency, these block sizes tend to be larger than the average small file, some space is wasted. The common approach to these two problems is to pull small files “up”: rather than store the list of storage blocks in the file metadata, just store the small file there directly. This works fine for small files and avoids both the block fragmentation and the multiple-I/O problems. This is what MaxiScale has done as well, and they claim a single I/O for any small file stored in the system.
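
A sketch of the general technique (this is the idea rather than MaxiScale’s actual metadata format; the 4KB threshold is an arbitrary choice):

INLINE_THRESHOLD = 4096   # arbitrary cutoff for what counts as a "small" file

def make_metadata_record(name, data, allocate_blocks):
    """Small files are stored inline in the metadata record; large files get a block list."""
    if len(data) <= INLINE_THRESHOLD:
        return {"name": name, "inline": data}               # metadata read returns the data too
    return {"name": name, "blocks": allocate_blocks(data)}  # metadata read, then block reads

def read_file(record, read_block):
    if "inline" in record:
        return record["inline"]                             # one I/O total
    return b"".join(read_block(block_id) for block_id in record["blocks"])

# Toy usage with an in-memory block store:
store = {}
def allocate(data):
    block_id = len(store)
    store[block_id] = data
    return [block_id]

record = make_metadata_record("readme.txt", b"hello", allocate)
print(read_file(record, store.get))   # b'hello', straight out of the metadata record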

 

More data on the MaxiScale filer: Small Files, Big Headaches: Ensuring Peak Performance

 

I love solutions based upon low-cost, commodity H/W and application-maintained redundancy, and what MaxiScale is doing has many of the features I would like to see in a remote filer.

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

 

Monday, September 21, 2009 6:41:49 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Software
 Wednesday, September 16, 2009

ARM just announced a couple of 2-core SMP designs based upon the Cortex-A9 application processor, one optimized for performance and the other for power consumption (http://www.arm.com/news/25922.html). Although the optimization points are different, both are incredibly low power consumers by server standards, with the performance-optimized part dissipating only 1.9W at 2GHz on the TSMC 40G process (40nm). This design is aimed at server applications and should be able to run many server workloads comfortably.

 

In Linux/Apache on ARM Processors I described an 8-server cluster of web servers running the Marvell MV78100, a single-core ARM design produced by Marvell. It’s a great demonstration system showing that web server workloads can be run cost effectively on ARM-based servers. Toward the end of that blog entry, I observed:

 

The ARM is a clear win on work done per dollar and work done per joule for some workloads. If a 4-core, cache coherent version was available with a reasonable memory controller, we would have a very nice server processor with record breaking power consumption numbers.

 

I got a call from ARM soon after posting saying that I may get my wish sooner than I was guessing. Very cool. The design announced earlier today includes a 2-core, performance-optimized macro that could form the building block of a very nice server. In the following block diagram, ARM shows a pair of 2-core macros implementing a 4-way SMP:

Some earlier multi-core ARM designs such as the Marvell MV78200 are not cache coherent, which makes it difficult to support a single application utilizing both cores. As long as this design is coherent (and I believe it is), I love it.

 

Technically it’s long been possible to build N-way SMP servers based upon single-core Cortex-A9 macros, but it’s quite a bit of design work. The 2-way single macro makes it easy to deliver at least 2-core servers, and this announcement shows that ARM is interested in and investing in developing the ARM-based server market.

 

The ARM reported performance results:

 

In the ARM business model, the release of a design is the first and most important step towards parts becoming available from partners. However, it’s typically at least 12 months from design availability to first shipping silicon from partners, so we won’t likely see components based upon this design until late 2010 at the earliest. I’m looking forward to it.

 

Our industry just keeps getting more interesting.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Wednesday, September 16, 2009 4:05:24 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
Hardware
 Sunday, September 13, 2009

AJAX applications are wonderful because they allow richer web applications with much of the data being brought down asynchronously. The rich and responsive user interfaces of applications like Google Maps and Google Docs are excellent, but JavaScript developers need to walk a fine line. The more code they download, the richer the UI they can support and the fewer synchronous server interactions they need. But the more code they download, the slower the application is to start. This is particularly noticeable when the client cache is cold and in mobile applications with restricted bandwidth back to the server.

 

Years ago, profile-directed code reorganization (a sub-class of basic block transforms) was implemented to solve what might appear to be an unrelated problem. The problem tackled by these profile-directed basic block reorganizations is decreasing the number of last-level cache misses on a server. They do this by grouping frequently accessed code segments together and moving rarely executed code segments away. The biggest gain is that seldom-executed error handling code can be moved away from frequently executed application code; I’ve seen reports of error handling code making up more than 40% of an application. Moving this code away from the commonly executed mainline code allows program execution to be supported by fewer processor cache lines, which means fewer cache misses. Error handling code will execute more slowly, but that is seldom an issue. Profile-directed basic block transforms need to be trained on “typical” application workloads so that code that typically executes together is placed together. Unfortunately, “typical” is often an important, industry standard benchmark like TPC-C, so sometimes “typical” is replaced by “important” :-). Nonetheless, the tools are effective: greater than 20% improvement is common and we often see much more. All commercial database servers use (or misuse) profile-directed basic block reorganization.

 

The JavaScript download problem is actually very similar to the problem addressed by basic block transforms. Getting code from the server takes a relatively long time, just as getting code from memory takes a long time relative to executing code already in the processor cache. Much of the application doesn’t execute in the common case, so it makes little sense to download it all unless it is needed in this execution. Most of the code isn’t needed to start the application, so it’s a big win to download the startup code, start the application, and then download what is needed in the background.

 

Last week Ben Livshits and Emre Kiciman of the Microsoft Research team released an interesting tool that does exactly this for JavaScript applications. Doloto analyzes client JavaScript applications and breaks them up into a series of independent modules. The primary module is downloaded first and includes just stubs for the other modules. This primary module is smaller, downloads faster, and dramatically improves time to a live application. In the Doloto team’s measurements, the size of the initial download was only between 20% and 60% of the size of the standard download. In the case of Google Docs, the initial download was less than 20% of the original size.

Once the initial module is downloaded, the application is live and running, and the rest of the modules are brought down asynchronously or faulted in as needed. Many applications do these optimizations manually, but this is a nice automated approach to the problem.

 

I’ve seen 80,000 line JavaScript programs and there are many out there far larger. Getting the application running fast dramatically improves the user experience and this is a nice approach to achieving that goal.  Doloto is available for download at: http://msdn.microsoft.com/en-us/devlabs/ee423534.aspx. And there is a more detailed Doloto paper at: http://research.microsoft.com/en-us/um/people/livshits/papers/pdf/fse08.pdf and summary information at: http://research.microsoft.com/en-us/projects/doloto/.   

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, September 13, 2009 9:03:46 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Sunday, September 06, 2009

In The Case for Low-Cost, Low-Power Servers, I made the argument that the right measures of server efficiency are work done per dollar and work done per joule. Purchasing servers on single-dimensional metrics like performance or power or even cost alone makes no sense at all. Single-dimensional purchasing leads to micro-optimizations that push one dimension to the detriment of others. Blade servers have been one of my favorite examples of optimizing the wrong metric (Why Blade Servers aren’t the Answer to All Questions). Blades often trade increased cost to achieve server density. But density doesn’t improve work done per dollar, nor does it produce better work done per joule. In fact, density often takes work done per joule in the wrong direction by driving higher power consumption due to the challenge of cooling higher power densities.

 

There is no question that selling in high volume drives price reductions, so client and embedded parts have the potential to be the best price/performing components. And, as focused as the server industry has been on power of late, the best work is still in the embedded systems world, where a cell phone designer would sell their soul for a few more amp-hours if they could have them without extra size or extra weight. Nobody focuses on power as much as embedded systems designers, and many of the tricks arriving in the server world showed up years ago in embedded devices.

 

A very common processor used in cell phone applications is the ARM. The ARM business model is somewhat unusual in that ARM sells a processor design which is then taken and customized by many companies including Texas Instruments, Samsung, and Marvell. These processors find their way into cell phones, printers, networking gear, low-end Storage Area Networks, Network Attached Storage devices, and other embedded applications. The processors produce respectable performance, great price/performance, and absolutely amazing power/performance.

 

Could this processor architecture be used in server applications? The first and most obvious push-back is that it’s a different instruction set architecture, but server software stacks really are not that complex. If you can run Linux and Apache, some web workloads can be hosted, and there are many Linux ports to ARM, so the software will run. The next challenge, and this one is the hard one, is whether the workload partitions into sufficiently fine slices to be hosted on servers built using low-end processors. Memory size limitations are particularly hard to work around in that ARM designs have the entire system on the chip, including the memory controller, and none I’ve seen address more than 2GB. But, for those workloads that do scale sufficiently finely, ARM can work.

 

I’ve been interested in seeing this done for a couple of years and have been watching ARM processors scale up for quite some time. Well, we now have an example. Check out http://www.linux-arm.org/Main/LinuxArmOrg. That web site is hosted on 7 servers, each running the following:

·         Single 1.2Ghz ARM processor, Marvell MV78100

·         1 disk

·         1.5 GB DDR2 with ECC!

·         Debian Linux

·         Nginx web proxy/load balancer

·         Apache web server

 

Note that, unlike Intel Atom based servers, this ARM-based solution has the full ECC memory support we want in server applications (actually, you really want ECC in all applications, from embedded through client to server).

 

Clearly this solution won’t run many server workloads, but it’s a step in the right direction. The problems I have had when scaling systems down to embedded processors have been dominated by two issues: 1) some workloads don’t scale down to sufficiently small slices (what I like to call bad software but, as someone who spent much of his career working on database engines, I probably should know better), and 2) surrounding component and packaging overhead. Basically, as you scale down the processor expense, other server costs begin to dominate. For example, if you halve the processor cost and also halve the throughput, it’s potentially a step backwards since all the other components in the server didn’t also halve in cost. So, in this example, you would get ½ the throughput at something more than ½ the cost. Generally not good. But what’s interesting are those cases where it’s non-linear in the other direction: cut the cost to N% with throughput at M%, where M is much more than N. As these system on a chip (SoC) server solutions improve, this is going to be more common.
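
A quick worked example of that non-linearity; the dollar figures are invented for illustration only:

# Hypothetical server bill of materials, in dollars, to show the shape of the curve.
cpu_cost = 1000.0
other_cost = 1000.0        # memory, disk, board, power supply, chassis, ...
baseline_throughput = 1.0  # normalize the baseline server to 1 unit of work/sec

def work_per_dollar(cpu_cost_fraction, throughput_fraction):
    cost = cpu_cost * cpu_cost_fraction + other_cost
    return baseline_throughput * throughput_fraction / cost

print(work_per_dollar(1.0, 1.0))   # baseline: 0.00050
print(work_per_dollar(0.5, 0.5))   # half the CPU cost, half the throughput: 0.00033 -- worse
print(work_per_dollar(0.2, 0.7))   # CPU at 20% of the cost, 70% of the throughput: 0.00058 -- better
# The win grows as the non-CPU costs shrink too (shared power supplies, SoC
# integration), which is why the surrounding components matter so much.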

 

It’s not always a win, as the discussion above shows, but it is a win for some workloads today. And, if we can get multi-core versions of ARM, it’ll be a clear win for many more workloads. The Marvell MV78200 actually is a two-core SoC, but it’s not cache coherent, which isn’t a useful configuration in most server applications.

 

The ARM is a clear win on work done per dollar and work done per joule for some workloads. If a 4-core, cache coherent version was available with a reasonable memory controller, we would have a very nice server processor with record breaking power consumption numbers. Thanks for the great work ARM and Marvell. I’m looking forward to tracking this work closely and I love the direction it’s taking. Keep pushing.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, September 06, 2009 4:19:41 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Thursday, September 03, 2009

The server tax is what I call the mark-up applied to servers, enterprise storage, and high scale networking gear.  Client equipment is sold in much higher volumes with more competition and, as a consequence, is priced far more competitively. Server gear, even when using many of the same components as client systems, comes at a significantly higher price. Volumes are lower, competition is less, and there are often many lock-in features that help maintain the server tax.  For example, server memory subsystems support Error Correcting Code (ECC) whereas most client systems do not. Ironically both are subject to many of the same memory faults and the cost of data corruption in a client before the data is sent to a server isn’t obviously less than the cost of that same data element being corrupted on the server. Nonetheless, server components typically have ECC while commodity client systems usually do not. 

 

Back in 1987, Garth Gibson, Dave Patterson, and Randy Katz invented the Redundant Array of Inexpensive Disks (RAID). Their key observation was that commodity disks in aggregate could be more reliable than very large, enterprise-class proprietary disks. Essentially, they showed that you didn’t have to pay the server tax to achieve very reliable storage. Over the years, the “inexpensive” component of RAID was rewritten by creative marketing teams as “independent” and high-scale RAID arrays are back to being incredibly expensive. Large Storage Area Networks (SANs) are essentially RAID arrays of “enterprise” class disks, lots of CPU, and huge amounts of cache memory with a Fibre Channel attach. The enterprise tax is back with a vengeance and an EMC NS-960 prices in at $2,800 a terabyte.

 

BackBlaze, a client compute backup company, just took another very innovative swipe at destroying the server tax on storage. Their work shows how to bring the “inexpensive” back to RAID storage arrays and delivers storage at $81/TB. Many services are building secret storage subsystems that deliver super reliable storage at very low cost. What makes the BackBlaze work unique is that they have published the details on how they built the equipment. It’s really very nice engineering.

 

In Petabytes on a budget: How to Build Cheap Cloud Storage they outline the details of the storage pod:

·         1 storage pod per 4U of standard rack space

·         1 $365 motherboard and 4GB of RAM per storage pod

·         2 non-redundant Power Supplies

·         4 SATA cards

·         Case with 6 fans

·         Boot drive

·         9 backplane multipliers

·         45 1.5 TB commodity hard drives at $120 each.

 

Each storage pod runs Apache Tomcat 5.5 on Debian Linux and implements 3 RAID6 volumes of 15 drives each. They provide a full hardware bill of materials in Appendix A of Petabytes on a budget: How to Build Cheap Cloud Storage.
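
A quick sanity check on the cost numbers using the bill of materials above:

drives = 45
cost_per_drive = 120.0    # dollars, from the bill of materials above
pod_capacity_tb = 67      # raw capacity quoted for the pod

drive_cost_per_tb = drives * cost_per_drive / pod_capacity_tb
print(f"${drive_cost_per_tb:.0f}/TB for the drives alone")   # ~$81/TB

# The non-drive parts (board, RAM, PSUs, SATA cards, case, backplanes) push the
# all-in hardware cost per TB somewhat higher, but it remains a tiny fraction of
# the $2,800/TB quoted above for enterprise storage.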

 

Predictably, some have criticized the design as inappropriate for many workloads, and they are right. The I/O bandwidth is low, so this storage pod would be a poor choice for data-intensive applications like OLTP databases. But it’s amazingly good for cold storage like the BackBlaze backup application. Some folks have pointed out that the power supplies are very inefficient at around 80% peak efficiency and that the configuration chosen will have them running far below peak efficiency. True again, but it wouldn’t be hard to replace these two PSUs with a single, 90+% efficiency, commodity unit. Many are concerned with cooling and vibration. I doubt cooling is an issue and, in the blog posting, they addressed the vibration issue and talked briefly about how they isolated the drives. The technique they chose might not be adequate for high-IOPS arrays, but it seems to be working for their workload. Some are concerned by the lack of serviceability in that the drives are not hot swappable and the entire 67TB storage pod has to be brought offline to do drive replacements. Again, this concern is legitimate, but I’m actually not a big fan of hot swapping drives – I always recommend bringing down a storage server before service (I hate risk and complexity). And I hate paying for hot swap gear, and there isn’t space for hot swap in very high density designs. Personally, I’m fine with a “shut-down to service” model but others will disagree.
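
To put the power supply quibble in perspective, a rough illustration; the pod’s DC load here is a guess rather than a measured number:

dc_load_watts = 600.0    # hypothetical DC load for one storage pod
for efficiency in (0.70, 0.80, 0.92):
    wall_watts = dc_load_watts / efficiency
    print(f"{efficiency:.0%} efficient PSU: {wall_watts:.0f}W at the wall, "
          f"{wall_watts - dc_load_watts:.0f}W lost as heat")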

 

The authors compared their hardware storage costs to a wide array of storage subsystems from EMC through Sun and NetApp. They also compared to Amazon S3 and made what is a fairly unusual mistake for a service provider: they compared on-premise storage equipment purchase cost (just the hardware) with a general storage service. The storage pod costs include only hardware, while the S3 costs include data center rack space, power for the array, cooling, administration, inside-the-data-center networking gear, multi-data center redundancy, a general I/O path rather than one only appropriate for cold storage, and all the software to support a highly reliable, geo-redundant storage service. So I’ll quibble with their benchmarking skills -- the comparison is of no value as currently written -- but, on the hardware front, it’s very nice work.

 

Good engineering and a very cool contribution to the industry to publish the design. One more powerful tool to challenge the server tax. Well done Backblaze.

 

VentureBeat article: http://venturebeat.com/2009/09/01/backblaze-sets-its-cheap-storage-designs-free/.


 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Thursday, September 03, 2009 8:13:05 AM (Pacific Standard Time, UTC-08:00)  #    Comments [11] - Trackback
Hardware
 Friday, August 28, 2009

We got back from China last Saturday night and, predictably, I’m swamped catching up on three weeks’ worth of queued work. The trip was wonderful (China Trip) but it’s actually good to be back at work. Things are changing incredibly quickly industry-wide and it’s a fun time to be part of AWS.

 

An AWS feature I’ve been particularly looking forward to seeing announced is Virtual Private Cloud (VPC). It went into private beta two nights back. VPC allows customers to extend their private networks to the cloud through a virtual private network (VPN) and to access their Amazon Web Services Elastic Compute Cloud (EC2) instances with the security they are used to having on their corporate networks. This one is a game changer.

 

Virtual Private Cloud news coverage: http://news.google.com/news/search?pz=1&ned=us&hl=en&q=amazon+virtual+private+cloud.

 

Werner Vogels on VPC: Seamlessly Extending the Data Center – Introducing Amazon Virtual Private Cloud.

 

With VPC, customers can have applications running on EC2 “on” their private corporate networks and accessible only from their corporate networks, just like any other locally hosted application. This is important because it makes it easier to put enterprise applications in the cloud and support the same access rights and restrictions that customers are used to enforcing on locally hosted resources. Applications can more easily move between private, enterprise data centers and the cloud, and hybrid deployments are easier to create and more transparent.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Friday, August 28, 2009 7:07:48 AM (Pacific Standard Time, UTC-08:00)  #    Comments [5] - Trackback
Services
 Saturday, August 01, 2009

I’ll be taking a brief hiatus from blogging during the first three weeks of August. Tomorrow we leave for China. You might wonder why we would go to China during the hottest time of the year. For example, our first stop, Xiamen, is expected to hit 95F today, which is fairly typical weather for this time of year (actually, it’s comparable to the unusual weather we’ve been having in Seattle over the last week). The timing of the trip is driven by a boat we’re buying nearing completion in a Xiamen, China boat yard: Boat Progress. The goal is to see the boat roughly 90% complete so we can catch any issues early and get them fixed before the boat leaves the yard. And part of the adventure of building a boat is getting the chance to visit the yard and see how they are built.

 

We love boating but, having software jobs, we end up working a lot. Consequently, the time we do get off, we spend boating between Olympia, Washington and Alaska. Since we seldom have the time for non-boat related travel, we figured we should take advantage of visiting China and see more than just the boat yard. 

 

After the stop at the boat yard in Xiamen, we’ll visit Hong Kong, Guilin, Yangshou, Chengdu, and do a cruise of the Yangtze River and then travel to Xian followed by Beijing before returning home.  

 

                                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Saturday, August 01, 2009 3:26:27 PM (Pacific Standard Time, UTC-08:00)  #    Comments [8] - Trackback
Ramblings
 Wednesday, July 29, 2009

Search is a market driven by massive network effects and economies of scale. The big get better, the big get cheaper, and the big just keep getting bigger. Google has 65% of the search market and continues to grow. In a deal announced yesterday, Microsoft will supply search to Yahoo! and now has a combined share of 28%. For the first time ever, Microsoft has enough market share to justify continuing large investments. And, more importantly, they now have enough market share to get good data on usage to tune the ranking engine and drive better quality search. And, although Microsoft and Yahoo! will continue to have separate advertising engines and separate sales forces, they will have more user data available to drive the analytics behind their advertising businesses. The search world just got more interesting.

 

The market will continue to unequally reward the biggest player if nothing else changes. Equal focus of skill and investment will continue to yield unequal results. But, at 28% rather than 8%, it’s actually possible to gain share and grow even with the network effects and economies of scale working against you. This is good for the search market, good for the Microsoft Search team, and good for users.

 

NY Times: http://www.nytimes.com/2009/07/30/technology/companies/30soft.html?hpw

WSJ: http://online.wsj.com/article/BT-CO-20090729-709160.html

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Wednesday, July 29, 2009 4:55:40 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Saturday, July 25, 2009

MapReduce has created some excitement in the relational database community. Dave DeWitt and Michael Stonebraker’s MapReduce: A Major Step Backwards is perhaps the best example. In that posting they argued that MapReduce is a poor structured storage technology, that the execution engine doesn’t include many of the advances found in modern, parallel RDBMS execution engines, that it’s not novel, and that it’s missing features.

 

In MapReduce: A Minor Step Forward I argued that MapReduce is an execution model rather than a storage engine. It is true that it is typically run over a file system like GFS or HDFS or a simple structured storage system like BigTable or HBase. But it could be run over a full relational database.

 

Why would we want to run Hadoop over a full relational database? Hadoop scales: it has been scaled to 4,000 nodes at Yahoo! (Scaling Hadoop to 4000 nodes at Yahoo!). Scaling a clustered RDBMS to 4k nodes is certainly possible, but the highest-scale single system image cluster I’ve seen was 512 nodes (what was then called DB2 Parallel Edition). Getting to 4k is big. Hadoop is simple: automatic parallelism has been an industry goal for decades but progress has been limited. There really hasn’t been success in allowing programmers of average skill to write massively parallel programs except for SQL and Hadoop. Programmers of bounded skill can easily write SQL that will be run in parallel over high-scale clusters. Hadoop is the only other example I know of where this is possible and happening regularly.
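
To illustrate the “simple” point, here is a minimal word-count sketch in the MapReduce style. Run serially it is just a local script, but the mapper/reducer pair is all the application logic a MapReduce runtime needs to spread the work over thousands of nodes:

import sys
from itertools import groupby

def mapper(lines):
    """Emit (word, 1) for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Pairs arrive grouped by key; sum the counts for each word."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Run serially this is just word count over stdin; a MapReduce runtime runs the
    # same mapper and reducer in parallel across the cluster.
    for word, count in reducer(mapper(sys.stdin)):
        print(f"{word}\t{count}")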

 

Hadoop makes the application of 100s or even 1000s of commodity computers easy, so why not Hadoop over full RDBMS nodes? Daniel Abadi and his team from Yale and Brown have done exactly that, in this case Hadoop over PostgreSQL. From Daniel’s blog:

 

HadoopDB is:

1.       A hybrid of DBMS and MapReduce technologies targeting analytical query workloads

2.       Designed to run on a shared-nothing cluster of commodity machines, or in the cloud

3.       An attempt to fill the gap in the market for a free and open source parallel DBMS

4.       Much more scalable than currently available parallel database systems and DBMS/MapReduce hybrid systems (see longer blog post).

5.       As scalable as Hadoop, while achieving superior performance on structured data analysis workloads

See: http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html for more detail and http://sourceforge.net/projects/hadoopdb/ for source code for HadoopDB.

 

A more detailed paper has been accepted for publication at VLDB: http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf.

 

The development work for HadoopDB was done using the AWS Elastic Compute Cloud. Nice work, Daniel.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Saturday, July 25, 2009 9:59:47 AM (Pacific Standard Time, UTC-08:00)  #    Comments [5] - Trackback
Services | Software
 Saturday, July 18, 2009

I presented Where does the Power Go in High Scale Data Centers, the opening keynote at SIGMETRICS/Performance 2009, last month. The video of the talk was just posted: SIGMETRICS 2009 Keynote.

 

The talk starts after the conference kick-off at 12:20. The video appears to be incompatible with at least some versions of Firefox. I was only able to stay for the morning of the conference but I met lots of interesting people and got to catch up with some old friends. Thanks to Albert Greenberg and John Douceur for inviting me.

 

I also did the keynote talk at this year’s USENIX Technical Conference 2009 in San Diego. Man, I love San Diego, and USENIX was, as usual, excellent. I particularly enjoyed discussions with the Research in Motion team from Waterloo and the Netflix folks. Both are running high-quality, super-high-growth services with lots of innovation. Thanks to Alec Wolman for inviting me down to this year’s USENIX conference.

 

                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Saturday, July 18, 2009 6:02:13 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Services
 Saturday, July 11, 2009

I’m a boater and I view reading about boating accidents as important. The best source that I’ve come across is the UK’s Marine Accident Investigation Branch (MAIB). I’m an engineer and, again, I view it as important to read about engineering failures and disasters. One of the best sources I know of is Peter G. Neumann’s RISKS Digest.

 

There is no question that firsthand experience is a powerful teacher, but few of us have time (or enough lives) to make every possible mistake. There are just too many ways to screw up. Clearly, it’s worth learning from others when trying to make our own systems safer or more reliable. On that belief, I’m an avid reader of service post mortems. I love understanding what went wrong, thinking about whether those same issues could impact a service in which I’m involved, and working out what should be done to avoid the class of problems under discussion. Some of what I’ve learned around services over the years is written up in this best practices document: http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf originally published at USENIX LISA.

 

One post mortem I came across recently and enjoyed was: Message from discussion Information Regarding 2 July 2009 outage. I liked it because there was enough detail to educate and it presented many lessons. If you own or operate a service or mission critical application, it’s worth a read.

 

                                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Saturday, July 11, 2009 8:15:09 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Friday, July 10, 2009

There have been many reports of the Fisher Plaza data center fire. An early one was the Data Center Knowledge article: Major Outage at Seattle Data Center. Data center fires aren’t as rare as any of us would like but this one is a bit unusual in that fires normally happen in the electrical equipment or switchgear whereas this one appears to have been a bus duct fire. The bus duct fire triggered the sprinkler system. Several sprinkler heads were triggered and considerable water was sprayed making it more difficult to get the facility back online quickly.

 

Several good pictures showing the fire damage were recently published in Tech Flash Photos: Inside the Fisher Fire.

 

                                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Friday, July 10, 2009 5:08:58 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Ramblings
 Thursday, July 09, 2009

MIT’s Barbara Liskov was awarded the 2008 Association for Computing Machinery Turing Award. The Turing Award is the highest distinction in computer science and is often referred to as the Nobel Prize of computing. Past award winners are listed at: http://en.wikipedia.org/wiki/Turing_Award.

The full award citation:

Barbara Liskov has led important developments in computing by creating and implementing programming languages, operating systems, and innovative systems designs that have advanced the state of the art of data abstraction, modularity, fault tolerance, persistence, and distributed computing systems.

The Venus operating system was an early example of principled operating system design. The CLU programming language was one of the earliest and most complete programming languages based on modules formed from abstract data types and incorporating unique intertwining of both early and late binding mechanisms. ARGUS extended many of the CLU ideas to distributed programming, and incorporated the first versions of nested transactions to maintain predictable consistencies. Other advances include solutions elegantly combining theory and pragmatics in the areas of decentralized information flow, replicated storage and caching of persistent objects, and modular upgrading of distributed systems. Her contributions have been incorporated into the practice of programming, thereby influencing many of the most important systems used today: for programming, specification, systems design, and distributed architectures.

From: http://awards.acm.org/citation.cfm?id=1108679&srt=year&year=2008&aw=140&ao=AMTURING

 

The cover article in the July Communications of the ACM was on the award: http://cacm.acm.org/magazines/2009/7/32083-liskovs-creative-joy/fulltext.

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Thursday, July 09, 2009 8:43:43 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings
 Wednesday, July 08, 2009

Our industry has always moved quickly, but the internet and high-scale services have substantially quickened the pace. Search is an amazingly powerful productivity tool available effectively for free to all. The internet makes nearly all information available to anyone who can obtain time on an internet connection. Social networks and interest-area-specific discussion groups are bringing together individuals of like interest from all over the globe. The cost of computing is falling rapidly and new services are released daily. The startup community has stayed viable through one of the most severe economic downturns since the Great Depression. Infrastructure-as-a-service offerings allow new businesses to be built with very little seed investment. I’m amazed at the quality of companies I’m seeing that have 100% bootstrapped without VC funding. Everything is changing.

 

Netbooks have made low-end computers close to free and, in fact, some are sold on the North American cell phone model where a multi-year service contract subsidizes the device. I’ve seen netbooks offered for free with a three-year wireless contract. This morning I came across yet more evidence of healthy change: a new client operating system alternative. The Wall Street Journal reports that Google Plans to Launch Operating Systems for PC (http://online.wsj.com/article/SB124702911173210237.html). Other articles: http://news.google.com/news?q=google+to+launch+operating+system&oe=utf-8&rls=org.mozilla:en-US:official&client=firefox-a&um=1&ie=UTF-8&hl=en&ei=s5hUSsTlO4PUsQPX7dCaDw&sa=X&oi=news_group&ct=title&resnum=1.

 

The new O/S is Linux based and Linux has long been an option on netbooks. What’s different in this case is that a huge commercial interest is behind advancing the O/S and intends to make it a viable platform on more capable client systems rather than just netbooks. These new lightweight, connected products are made viable by the combination of wide-spread connectivity and the proliferation of very high-quality, high-function services. Having a new O/S player in the game will almost certainly increase the rate of improvement.

 

Alternatives continue to emerge, the cost of computing continues to fall, the pace of change continues to quicken, and everyone from individual consumers through the largest enterprises are gaining from the increased pace of innovation. It’s a fun time to participate in this industry.

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Wednesday, July 08, 2009 5:16:35 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Tuesday, June 30, 2009

Microsoft announced yesterday that it was planning to bring both Chicago and Dublin online next month. Chicago is initially to be a 30MW critical load facility with a plan to build out to a booming 60MW. Two thirds of the facility is high-scale containerized capacity. It’s great to see the world’s second modular data center going online (see http://perspectives.mvdirona.com/2009/04/01/RoughNotesDataCenterEfficiencySummitPosting3.aspx for details on an earlier Google facility).

 

The containers in Chicago will hold 1,800 to 2,500 servers each. Assuming 200W/server, that’s roughly 1/2 MW for each container and, with 80 containers on the first floor, a 40MW container critical load. The PUE estimate for the containers is 1.22, which is excellent, but it’s very common to include all power conversions below 480VAC and all air-moving equipment in the container as critical load, so these data can end up not meaning much. See http://perspectives.mvdirona.com/2009/06/15/PUEAndTotalPowerUsageEfficiencyTPUE.aspx for more details on why a better definition of what is infrastructure and what is critical load is needed.
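
To see why that definition matters, here is a small illustration with invented numbers:

# Hypothetical numbers chosen only to show how the accounting changes the answer.
facility_overhead_mw = 8.8           # chillers, UPS and distribution losses outside the containers
delivered_to_containers_mw = 40.0    # everything fed to the containers
in_container_overhead_mw = 4.0       # container fans plus sub-480VAC conversions (a guess)

pue_if_containers_count_as_it = (facility_overhead_mw + delivered_to_containers_mw) / delivered_to_containers_mw
server_only_load_mw = delivered_to_containers_mw - in_container_overhead_mw
pue_if_only_servers_count = (facility_overhead_mw + delivered_to_containers_mw) / server_only_load_mw

print(f"PUE if everything delivered to the containers counts as critical load: {pue_if_containers_count_as_it:.2f}")
print(f"PUE if only the servers count as critical load: {pue_if_only_servers_count:.2f}")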

 

Back on April 10th, Data Center Knowledge asked Is Microsoft still committed to containers?  It looks like the answer is unequivocally YES!

 

Dublin is a non-containerized facility initially 5MW with plans to grow to 22MW as demand requires it. The facility is heavily dependent on air-side economization which should be particularly effective in Dublin.

 

More from:

·         Microsoft Blog: http://blogs.technet.com/msdatacenters/archive/2009/06/29/microsoft-brings-two-more-mega-data-centers-online-in-july.aspx

·         Data Center Knowledge: http://www.datacenterknowledge.com/archives/2009/06/29/microsoft-to-open-two-massive-data-centers/

·         MJF: http://blogs.zdnet.com/microsoft/?p=3200

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Tuesday, June 30, 2009 5:44:41 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Hardware
 Wednesday, June 24, 2009

I presented the keynote at the International Symposium on Computer Architecture 2009 yesterday.  Kathy Yelick kicked off the conference with the other keynote on Monday: How to Waste a Parallel Computer.

 

Thanks to ISCA Program Chair Luiz Barroso for the invitation and for organizing an amazingly successful conference. I’m just sorry I had to leave a day early to attend a customer event this morning. My slides: Internet-Scale Service Infrastructure Efficiency.

 

Abstract: High-scale cloud services provide economies of scale of five to ten over small-scale deployments, and are becoming a large part of both enterprise information processing and consumer services. Even very large enterprise IT deployments have quite different cost drivers and optimization points from internet-scale services. The former are people-dominated from a cost perspective, whereas internet-scale service costs are driven by server hardware and infrastructure, with people costs fading into the noise at less than 10%.

 

In this talk we inventory where the infrastructure costs are in internet-scale services. We track power distribution from 115KV at the property line through all conversions into the data center, tracking the losses to final delivery at semiconductor voltage levels. We track cooling and all the energy conversions from power dissipation through release to the environment outside of the building. Understanding where the costs and inefficiencies lie, we’ll look more closely at cooling and overall mechanical system design, server hardware design, and software techniques including graceful degradation mode, power yield management, and resource consumption shaping.


James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Wednesday, June 24, 2009 6:21:40 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Monday, June 22, 2009

Title: Ten Ways to Waste a Parallel Computer

Speaker: Katherine Yelick

 

An excellent keynote talk at ISCA 2009 in Austin this morning. My rough notes follow:

·         Moore’s law continues

o   Frequency growth replaced by core count growth

·         HPC has been working on this for more than a decade but HPC concerned as well

·         New World Order

o   Performance through parallelism

o   Power is overriding h/w concern

o   Performance is now a software concern

·         What follows are Yelick’s top 10 ways to waste a parallel computer

·         #1: Build system with insufficient memory bandwidth

o   Multicore puts us on the wrong side of the memory wall

o   Key metrics to look at:

§  Memory size/bandwidth (time to fill memory)

§  Memory size * alg intensity / op-per-sec (time to process memory)

·         #2: Don’t Take Advantage of hardware performance features

o   Showed example of speedup from tuning nearest-neighbor 7 point stencil on a 3D array

o   Huge gains but hard to do by hand.  Need to do it automatically at code gen time.

·         #3: Ignore Little’s Law

o   Required concurrency = bandwidth * latency

o   Observation is that most apps are running WAY less than full memory bandwidth [jrh: this isn’t because these apps aren’t memory bound. They are waiting on memory with small requests. Essentially they are memory request latency bound rather than bandwidth bound. They need larger requests or more outstanding requests]

o   To make effective use of the machine, you need:

§  S/W prefetch

§  Pass memory around caches in some cases

·         #4: Turn functional problems into performance problems

o   Fault resilience introduces inhomogeneity in execution rates

o   Showed a graph that showed ECC recovery rates (very common) but that the recovery times are substantial and the increased latency of correction is substantially slowing the computation. [jrh: more evidence that non-ECC designs such as current Intel Atom are not workable in server applications.  Given ECC correction rates, I’m increasingly becoming convinced that non-ECC client systems don’t make sense.]

·         #5: Over-Synchronize Applications

o   View parallel executions as directed acyclic graphs of the computation

o   Hiding parallelism in a library tends to over serialize (too many barriers)

o   Showed work from Jack Dongarra on PLASMA as an example

·         #6: Over-synchronize Communications

o   Use a programming model in which you can’t utilize b/w or “low” latency

o   As an example, compared GASNet and MPI with GASNet delivering far higher bandwidth

·         #7: Run Bad Algorithms

o   Algorithmic gains have far outstripped Moore’s law over the last decade

o   Examples: 1) adaptive meshes rather than uniform, 2) sparse matrices rather than dense, and 3) reformulation of problem back to basics.

·         #8: Don’t rethink your algorithms

o   Showed examples of sparse iterative methods and optimizations possible

·         #9: Choose “hard” applications

o   Examples of such systems

§  Elliptic: steady state, global space dependence

§  Hyperbolic: time dependent, local space dependence

§  Parabolic: time dependent, global space dependence

o   There is often no choice – we can’t just ignore hard problems

·         #10: Use heavy-weight cores optimized for serial performance

o   Used Power5 as an example of a poor design by this measure and showed a stack of “better” performance/power

§  Power5:

·         389 mm^2

·         120W @ 1900 MHz

§  Intel Core2 sc

·         130 mm^2

·         15W @ 1000 MHz

§  PowerPC450 (BlueGene/P)

·         8mm^2

·         3W @ 850 MHz

§  Tensilica (cell phone processor)

·         0.8mm^2

·         0.09W @ 650 MHz

o   [jrh: This last point is not nearly well enough understood. Far too many systems are purchased on performance when they should be purchased on work done per $ and work done per joule.]

·         Note: Large scale machines have 1 unrecoverable memory error (UME) per day [jrh: again more evidence that no-ECC server designs such as current Intel Atom boards simply won’t be acceptable in server applications, nor embedded, and with memory sizes growing evidence continues to mount that we need to move to ECC on client machines as well]

·         HPC community shows that parallelism is key but serial performance can’t be ignored.

·         Each factor of 10 increase in performance, tends to require algorithmic rethinks

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Monday, June 22, 2009 7:04:50 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.
