Saturday, November 20, 2010

I’m interested in low-cost, low-power servers and have been watching the emerging market for these systems since 2008 when I wrote CEMS: Low-Cost, Low-Power Servers for Internet Scale Services (paper, talk).  ZT Systems just announced the R1081e, a new ARM-based server with the following specs:

·         STMicroelectronics SPEAr 1310 with dual ARM® Cortex™-A9 cores

·         1 GB of 1333MHz DDR3 ECC memory embedded

·         1 GB of NAND Flash

·         Ethernet connectivity

·         USB

·         SATA 3.0

 It’s a shared infrastructure design where each 1RU module has 8 of the above servers. Each module includes:

·         8 “System on Modules“ (SOMs)

·         ZT-designed backplane for power and connectivity

·         One 80GB SSD per SOM

·         IPMI system management

·         Two Realtek 4+1 1Gb Ethernet switches on board with external uplinks

·         Standard 1U rack mount form factor

·         Ubuntu Server OS

·         250W 80+ Bronze Power Supply

 

Each module is under 80W, so a rack with 40 compute modules would draw only 3.2 kW for 320 low-power servers, for a total of 640 cores/rack. Weaknesses of this approach are: only 2 cores per server, only 1GB of memory per dual-core server, and the cores appear to run at only 600 MHz (http://www.32bitmicro.com/manufacturers/65-st/552-spear-1310-dual-cortex-a9-for-embedded-applications-). Four-core ARM parts and larger physical memory support are both under development.
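
As a quick sanity check on the rack-level arithmetic above, here’s a back-of-the-envelope sketch using only the figures quoted in this post:

```python
# Back-of-the-envelope check of the ZT R1081e rack numbers quoted above.
modules_per_rack = 40        # 1RU modules in a 40U rack
servers_per_module = 8       # SOMs per module
cores_per_server = 2         # dual Cortex-A9
watts_per_module = 80        # stated worst-case module power

servers = modules_per_rack * servers_per_module                 # 320 servers
cores = servers * cores_per_server                              # 640 cores
rack_power_kw = modules_per_rack * watts_per_module / 1000.0    # 3.2 kW

print(f"{servers} servers, {cores} cores, {rack_power_kw:.1f} kW per rack")
print(f"{rack_power_kw * 1000 / servers:.0f} W/server, "
      f"{rack_power_kw * 1000 / cores:.0f} W/core")
```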

 

Competitors include SeaMicro, with an Atom-based design (SeaMicro Releases Innovative Intel Atom Server), and the recently renamed Calxeda (previously Smooth-Stone), which has an ARM-based product under development.

 

Other notes on low-cost, low-powered servers:

·         SeaMicro Releases Innovative Intel Atom Server

·         When Very Low-Power, Low-Cost Servers don’t make sense

·         Very Low-Power Server Progress

·         The Case for Low-Cost, Low-Power Servers

·         2010 the Year of the Microslice Servers

·         Linux/Apache on ARM processors

·         ARM Cortex-A9 SMP Design Announced

  

From Datacenter Knowledge: New ARM-Based Server from ZT systems

 

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Saturday, November 20, 2010 8:06:34 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Hardware
 Wednesday, November 17, 2010

Earlier this week Clay Magouyrk sent me a pointer to some very interesting work: A Couple More Nails in the Coffin of the Private Compute Cluster: Benchmarks for the Brand New Cluster GPU Instance on Amazon EC2.

 

This article includes detailed benchmark results from runs on the new Cluster GPU Instance type and leads in with:

During the past few years it has been no secret that EC2 has been best cloud provider for massive scale, but loosely connected scientific computing environments. Thankfully, many workflows we have encountered have performed well within the EC2 boundaries. Specifically, those that take advantage of pleasantly parallel, high-throughput computing workflows. Still, the AWS approach to virtualization and available hardware has made it difficult to run workloads which required high bandwidth or low latency communication within a collection of distinct worker nodes. Many of the AWS machines used CPU technology that, while respectable, was not up to par with the current generation of chip architectures. The result? Certain use cases simply were not a good fit for EC2 and were easily beaten by in-house clusters in benchmarking that we conducted within the course of our research. All of that changed when Amazon released their Cluster Compute offering.

 

The author goes on to run the Scalable HeterOgeneous Computing (SHOC) benchmark suite, compares EC2 with native performance, and concludes:

 

With this new AWS offering, the line between internal hardware and virtualized, cloud-based hardware for high performance computing using GPUs has indeed been blurred.

 

Finally, a run with a Cycle Computing customer workload:

Based on the positive results of our SHOC benchmarking, we approached a Fortune 500 Life Science and a Finance/Insurance clients who develop and use their own GPU-accelerated software, to run their applications on the GPU-enabled Cluster Compute nodes. For both applications, the applications perform a large number of Monte Carlo simulations for given set of initial data, all pleasantly parallel. The results, similar to the SHOC result, were that the EC2 GPU-enabled Cluster Compute nodes performed as well as, or better than, the in-house hardware maintained by our clients.

 

Even if you only have a second, give the results a scan: http://blog.cyclecomputing.com/2010/11/a-couple-more-nails-in-the-coffin-of-the-private-compute-cluster-gpu-on-cloud.html.

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Wednesday, November 17, 2010 6:15:32 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Monday, November 15, 2010

A year and a half ago, I did a blog post titled heterogeneous computing using GPGPUs and FPGA. In that note I defined heterogeneous processing as the application of processors with different instruction set architectures (ISAs) under direct application programmer control and pointed out that this really isn’t all that new a concept. We have had multiple ISAs in systems for years: IBM mainframes had I/O processors (Channel I/O Processors) with a different ISA than the general CPUs, many client systems have dedicated graphics coprocessors, and floating point units used to be independent of the CPU instruction set before that functionality was pulled up onto the chip. The concept isn’t new.

 

What is fairly new is 1) the practicality of implementing high-use software kernels in hardware and 2) the availability of commodity-priced parts capable of vector processing. Looking first at moving custom software kernels into hardware, Field Programmable Gate Arrays (FPGAs) are now targeted by some specialized high-level programming languages. You can now write in a subset of C++ and directly implement commonly used software kernels in hardware. These are still the very early days of this technology, but some financial institutions have been experimenting with moving computationally expensive financial calculations into FPGAs to save power and cost. See Platform-Based Electronic System-Level (ESL) Synthesis for more on this.

 

The second major category of heterogeneous computing is much further evolved and is now beginning to hit the mainstream. Graphics Processing Units (GPUs) essentially are vector processing units in disguise. They originally evolved as graphics accelerators, but it turns out a general purpose graphics processor can form the basis of an incredible Single Instruction Multiple Data (SIMD) processing engine. Commodity GPUs have most of the capabilities of the vector units in early supercomputers. What has been missing: they are somewhat difficult to program in that the pace of innovation is high and each model of GPU has differences in architecture and programming model. It’s almost impossible to write code that will run directly over a wide variety of different devices. And the large memories in these graphics processors typically are not ECC protected. An occasional pixel or two wrong doesn’t really matter in graphical output, but you really do need ECC memory for server-side computing.

 

Essentially we have commodity vector processing units that are hard to program and lack ECC. What to do? Add ECC memory and an abstraction layer that hides many of the device-to-device differences. With those two changes, we have amazingly powerful vector units available at commodity prices. One abstraction layer that is getting fairly broad pickup is the Compute Unified Device Architecture, or CUDA, developed by NVIDIA. There are now CUDA runtime support libraries for many programming languages including C, FORTRAN, Python, Java, Ruby, and Perl.
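
To make the abstraction-layer point concrete, here’s a minimal sketch of a CUDA kernel driven from Python through the PyCUDA bindings. This is illustrative only and assumes an NVIDIA GPU and the pycuda package are available; nothing here is specific to any particular GPU model:

```python
import numpy as np
import pycuda.autoinit                  # creates a CUDA context on the default device
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# A trivial SIMD-style kernel: scale a vector by a constant.
mod = SourceModule("""
__global__ void scale(float *out, const float *in, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * k;
}
""")
scale = mod.get_function("scale")

n = 1024
a = np.random.randn(n).astype(np.float32)
out = np.empty_like(a)

# CUDA hides most device-to-device differences; the same source is compiled
# for whatever device is present, so the program runs unchanged as GPUs are upgraded.
scale(drv.Out(out), drv.In(a), np.float32(2.0),
      block=(256, 1, 1), grid=(n // 256, 1))

assert np.allclose(out, 2.0 * a)
```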

 

Current generation GPUs are amazingly capable devices. I’ve covered the speeds and feeds of a couple in past postings: ATI RV770 and the NVIDIA GT200.

 

Bringing it all together, we have commodity vector units with ECC and an abstraction layer that makes them easier to program and allows programs to run unchanged as devices are upgraded. Using GPUs to host general compute kernels is generically referred to as General Purpose Computation on Graphics Processing Units (GPGPU). So what is missing at this point? The pay-as-you-go economics of cloud computing.

 

You may recall I was excited last July when Amazon Web Services announced the Cluster Compute Instance: High Performance Computing Hits the Cloud. The EC2 Cluster Compute Instance is capable but lacks a GPU:

·         23GB memory with ECC

·         64GB/s main memory bandwidth

·         2 x Intel Xeon X5570 (quad-core Nehalem)

·         2 x 845GB HDDs

·         10Gbps Ethernet Network Interface

 

What Amazon Web Services just announced is a new instance type with the same core instance specifications as the cluster compute instance above, the same high-performance network, but with the addition of two NVIDIA Tesla M2050 GPUs in each server. See supercomputing at 1/10th the Cost. Each of these GPGPUs is capable of over 512 GFLOPs and so, with two of these units per server, there is a booming teraFLOP per node.

 

Each server in the cluster is equipped with a 10Gbps network interface card connected to a constant bisection bandwidth networking fabric. Any node can communicate with any other node at full line rate. It’s a thing of beauty and a forerunner of what every network should look like.

 

There are two full GPUs in each Cluster GPU instance each of which dissipates 225W TDP. This felt high to me when I first saw it but, looking at work done per watt, it’s actually incredibly good for workloads amenable to vector processing.  The key to the power efficiency is the performance. At over 10x the performance of a quad core x86, the package is both power efficient and cost efficient.

 

The new cg1.4xlarge EC2 instance type:

·         2 x  NVIDIA Tesla M2050 GPUs

·         22GB memory with ECC

·         2 x Intel Xeon X5570 (quad-core Nehalem)

·         2 x 845GB HDDs

·         10Gbps Ethernet Network Interface

 

With this most recent announcement, AWS now has dual quad-core servers, each with dual GPGPUs, connected by a 10Gbps full-bisection bandwidth network for $2.10 per node hour. That’s roughly $2.10 per teraFLOP-hour. Wow.
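
A quick sanity check of that figure, using the “over 512 GFLOPs” per M2050 cited above (a rough back-of-the-envelope sketch, not an official benchmark):

```python
gpu_gflops = 512                    # peak per NVIDIA Tesla M2050, as cited above
gpus_per_node = 2
node_tflops = gpus_per_node * gpu_gflops / 1000.0   # ~1.0 TFLOP per cg1.4xlarge node

price_per_node_hour = 2.10                          # cg1.4xlarge on-demand price
# Roughly $2 per teraFLOP-hour of peak GPU capacity.
print(f"{node_tflops:.2f} TFLOPs/node -> "
      f"${price_per_node_hour / node_tflops:.2f} per teraFLOP-hour")
```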

·         The Amazon Cluster GPU Instance type announcement: Announcing Cluster GPU Instances for EC2

·         More information on the EC2 Cluster GPU instances: http://aws.amazon.com/ec2/hpc-applications/

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

 

Monday, November 15, 2010 4:57:46 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Sunday, November 14, 2010

The Top 500 Super Computer Sites list was just updated and the AWS Compute Cluster is now officially in the top half of the list. That means when you line up all the fastest, multi-million dollar, government-lab-sponsored supercomputers from #1 through #500, the AWS Compute Cluster Instance is at #231.

 

Rank: 231
Site: Amazon Web Services, United States
System: Amazon EC2 Cluster Compute Instances - Amazon EC2 Cluster, Xeon X5570 2.95 GHz, 10G Ethernet / 2010 (Self-made)
Cores: 7,040
Rmax: 41.82 TFlop/s
Rpeak: 82.51 TFlop/s

One of the fastest supercomputers in the world for $1.60/node hour. Cloud computing changes the economics in a pretty fundamental way.
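
To put the economics in perspective, here’s a back-of-the-envelope sketch using the 7,040 cores in the Top500 entry above and the 8 cores per Cluster Compute node (illustrative arithmetic only):

```python
total_cores = 7040                 # from the Top500 entry above
cores_per_node = 8                 # 2 x quad-core Xeon X5570 per instance
price_per_node_hour = 1.60         # Cluster Compute on-demand price cited above

nodes = total_cores // cores_per_node          # 880 instances
hourly = nodes * price_per_node_hour
print(f"{nodes} instances -> ${hourly:,.0f}/hour for a top-half Top500 system")
```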

More details at: High Performance Computing Hits the Cloud.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

Sunday, November 14, 2010 11:21:19 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Friday, November 12, 2010

Kushagra Vaid presented Datacenter Power Efficiency at HotPower ’10. The URL to the slides is below and my notes follow. Interesting across the board but most notable for all the detail on the four major Microsoft datacenters:

 

·         Services speeds and feeds:

o   Windows LiveID: More than 1B authentications/day

o   Exchange Hosted Services: 2 to 4B email messages per day

o   MSN: 550M unique visitors monthly

o   Hotmail: 400M active users

o   Messenger: 320M active users

o   Bing: >2B queries per month

·         Microsoft Datacenters: 141MW total, 2.25M sq ft

o   Quincy:

§  27MW

§  500k sq ft total

§  54W/sq ft

o   Chicago:

§  60MW

§  707k sq ft total

§  85W/sq ft

§  Over $500M invested

§  $8.30/W (assuming $500M cost)

§  3,400 tons of steel

§  190 miles of conduit

§  2,400 tons of copper

§  26,000 yards of concrete

§  7.5 miles of chilled water piping

§  1.5M man hours of labor

§  Two floors of IT equipment:

·         Lower floor: medium reliability container bay (standard ISO containers supplied by Dell)

·         Upper floor: high reliability traditional colo facility

§  Note the picture of the upper floor shows colo cages suggesting that Microsoft may not be using this facility for their internal workloads.

§  PUE: 1.2 to 1.5

§  3,000 construction related jobs

o   Dublin:

§  27MW

§  570k sq ft total

§  47W/sq ft

o   San Antonio:

§  27MW

§  477k sq ft total

§  57W/sq ft

·         The power densities range between 45 and 85 W/sq ft, which is incredibly low for relatively new builds (the short script after these notes reproduces this arithmetic). I would have expected something in the 200W/sq ft to 225W/sq ft range and perhaps higher. I suspect the floor space numbers are gross floor space numbers including mechanical, electrical, and office space rather than raised floor.

·         Kushagra reports typical build costs range from $10 to $15/W

o   Based upon the Chicago data point, the Microsoft build costs are around $8/W. Perhaps a bit more in the other facilities in that Chicago isn’t fully redundant on the lower floor, whereas the generator counts at the other facilities suggest they are.

·         Gen 4 design

o   Modular design but the use of ISO standard containers has been discontinued. However, the modules are prefabricated and delivered by flatbed truck

o   No mechanical cooling

§  Very low water consumption

o   30% to 50% lower costs

o   PUE of 1.05 to 1.1

·         Analysis of AC vs. DC power distribution concluding that DC is more efficient at low loads and AC at high loads

o   Overall, the best designs are within 1 to 2% of each other

·         Recommends higher datacenter temperatures

·         Key observation:

o   The best price/performance and power/performance is often found in lower-power processors

·         Kushagra found substantial power savings using C6 and showed server idle power can be dropped to 31% of full load power

o   Good servers typically run in the 45 to 50% range and poor designs can be as bad as 80%
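
For reference, here is a short script that reproduces the power-density and build-cost arithmetic from the notes above (the gross square footage figures are as reported in the slides; the $500M Chicago cost is the assumption noted above):

```python
# Reproduces the W/sq ft and $/W arithmetic from the HotPower '10 notes above.
datacenters = {
    # name: (critical power in MW, gross square feet)
    "Quincy":      (27, 500_000),
    "Chicago":     (60, 707_000),
    "Dublin":      (27, 570_000),
    "San Antonio": (27, 477_000),
}

for name, (mw, sqft) in datacenters.items():
    print(f"{name:12s} {mw * 1e6 / sqft:5.0f} W/sq ft")

chicago_cost = 500e6        # assumed Chicago build cost (from the slides)
print(f"Chicago build cost: ${chicago_cost / 60e6:.2f}/W")
```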

 

The slides: http://mvdirona.com/jrh/TalksAndPapers/KushagraVaidMicrosoftDataCenters.pdf

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Friday, November 12, 2010 12:02:55 PM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
Services
 Sunday, October 31, 2010

I did a talk earlier this week on the sea change currently taking place in datacenter networks. In Datacenter Networks are in my Way I start with an overview of where the costs are in a high-scale datacenter. With that backdrop, we note that networks are fairly low power consumers relative to total facility consumption and not even close to the dominant cost. Are they actually a problem? The rest of the talk argues that networks actually are a huge problem across the board, including cost and power. Overall, networking gear lags behind the rest of the high-scale infrastructure world, blocks many key innovations, and is actually both a cost and a power problem when we look deeper.

 

The overall talk agenda:

·         Datacenter Economics

·         Is Net Gear Really the Problem?

·         Workload Placement Restrictions

·         Hierarchical & Over-Subscribed

·         Net Gear: SUV of the Data Center

·         Mainframe Business Model

·         Manually Configured & Fragile at Scale

·         New Architecture for Networking

 

In a classic network design, there is more bandwidth within a rack and more within an aggregation router than across the core. This is because the network is over-subscribed. Consequently, instances of a workload often need to be placed topologically near other instances of the workload, near storage, near app tiers, or on the same subnet. All these placement restrictions make the already over-constrained workload placement problem even more difficult. The result is either the constraints are not met, which yields poor workload performance, or the constraints are met but overall server utilization is lower due to accepting them. What we want is all points in the datacenter equidistant and no constraints on workload placement.

 

Continuing on the over-subscription problem mentioned above, data-intensive workloads like MapReduce and high performance computing workloads run poorly on over-subscribed networks. It’s not at all uncommon for a MapReduce job to transport the entire data set at least once over the network during a run. The costs of providing a flat, all-points-equidistant network are so high that most just accept the constraint and either run MapReduce poorly or run it only in narrow parts of the network (accepting workload placement constraints).
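
To make the over-subscription problem concrete, here is a small sketch with hypothetical but typical numbers; the specific port and uplink counts below are assumptions for illustration, not figures from the talk:

```python
# Hypothetical top-of-rack switch: 40 servers at 1 Gbps each, 2 x 10 Gbps uplinks.
servers_per_rack = 40
server_nic_gbps = 1
uplinks = 2
uplink_gbps = 10

downlink_bw = servers_per_rack * server_nic_gbps   # 40 Gbps into the rack
uplink_bw = uplinks * uplink_gbps                  # 20 Gbps out of the rack
print(f"Over-subscription ratio: {downlink_bw / uplink_bw:.0f}:1")

# A MapReduce job shuffling its data set across racks sees only the uplink
# share of bandwidth, so with 2:1 over-subscription and fully contended links
# cross-rack transfers take roughly twice as long as on a 1:1 network.
```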

 

Net gear doesn’t consume much power relative to total datacenter power consumption – other gear in the data center is, in aggregate, much worse. However, network equipment power is absolutely massive today and it is trending up fast. A fully configured Cisco Nexus 7000 requires 8 circuits of 5kW each. Admittedly some of that power is for redundancy, but how can 120 ports possibly require as much provisioned power as 4 average-sized full racks of servers? Net gear is the SUV of the datacenter.

 

The network equipment business model is broken. We love the server business model where we have competition at the CPU level, more competition at the server level, and an open source solution for control software.  In the networking world, it’s a vertically integrated stack and this slows innovation and artificially holds margins high. It’s a mainframe business model.

New solutions are now possible with competing merchant silicon from Broadcom, Marvell, and Fulcrum and competing switch designs built on all three. We don’t yet have the open source software stack but there are some interesting possibilities on the near-term horizon, with OpenFlow being perhaps the most interesting enabler. More on the business model and why I’m interested in OpenFlow at: Networking: The Last Bastion of Mainframe Computing.

 

Talk slides: http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_POA20101026_External.pdf

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, October 31, 2010 11:30:28 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Hardware | Services
 Thursday, October 21, 2010

What happens when you really, really focus on efficient infrastructure and driving down costs while delivering a highly available, high performance service? Well, it works. Costs really do fall and the savings can be passed on to customers.

 

AWS prices have been falling for years but this is different. It’s now possible to offer a small EC2 instance for free. You can now have 750 hours of EC2 usage per month without charge.

 

The details:

 

·         750 hours of Amazon EC2 Linux Micro Instance usage (613 MB of memory and 32-bit and 64-bit platform support) – enough hours to run continuously each month*

·         750 hours of an Elastic Load Balancer plus 15 GB data processing*

·         10 GB of Amazon Elastic Block Storage, plus 1 million I/Os, 1 GB of snapshot storage, 10,000 snapshot Get Requests and 1,000 snapshot Put Requests*

·         5 GB of Amazon S3 storage, 20,000 Get Requests, and 2,000 Put Requests*

·         30 GB of internet data transfer (15 GB of data transfer “in” and 15 GB of data transfer “out” across all services except Amazon CloudFront)*

·         25 Amazon SimpleDB Machine Hours and 1 GB of Storage**

·         100,000 Requests of Amazon Simple Queue Service**

·         100,000 Requests, 100,000 HTTP notifications and 1,000 email notifications for Amazon Simple Notification Service**

 

More information: http://aws.amazon.com/free/

 

                                      --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

 

Thursday, October 21, 2010 4:35:44 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Services
 Tuesday, October 19, 2010

Long-time Amazon Web Services veteran Jeff Barr has written a book on AWS. Jeff’s been with AWS since the very early days and he knows the services well. The new book, Host Your Web Site in the Cloud: Amazon Web Services Made Easy, covers each of the major AWS services and how to write code against them, with code examples in PHP. It covers S3, EC2, SQS, EC2 Monitoring, Auto Scaling, Elastic Load Balancing, and SimpleDB.

 

 Recommended if you are interested in Cloud Computing and AWS: http://www.amazon.com/Host-Your-Web-Site-Cloud/dp/0980576830.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Tuesday, October 19, 2010 7:21:29 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Sunday, October 10, 2010

This morning I came across an article written by Sid Anand, an architect at Netflix, that is super interesting. I liked it for two reasons: 1) it talks about the move of substantial portions of a high-scale web site to the cloud, some of how it was done, and why it was done, and 2) it gives best practices on AWS SimpleDB usage.

 

I love articles about how high scale systems work. Some past postings:

 

The article starts off by explaining why Netflix decided to move their infrastructure to the cloud:

 

Circa late 2008, Netflix had a single data center. This single data center raised a few concerns. As a single-point-of-failure, it represented a liability – data center outages meant interruptions to service and negative customer impact. Additionally, with growth in both streaming adoption and subscription levels, Netflix would soon outgrow this data center – we foresaw an imminent need for more power, better cooling, more space, and more hardware.

 

Our option was to build more data centers. Aside from high upfront costs, this endeavor would likely tie up key engineering resources in data center scale out activities, making them unavailable for new product initiatives.  Additionally, we recognized the management of multiple data centers to be a complex task. Building out and managing multiple data centers seemed a risky distraction.

 

Rather than embarking on this path, we chose a more radical one. We decided to leverage one of the leading IAAS offerings at the time, Amazon Web Services. With multiple, large data centers already in operation and multiple levels of redundancy in various web services (e.g. S3 and SimpleDB), AWS promised better availability and scalability in a relatively short amount of time.

 

By migrating various network and back-end operations to a 3rd party cloud provider, Netflix chose to focus on its core competency: to deliver movies and TV shows.

 

I’ve read considerable speculation over the years on the difficulty of moving to cloud services. Some I agree with – these migrations do take engineering investment – while other reports seem less well thought through, focusing mostly on repeating concerns speculated upon by others. Often, the information content is light.

 

I know the move to the cloud can be done and is done frequently because, where I work, I’m lucky enough to see it happen every day. But the Netflix example is particularly interesting in that 1) Netflix is a fairly big enterprise with a market capitalization of $7.83B – moving this infrastructure is substantial and represents considerable complexity, so it is a great example of what can be done; 2) Netflix is profitable and has no economic need to make the change – they made the decision to avoid distraction and stay focused on the innovation that made the company as successful as it is; and 3) they are willing to contribute their experiences back to the community. Thanks to Sid and Netflix for the latter.

 

For more detail, check out the more detailed document that Sid Anand has posted: Netflix’s Transition to High-Availability Storage Systems.

 

For those SimpleDB readers, here’s a set of SimpleDB best practices from Sid’s write-up:

 

·         Partial or no SQL support. Loosely speaking, SimpleDB supports a subset of SQL

o   Do GROUP BY and JOIN operations in the application layer

o   One way to avoid the need for JOINs is to denormalize multiple Oracle tables into a single logical SimpleDB domain.

·         No relations between domains

o   Compose relations in the application layer

·         No transactions

o   Use SimpleDB’s Optimistic Concurrency Control API: ConditionalPut and ConditionalDelete

·         No Triggers

o   Do without

·         No PL/SQL

o   Do without

·         No schema - This is non-obvious. A query for a misspelled attribute name will not fail with an error

o   Implement a schema validator in a common data access layer

·         No sequences

o   Sequences are often used as primary keys

§  In this case, use a naturally occurring unique key. For example, in a Customer Contacts domain, use the customer’s mobile phone number as the item key

§  If no naturally occurring unique key exists, use a UUID

o   Sequences are also often used for ordering

§  Use a distributed sequence generator

·         No clock operations

o   Do without

·         No constraints. Specifically, no uniqueness, no foreign key, no referential & no integrity constraints

o   Application can check constraints on read and repair data as a side effect. This is known as read-repair. Read-repair should use the ConditionalPut or ConditionalDelete API to make atomic changes
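
As an illustration of the optimistic concurrency point above, here is a minimal sketch of a conditional put using the boto3 SDK (which postdates this post); the domain, item, and attribute names are hypothetical:

```python
import boto3

sdb = boto3.client("sdb", region_name="us-east-1")

# Read the current version, then write only if nobody else has changed it.
# If another writer got there first, the Expected clause makes the put fail,
# and we can re-read and retry (read-repair follows the same pattern).
item = sdb.get_attributes(DomainName="CustomerContacts",
                          ItemName="+12065550100",
                          ConsistentRead=True)
attrs = {a["Name"]: a["Value"] for a in item.get("Attributes", [])}
current_version = attrs.get("version", "0")

sdb.put_attributes(
    DomainName="CustomerContacts",
    ItemName="+12065550100",
    Attributes=[
        {"Name": "email", "Value": "someone@example.com", "Replace": True},
        {"Name": "version", "Value": str(int(current_version) + 1), "Replace": True},
    ],
    Expected={"Name": "version", "Value": current_version},
)
# (For the very first write, when no version attribute exists yet, the condition
# would instead be Expected={"Name": "version", "Exists": False}.)
```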

 

If you are interested in high-scale web sites or cloud computing in general, this one is worth a read: Netflix’s Transition to High-Availability Storage Systems.

 

                                                --jrh

Update:  Netflix architects Adrian Cockcroft and Sid Anand are both presenting at QconSF between November 1st and 5th in San Francisco: http://qconsf.com/sf2010.

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, October 10, 2010 11:53:44 AM (Pacific Standard Time, UTC-08:00)  #    Comments [13] - Trackback
Services
 Saturday, October 09, 2010

Hosting multiple MySQL engines with MySQL Replication between them is a common design pattern for scaling read-heavy MySQL workloads. As with all scaling techniques, there are workloads for which it works very well, but there are also potential issues that need to be understood. In this case, all write traffic is directed to the primary server and, consequently, is not scaled, which is why this technique works best for workloads heavily skewed towards reads. But, for those fairly common read-heavy workloads, the technique works very well and allows scaling the read workload across a fleet of MySQL instances. Of course, as with any asynchronous replication scheme, the read replicas are not transactionally updated. So any application running on MySQL read replicas must be tolerant of eventually consistent updates.

 

Load balancing high read traffic over multiple MySQL instances works very well but this is only one of the possible tools used to scale this type of workload. Another very common technique is to put a scalable caching layer in front of the relational database fleet. By far the most common caching layer used by high-scale services is Memcached.

 

Another database scaling technique is to simply not use a relational database. For workloads that don’t need schema enforcement and complex queries, NoSQL databases offer both a cheaper and a simpler approach to hosting the workload. SimpleDB is the AWS-hosted NoSQL database, with Netflix being one of the best known users (slide deck from Netflix’s Adrian Cockcroft: http://www.slideshare.net/adrianco/netflix-oncloudteaser). Cassandra is another common RDBMS alternative in heavy use by many high-scale sites including Facebook, where it was originally conceived. Cassandra is also frequently run on AWS, with the Cassandra Wiki offering scripts to make it easy to install and configure on Amazon EC2.

 

For those workloads where a relational database is the chosen solution, MySQL read replication is a good technique to have in your scaling tool kit. Last week Amazon announced read replica support for the AWS Relational Database Service. The press release is at: Announcing Read Replicas, Lower High Memory DB Instance Price for Amazon AWS.

 

You can now create one or more replicas of a given “source” DB Instance and serve incoming read traffic from multiple copies of your data. This new database deployment option enables you to elastically scale out beyond the capacity constraints of a single DB Instance for read-heavy database workloads. You can use Read Replicas in conjunction with Multi-AZ replication for scalable, reliable, and highly available production database deployments.

 

If you are running MySQL and wish you had someone else to manage it for you, check out Amazon RDS. The combination of read replicas to scale read workloads and Multi-AZ support for multi-data center, high availability make it a pretty interesting way to run MySQL.
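
For reference, creating a read replica is a single API call. Here is a minimal sketch using the boto3 SDK (which postdates this post); the instance identifiers and instance class are hypothetical:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Create a read replica of an existing MySQL DB instance. RDS handles the
# snapshot, restore, and ongoing asynchronous MySQL replication for you.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="myapp-read-replica-1",      # new replica name
    SourceDBInstanceIdentifier="myapp-primary",       # existing source instance
    DBInstanceClass="db.m1.large",                    # hypothetical instance class
)

# Point read-heavy traffic at the replica's endpoint; remember the replica is
# asynchronously updated, so reads may be slightly stale (eventually consistent).
```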

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Saturday, October 09, 2010 10:05:02 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
Services
 Saturday, September 18, 2010

A couple of years ago, I did a detailed look at where the costs are in a modern, high-scale data center. The primary motivation behind bringing all the costs together was to understand where the problems are and find those easiest to address. Predictably, when I first brought these numbers together, a few data points just leapt off the page: 1) at scale, servers dominate overall costs, and 2) mechanical system cost and power consumption seem unreasonably high. Both of these areas have proven to be important technology areas to focus upon, and there has been considerable industry-wide innovation, particularly in cooling efficiency, over the last couple of years.

 

I posted the original model at the Cost of Power in Large-Scale Data Centers. One of the reasons I posted it was to debunk the often-repeated phrase “power is the dominant cost in a large-scale data center”. Servers dominate, with mechanical systems and power distribution close behind. It turns out that power is incredibly important but it’s not the utility kWh charge that makes power important. It’s the cost of the power distribution equipment required to consume power and the cost of the mechanical systems that take the heat away once the power is consumed. I referred to this as fully burdened power.

 

Measured this way, power is the second most important cost. Power efficiency is highly leveraged when looking at overall data center costs, plays an important role in environmental stewardship, and is one of the areas where substantial gains continue to look quite attainable. As a consequence, this is where I spend a considerable amount of my time – perhaps the majority – but we have to remember that servers still dominate the overall capital cost.

 

This last point is a frequent source of confusion.  When server and other IT equipment capital costs are directly compared with data center capital costs, the data center portion actually is larger. I’ve frequently heard “how can the facility cost more than the servers in the facility – it just doesn’t make sense.”  I don’t know whether or not it makes sense but it actually is not true at this point. I could imagine the infrastructure costs one day eclipsing those of servers as server costs continue to decrease but we’re not there yet. The key point to keep in mind is the amortization periods are completely different. Data center amortization periods run from 10 to 15 years while server amortizations are typically in the three year range. Servers are purchased 3 to 5 times during the life of a datacenter so, when amortized properly, they continue to dominate the cost equation.

 

In the model below, I normalize all costs to a monthly bill by taking consumables like power and billing them monthly by consumption, and by taking capital expenses like servers, networking, or datacenter infrastructure and amortizing them over their useful lifetimes using a 5% cost of money and, again, billing monthly. This approach allows us to compare non-comparable costs such as data center infrastructure with servers and networking gear, each with different lifetimes. The model includes all costs “below the operating system” but doesn’t include software licensing costs, mostly because open source is dominant in high-scale centers and partly because licensing costs can vary so widely. Administrative costs are not included for the same reason. At scale, hardware administration, security, and other infrastructure-related people costs disappear into the single digits, with the very best services down in the 3% range. Because administrative costs vary so greatly, I don’t include them here. On projects with which I’ve been involved, they are insignificantly small so don’t influence my thinking much. I’ve attached the spreadsheet in source form below so you can add in factors such as these if they are more relevant in your environment.
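
The amortization itself is just a standard annuity calculation. Here is a minimal sketch of how capital costs convert to a monthly charge at a 5% cost of money; the capital figures below are hypothetical placeholders for illustration, not the model’s actual inputs:

```python
def monthly_cost(capital, years, annual_rate=0.05):
    """Amortize a capital expense into a constant monthly payment."""
    r = annual_rate / 12.0              # monthly cost of money
    n = years * 12                      # number of monthly payments
    return capital * r / (1.0 - (1.0 + r) ** -n)

# Illustrative (made-up) inputs: servers amortize over 3 years, the facility
# shell plus power and cooling infrastructure over 10+ years.
servers = monthly_cost(40e6, 3)         # e.g., ~46k servers at roughly $870 each
facility = monthly_cost(88e6, 10)       # e.g., an 8MW facility at roughly $11/W

# With these made-up inputs, servers come to ~$1.2M/month vs ~$0.93M/month for
# the facility -- illustrating why servers dominate once the very different
# amortization periods are accounted for.
print(f"Servers:  ${servers:,.0f}/month")
print(f"Facility: ${facility:,.0f}/month")
```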

 

Late last year I updated the model for two reasons: 1) there has been considerable infrastructure innovation over the last couple of years and costs have changed dramatically during that period and 2) because of the importance of networking gear to the cost model, I factor out networking from overall IT costs. We now have IT costs with servers and storage modeled separately from networking. This helps us understand the impact of networking on overall capital cost and on IT power.

 

When I redo these data, I keep the facility server count in the 45,000 to 50,000 server range. This makes it a reasonable-scale facility – big enough to enjoy the benefits of scale but nowhere close to the biggest data centers. Two years ago, 50,000 servers required a 15MW facility (25MW total load). Today, due to increased infrastructure efficiency and reduced individual server power draw, we can support 46k servers in an 8MW facility (12MW total load). The current rate of innovation in our industry is substantially higher than it has been any time in the past, with much of this innovation driven by mega service operators.

 

Keep in mind, I’m only modeling those techniques that are well understood and reasonably broadly accepted as good-quality data center design practices. Most of the big operators will be operating at efficiency levels far beyond those used here. For example, in this model we’re using a Power Usage Effectiveness (PUE) of 1.45, but Google reports a fleet-wide PUE of under 1.2: Data Center Efficiency Measurements. Again, the spreadsheet source is attached below so feel free to change the PUE used by the model as appropriate.

 

These are the assumptions used by this year’s model:

 

Using these assumptions we get the following cost structure:

 

 

 

For those of you interested in playing with different assumptions, the spreadsheet source is here: http://mvdirona.com/jrh/TalksAndPapers/PerspectivesDataCenterCostAndPower.xls.

 

If you choose to use this spreadsheet directly or the data above, please reference the source and include the URL to this posting.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Saturday, September 18, 2010 12:56:19 PM (Pacific Standard Time, UTC-08:00)  #    Comments [16] - Trackback
Services
 Tuesday, September 14, 2010

For those of you writing about your work on high-scale cloud computing (and for those interested in a great excuse to visit Anchorage, Alaska), consider submitting a paper to the Workshop on Data Intensive Computing in the Clouds (DataCloud 2011). The call for papers is below.

 

                                                                --jrh

 

-------------------------------------------------------------------------------------------

                                           *** Call for Papers ***
      WORKSHOP ON DATA INTENSIVE COMPUTING IN THE CLOUDS (DATACLOUD 2011)
                  In conjunction with IPDPS 2011, May 16, Anchorage, Alaska
                                
http://www.cct.lsu.edu/~kosar/datacloud2011
-------------------------------------------------------------------------------------------

The First International Workshop on Data Intensive Computing in the Clouds (DataCloud2011) will be held in conjunction with the 25th IEEE International Parallel and Distributed Computing Symposium (IPDPS 2011), in Anchorage, Alaska.

Applications and experiments in all areas of science are becoming increasingly complex and more demanding in terms of their computational and data requirements. Some applications generate data volumes reaching hundreds of terabytes and even petabytes. As scientific applications become more data intensive, the management of data resources and dataflow between the storage and compute resources is becoming the main bottleneck. Analyzing, visualizing, and disseminating these large data sets has become a major challenge and data intensive computing is now considered as the “fourth paradigm” in scientific discovery after theoretical, experimental, and computational science.

DataCloud2011 will provide the scientific community a dedicated forum for discussing new research, development, and deployment efforts in running data-intensive computing workloads on Cloud Computing infrastructures. The DataCloud2011 workshop will focus on the use of cloud-based technologies to meet the new data intensive scientific challenges that are not well served by the current supercomputers, grids or compute-intensive clouds. We believe the workshop will be an excellent place to help the community define the current state, determine future goals, and present architectures and services for future clouds supporting data intensive computing.

Topics of interest include, but are not limited to:

- Data-intensive cloud computing applications, characteristics, challenges
- Case studies of data intensive computing in the clouds
- Performance evaluation of data clouds, data grids, and data centers
- Energy-efficient data cloud design and management
- Data placement, scheduling, and interoperability in the clouds
- Accountability, QoS, and SLAs
- Data privacy and protection in a public cloud environment
- Distributed file systems for clouds
- Data streaming and parallelization
- New programming models for data-intensive cloud computing
- Scalability issues in clouds
- Social computing and massively social gaming
- 3D Internet and implications
- Future research challenges in data-intensive cloud computing

IMPORTANT DATES:
Abstract submission: December 1, 2010
Paper submission: December 8, 2010
Acceptance notification: January 7, 2011
Final papers due: February 1, 2011

PAPER SUBMISSION:
DataCloud2011 invites authors to submit original and unpublished technical papers. All submissions will be peer-reviewed and judged on correctness, originality, technical strength, significance, quality of presentation, and relevance to the workshop topics of interest. Submitted papers may not have appeared in or be under consideration for another workshop, conference or a journal, nor may they be under review or submitted to another forum during the DataCloud2011 review process. Submitted papers may not exceed 10 single-spaced double-column pages using 10-point size font on 8.5x11 inch pages (IEEE conference style), including figures, tables, and references. DataCloud2011 also requires submission of a one-page (~250 words) abstract one week before the paper submission deadline.

WORKSHOP and PROGRAM CHAIRS:
Tevfik Kosar, Louisiana State University
Ioan Raicu, Illinois Institute of Technology

STEERING COMMITTEE:
Ian Foster, Univ of Chicago & Argonne National Lab
Geoffrey Fox, Indiana University
James Hamilton, Amazon Web Services
Manish Parashar, Rutgers University & NSF
Dan Reed, Microsoft Research
Rich Wolski, University of California, Santa Barbara
Liang-Jie Zhang, IBM Research

PROGRAM COMMITTEE:
David Abramson, Monash University, Australia
Roger Barga, Microsoft Research
John Bent, Los Alamos National Laboratory
Umit Catalyurek, Ohio State University
Abhishek Chandra, University of Minnesota
Rong N. Chang, IBM Research
Alok Choudhary, Northwestern University
Brian Cooper, Google
Ewa Deelman, University of Southern California
Murat Demirbas, University at Buffalo
Adriana Iamnitchi, University of South Florida
Maria Indrawan, Monash University, Australia
Alexandru Iosup, Delft University of Technology, Netherlands
Peter Kacsuk, Hungarian Academy of Sciences, Hungary
Dan Katz, University of Chicago
Steven Ko, University at Buffalo
Gregor von Laszewski, Rochester Institute of Technology
Erwin Laure, CERN, Switzerland
Ignacio Llorente, Universidad Complutense de Madrid, Spain
Reagan Moore, University of North Carolina
Lavanya Ramakrishnan, Lawrence Berkeley National Laboratory
Ian Taylor, Cardiff University, UK
Douglas Thain, University of Notre Dame
Bernard Traversat, Oracle
Yong Zhao, Univ of Electronic Science & Tech of China

 

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Tuesday, September 14, 2010 12:45:05 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings

I’m dragging myself off the floor as I write this having watched this short video: MongoDB is Web Scale.

 

It won’t improve your datacenter PUE, your servers won’t get more efficient, and it won’t help you scale your databases but, still, you just have to check out that video.

 

Thanks to Andrew Certain of Amazon for sending it my way.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Tuesday, September 14, 2010 5:55:41 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Ramblings
 Thursday, September 09, 2010

You can now run an EC2 instance 24x7 for under $15/month with the announcement last night of the Micro instance type. The Micro includes 613 MB of RAM and can burst for short periods up to 2 EC2 Compute Units (one EC2 Compute Unit equals a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor). They are available in all EC2-supported regions and in 32-bit or 64-bit flavors.

 

The design point for Micro instances is to offer a small amount of consistent, always available CPU resources while supporting short bursts above this level. The instance type is well suited for lower throughput applications and web sites that consume significant compute cycles but only periodically.

 

Micro instances are available for Windows or Linux, with Linux at $0.02/hour and Windows at $0.03/hour. You can quickly start a Micro instance using the EC2 management console: EC2 Console.
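
Here is a minimal sketch of launching one programmatically using the boto3 SDK (which postdates this post); the AMI ID is a placeholder:

```python
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

# Launch a single Micro instance. 't1.micro' is the API name for the
# instance type announced here; the AMI ID below is a placeholder.
instances = ec2.create_instances(
    ImageId="ami-xxxxxxxx",          # your Linux AMI of choice
    InstanceType="t1.micro",
    MinCount=1,
    MaxCount=1,
)
print(instances[0].id)
```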

 

The announcement: Announcing Micro Instances for Amazon EC2

More information: Amazon Elastic Compute Cloud (Amazon EC2)

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

Thursday, September 09, 2010 4:53:17 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Services
 Thursday, August 05, 2010

I’m taking some time off and probably won’t blog again until the first week of September. Jennifer and I are taking the boat north to Alaska. Most summers we spend a bit of time between the northern tip of Vancouver Island and the Alaska border. This year is a little different for 2 reasons. First, we’re heading further north than in the past and will spend some time in Glacier Bay National Park & Preserve. The second thing that makes this trip a bit different is, weather permitting, we’ll be making the nearly thousand-mile one-way trip as an offshore crossing. It’ll take roughly 5 days to cover the distance running 24x7 off the coast of British Columbia.

 

You might ask why we would want to make the trip running 24x7 offshore when the shoreline of BC is one of the most beautiful in the world. It truly is wonderful and we do love the area. We’ve even written a book about it (Secret Coast). We’re skipping the coast and heading directly to Alaska as a way to enjoy Alaska by boat when we really can’t get enough time off work to do Alaska at a more conventional, relaxed pace. The other reason to run directly there is it’s a chance to try running 24x7 and see how it goes. Think of it as an ocean crossing with training wheels. If it gets unpleasant, we can always turn right and head to BC. And, it’ll be an adventure.

 

We’ll be back the first week of September. Have a good rest of your summer,

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Thursday, August 05, 2010 11:50:16 AM (Pacific Standard Time, UTC-08:00)  #    Comments [5] - Trackback
Ramblings
 Sunday, August 01, 2010

A couple of weeks back Greg Linden sent me an interesting paper called Energy Proportional Datacenter Networks. The principle of energy proportionality was first coined by Luiz Barroso and Urs Hölzle in an excellent paper titled The Case for Energy-Proportional Computing. The core principle behind energy proportionality is that computing equipment should consume power in proportion to its utilization level. For example, a computing component that consumes N watts at full load should consume X/100*N watts when running at X% load. This may seem like an obviously important concept but, when the idea was first proposed back in 2007, it was not uncommon for a server running at 0% load to be consuming 80% of full load power. Even today, you can occasionally find servers that poor. The incredible difficulty of maintaining near 100% server utilization makes energy proportionality a particularly important concept.

                                                                                                                         

One of the wonderful aspects of our industry is, when an attribute becomes important, we focus on it. Power consumption as a whole has become important and, as a consequence, average full-load server power has been declining for the last couple of years. It is now easy to buy a commodity server under 150W. And, with energy proportionality now getting industry attention, this metric is also improving fast. Today a small percentage of the server market has just barely crossed the 50% power proportionality line, which is to say that, when idle, those servers consume less than 50% of full load power.
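
As a concrete sketch of what 50% power proportionality means for the numbers that follow (a simple linear model, not a measured power curve):

```python
def server_power(utilization, full_load_watts=150.0, idle_fraction=0.5):
    """Power draw of a server that idles at idle_fraction of full-load power
    and scales linearly with utilization above that (a simple linear model)."""
    return full_load_watts * (idle_fraction + (1.0 - idle_fraction) * utilization)

for u in (0.0, 0.15, 0.5, 1.0):
    print(f"{u:4.0%} utilization -> {server_power(u):5.1f} W")

# A fully power-proportional server would instead draw full_load_watts * u,
# i.e. 0 W at idle -- the (unreachable) ideal discussed in this post.
```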

 

This is great progress that we’re all happy to see. However, it’s very clear we are never going to achieve 100% power proportionality, so raising server utilization will remain the most important lever we have in reducing wasted power. This is one of the key advantages of cloud computing. When a large number of workloads are brought together with non-correlated utilization peaks and valleys, the overall peak-to-trough load ratio is dramatically flattened. Looked at simplistically, aggregating workloads with dissimilar peak resource consumption levels tends to reduce the peak-to-trough ratio. There need to be resources available for peak resource consumption, but the utilization level goes up and down with workload, so reducing the difference between peak and trough aggregate load levels cuts costs, saves energy, and is better for the environment.

 

Understanding the importance of power proportionality, it’s natural to be excited by the Energy Proportional Datacenter Networks paper. In this paper, the authors observe “if the system is 15% utilized (servers and network) and the servers are fully power-proportional, the network will consume nearly 50% of the overall power.” Without careful reading, this could lead one to believe that network power consumption was a significant industry problem and that immediate action at top priority was needed. But the statement has two problems. The first is the assumption that “full energy proportionality” is even possible. There will always be some overhead in having a server running. And we are currently so distant from this 100% proportional goal that any conclusion that follows from this assumption is unrealistic and potentially misleading.

 

The second issue is perhaps more important: the entire data center might be running at 15% utilization. 15% utilization means that all the energy (and capital) that went into the datacenter power distribution system, all the mechanical systems, and the servers themselves is only being 15% utilized. The power consumption is actually a tiny part of the problem. The real problem is that most of the resources in a nearly $100M investment are being wasted by low utilization. There are many poorly utilized data centers running at 15% or worse utilization, but I argue the solution to this problem is to increase utilization. Power proportionality is very important and many of us are working hard to find ways to improve it. But power proportionality won’t reduce the importance of ensuring that datacenter resources are near fully utilized.

 

Just as power proportionality will never be fully realized, 100% utilization will never be realized either so clearly we need to do both. However, it’s important to remember that the gain from increasing utilization by 10% is far more than the possible gain to be realized by improving power proportionality by 10%. Both are important but utilization is by far the strongest lever in reducing cost and the impact of computing on the environment.

 

Returning to the negative impact of networking gear on overall datacenter power consumption, let’s look more closely. It’s easy to get upset when you look at net gear power consumption. It is prodigious. For example, a Juniper EX8200 consumes nearly 10 kW. That’s roughly as much power consumption as an entire rack of servers (server rack powers range greatly, but 8 to 10kW is pretty common these days). A fully configured Cisco Nexus 7000 requires 8 full 5kW circuits to be provisioned. That’s 40kW, or roughly as much power provisioned to a single network device as would be required by 4 average racks of servers. These numbers are incredibly large individually, but it’s important to remember that there really isn’t all that much networking gear in a datacenter relative to the number of servers.

 

To get precise, let’s build a model of the power required in a large datacenter to understand the details of networking gear power consumption relative to the rest of the facility. In this model, we’re going to build a medium to large sized datacenter with 45,000 servers. Let’s assume these servers are operating at an industry-leading 50% power proportionality and consume only 150W each at full load. This facility has a PUE of 1.5, which is to say that for every watt of power delivered to IT equipment (servers, storage, and networking equipment), there is ½ watt lost in power distribution and powering mechanical systems. PUEs as low as 1.2 are possible but rare, and PUEs as poor as 3.0 are possible and unfortunately quite common. The industry average is currently 2.0, which says that in an average datacenter, for every watt delivered to the IT load, 1 watt is spent on overhead (power distribution, cooling, lighting, etc.).

 

For networking, we’ll first build a conventional, over-subscribed, hierarchical network. In this design, we’ll use Cisco 3560s as top of rack (TOR) switches. We’ll connect these TORs 2-way redundant to Juniper EX8216s at the aggregation layer. We’ll connect the aggregation layer to Cisco 6509Es in the core, where we need only 1 device for a network of this size but we use 2 for redundancy. We also have two border devices in the model:

 

Using these numbers as input we get the following:

 

·         Total DC Power:                                10.74MW (1.5 * IT load)

·         IT Load:                                                7.16MW (netgear + servers and storage)

o   Servers & Storage:          6.75MW (45,100 * 150W)

o   Networking gear:             0.41 MW (from table above)

 

This would have the networking gear consuming only 3.8% of the overall power consumed by the facility (0.41/10.74). If we were running at 0% utilization, which I truly hope is far worse than anyone’s worst case today, what networking consumption would we see? Using the numbers above with the servers at 50% power proportionality (unusually good), we would have the networking gear at 7.2% of overall facility power (0.41/((6.75*0.5+0.41)*1.5)).
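
The model above is easy to reproduce. A short sketch using the numbers in the text (the 0.41MW network figure is taken from the equipment table rather than recomputed here):

```python
servers = 45_000
server_full_load_w = 150.0
pue = 1.5
network_mw = 0.41                  # conventional hierarchical network (from the table)

def network_share(utilization, idle_fraction=0.5):
    """Fraction of total facility power consumed by network gear, assuming
    servers idle at idle_fraction of full load and scale linearly above that."""
    server_mw = servers * server_full_load_w * (
        idle_fraction + (1 - idle_fraction) * utilization) / 1e6
    total_mw = (server_mw + network_mw) * pue
    return network_mw / total_mw

print(f"Full load: {network_share(1.0):.1%}")   # ~3.8% of facility power
print(f"Idle:      {network_share(0.0):.1%}")   # ~7.2% of facility power
```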

 

This data argues strongly that networking is not the biggest problem or anywhere close. We like improvements wherever we can get them and so I’ll never walk away from a cost effective networking power solution. But, networking power consumption is not even close to our biggest problem so we should not get distracted.

 

What if we built a modern fat-tree network using commodity high-radix networking gear along the lines alluded to in Data Center Networks are in my Way and covered in detail in the VL2 and PortLand papers? Using 48-port network devices, we would need 1,875 switches in the first tier, 79 in the next, then 4, and 1 at the root. Let’s use 4 at the root to get some additional redundancy, which would put the switch count at 1,962 devices. Each network device dissipates roughly 150W, and driving each of the 48 transceivers requires roughly 5W (a rapidly declining number). This totals 390W per 48-port device. Any I’ve seen are better, but let’s use these numbers to ensure we are seeing the network in its worst likely form. Using these data we get:

 

·         Total DC Power:                                11.28MW

·         IT Load:                                                7.52MW

o   Servers & Storage:          6.75MW

o   Networking gear:             0.77MW

 

Even assuming very high network power dissipation rates, with 10GigE to every host in the entire datacenter and a constant bisection bandwidth network topology that requires a very large number of devices, we still only have the network at 6.8% of overall data center power. If we assume the very worst case available today, where we have 0% utilization with 50% power proportional servers, we get 12.4% of power consumed in the network (0.77/((6.75*0.5+0.77)*1.5)).
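
And the same exercise for the fat-tree variant, reproducing the switch counts and power share above (48-port devices, 24 ports down and 24 up per tier, 150W per device plus 5W per transceiver):

```python
import math

hosts = 45_000
ports = 48
down = ports // 2                        # 24 ports down, 24 up per device

# Simple folded-tree tier count that reproduces the post's numbers:
# each tier aggregates 24 links from the tier below.
tiers = []
links = hosts
while links > down:
    tiers.append(math.ceil(links / down))
    links = tiers[-1]
tiers.append(1)                          # a single root would suffice...
tiers[-1] = 4                            # ...but use 4 roots for redundancy
switches = sum(tiers)                    # 1875 + 79 + 4 + 4 = 1962

device_w = 150 + ports * 5               # 150W base + 5W per transceiver = 390W
network_mw = switches * device_w / 1e6   # ~0.77 MW of network gear
server_mw = hosts * 150 / 1e6            # 6.75 MW of servers at full load

print(f"{switches} switches, {network_mw:.2f} MW of network gear")
print(f"Full load: {network_mw / ((server_mw + network_mw) * 1.5):.1%}")
# The idle case prints ~12.3%; the text's 12.4% comes from rounding 0.77 MW.
print(f"Idle:      {network_mw / ((server_mw * 0.5 + network_mw) * 1.5):.1%}")
```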

 

12% is clearly enough to worry about but, in order for the network to be that high, we had to be running at 0% utilization, which is to say that all the resources in the entire data center are being wasted. 0% utilization means we are wasting 100% of the servers, all the power distribution, all the mechanical systems, the entire networking system, etc. At 0% utilization, it’s not the network that is the problem. It’s the server utilization that needs attention.

 

Note that in the above model, more than 60% of the power consumed by the networking devices went to the per-port transceivers. We used 5W/port for these, but overall transceiver power is expected to drop to 3W or less over the next couple of years, so we should expect a drop in network power consumption of 30 to 40% in the very near future.
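
To see how sensitive the per-device number is to transceiver power, here is a quick sweep under the same 150W-base, 48-port assumptions; how much of the projected savings materializes depends on how far below 3W transceivers actually land.

```python
# Per-device power as transceiver power drops from today's roughly 5W/port.
BASE_WATTS = 150.0
PORTS = 48
today = BASE_WATTS + PORTS * 5.0                     # 390W per device today
for transceiver_w in (4.0, 3.0, 2.0):
    device = BASE_WATTS + PORTS * transceiver_w
    print(f"{transceiver_w:.0f}W transceivers: {device:.0f}W/device, "
          f"{1 - device / today:.0%} below today")
```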

 

Summarizing the findings: my take from this data is that it’s a rare datacenter where more than 10% of power is going to networking devices, and most datacenters will be under 5%. Power proportionality in the network is of some value, but improving server utilization is a far more powerful lever. In fact, a good argument can be made to spend more on networking gear and networking power if you can use that investment to drive up server utilization by reducing constraints on workload placement.

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, August 01, 2010 9:44:07 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Hardware
 Tuesday, July 13, 2010

High Performance Computing (HPC) is defined by Wikipedia as:

 

High-performance computing (HPC) uses supercomputers and computer clusters to solve advanced computation problems. Today, computer systems approaching the teraflops-region are counted as HPC-computers. The term is most commonly associated with computing used for scientific research or computational science. A related term, high-performance technical computing (HPTC), generally refers to the engineering applications of cluster-based computing (such as computational fluid dynamics and the building and testing of virtual prototypes). Recently, HPC has come to be applied to business uses of cluster-based supercomputers, such as data warehouses, line-of-business (LOB) applications, and transaction processing.

 

Predictably, I use the broadest definition of HPC, including data-intensive computing and all forms of computational science. It still includes the old stalwart applications of weather modeling and weapons research, but the broader definition takes HPC from a niche market to being a big part of the future of server-side computing. Multi-thousand node clusters, operating at teraflop rates, running simulations over massive data sets: that is how petroleum exploration is done, it’s how advanced financial instruments are (partly) understood, it’s how brick and mortar retailers do shelf space layout and optimize their logistics chains, it’s how automobile manufacturers design safer cars through crash simulation, it’s how semiconductor designs are simulated, it’s how aircraft engines are engineered to be more fuel efficient, and it’s how credit card companies measure fraud risk. Today, at the core of any well-run business is a massive data store; they all have that. The measure of a truly advanced company is the depth of analysis, simulation, and modeling run against this data store. HPC workloads are incredibly important today, and the market segment is growing very quickly, driven by the plunging cost of computing and the business value of understanding large data sets deeply.

High Performance Computing is one of those important workloads that many argue can’t move to the cloud. Interestingly, HPC has had a long history of supposedly not being able to make a transition and then, subsequently, making that transition faster than even the most optimistic would have guessed possible. In the early days of HPC, most of the workloads were run on supercomputers. These are purpose-built, scale-up servers made famous by Control Data Corporation and later by Cray Research, whose Cray 1 was broadly covered in the popular press. At that time, many argued that slow processors and poor-performing interconnects would prevent computational clusters from ever being relevant for these workloads. Today more than ¾ of the fastest HPC systems in the world are based upon commodity compute clusters.

 

The HPC community uses the Top-500 list as the tracking mechanism for the fastest systems in the world. The goal of the Top-500 is to provide a scale and performance metric for a given HPC system. Like all benchmarks, it is a good thing in that it removes some of the manufacturer hype but benchmarks always fail to fully characterize all workloads. They abstract performance to a single or small set of metrics which is useful but this summary data can’t faithfully represent all possible workloads. Nonetheless, in many communities including HPC and Relational Database Management Systems, benchmarks have become quite important. The HPC world uses the Top-500 list which depends upon LINPACK as the benchmark.

 

Looking at the most recent Top-500 list published in June 2010, we see that Intel processors now dominate the list with 81.6% of the entries. It is very clear that the HPC move to commodity clusters has happened. The move that “couldn’t happen” is near complete and the vast majority of very high scale HPC systems are now based upon commodity processors.

 

What about HPC in the cloud, the next “it can’t happen” for HPC? In many respects, HPC workloads are a natural for the cloud in that they are incredibly high scale and consume vast machine resources. Some HPC workloads are incredibly spiky with mammoth clusters being needed for only short periods of time. For example semiconductor design simulation workloads are incredibly computationally intensive and need to be run at high-scale but only during some phases of the design cycle. Having more resources to throw at the problem can get a design completed more quickly and possibly allow just one more verification run to potentially save millions by avoiding a design flaw.  Using cloud resources, this massive fleet of servers can change size over the course of the project or be freed up when they are no longer productively needed. Cloud computing is ideal for these workloads. 

 

Other HPC uses tend to be more steady state and yet these workloads still gain real economic advantage from the economies of extreme scale available in the cloud. See Cloud Computing Economies of Scale (talk, video) for more detail.

 

When I dig deeper into “steady state HPC workloads”, I often learn they are steady state as an existing constraint rather than by the fundamental nature of the work. Is there ever value in running one more simulation or one more modeling run a day? If someone on the team got a good idea or had a new approach to the problem, would it be worth being able to test that theory on real data without interrupting the production runs? More resources, if not accompanied by additional capital expense or long term utilization commitment, are often valuable even for what we typically call steady state workloads. For example, I’m guessing BP, as it battles the Gulf of Mexico oil spill, is running more oil well simulations and tidal flow analysis jobs than originally called for in their 2010 server capacity plan.

 

No workload is flat and unchanging. It’s just a product of a highly constrained model that can’t adapt quickly to changing workload quantities. It’s a model from the past.

 

There is no question there is value to being able to run HPC workloads in the cloud. What makes many folks view HPC as non-cloud-hostable is that these workloads need high-performance, direct access to the underlying server hardware without the overhead of the virtualization common in most cloud computing offerings, and many of these applications need very high-bandwidth, low-latency networking. A big step towards this goal was made earlier today when Amazon Web Services announced the EC2 Cluster Compute Instance type.

 

The cc1.4xlarge instance specification:

·         23GB of 1333MHz DDR3 Registered ECC

·         64GB/s main memory bandwidth

·         2 x Intel Xeon X5570   (quad-core Nehalem)

·         2 x 845GB 7200RPM HDDs

·         10Gbps Ethernet Network Interface

 

It’s this last point that I’m particularly excited about. The difference between just a bunch of servers in the cloud and a high performance cluster is the network. Bringing 10GigE direct to the host isn’t that common in the cloud, but it’s not particularly remarkable. What is more noteworthy is that it is a full bisection bandwidth network within the cluster. It is common industry practice to statistically multiplex network traffic over an expensive network core with far less than full bisection bandwidth. Essentially, a gamble is made that not all servers in the cluster will transmit at full interface speed at the same time. For many workloads this actually is a good bet and one that can be safely made. For HPC workloads and other data-intensive applications like Hadoop, it’s a poor assumption and leads to vast wasted compute resources waiting on a poorly performing network.

 

Why provide less than full bisection bandwidth? Basically, it’s a cost problem. Because networking gear is still built on a mainframe design point, it’s incredibly expensive. As a consequence, these precious resources need to be very carefully managed, and over-subscription levels of 60 to 1 or even over 100 to 1 are common. See Datacenter Networks are in my Way for more on this theme.
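
To make oversubscription concrete, here is a tiny calculation for a hypothetical rack; the numbers are illustrative only and don’t describe any particular network discussed above. A full bisection bandwidth fabric is the 1:1 case: every host can run its interface at full speed regardless of what the rest of the cluster is doing.

```python
# Effective off-rack bandwidth for a hypothetical oversubscribed rack:
# 40 hosts with 10Gbps NICs behind 2 x 10Gbps uplinks (illustrative numbers).
HOSTS_PER_RACK = 40
HOST_GBPS = 10.0
UPLINK_GBPS = 2 * 10.0

oversubscription = (HOSTS_PER_RACK * HOST_GBPS) / UPLINK_GBPS   # 20:1
worst_case_per_host = UPLINK_GBPS / HOSTS_PER_RACK              # 0.5 Gbps

print(f"Oversubscription ratio: {oversubscription:.0f}:1")
print(f"Per-host bandwidth if every host sends off-rack at once: "
      f"{worst_case_per_host:.2f} Gbps")
```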

 

For me, the most interesting aspect of the newly announced Cluster Compute instance type is not the instance at all. It’s the network. These servers are on a full bisection bandwidth cluster network. All hosts in a cluster can communicate with other nodes in the cluster at the full capacity of the 10Gbps fabric at the same time without blocking. Clearly not all can communicate with a single member of the fleet at the same time but the network can support all members of the cluster communicating at full bandwidth in unison. It’s a sweet network and it’s the network that makes this a truly interesting HPC solution.

 

Each Cluster Compute Instance is $1.60 per instance hour. It’s now possible to access millions of dollars of servers connected by a high performance, full bisection bandwidth network inexpensively. An hour with a 1,000 node high performance cluster for $1,600. Amazing.

 

As a test of the instance type and network prior to going into beta, Matt Klein, one of the HPC team engineers, cranked up LINPACK using an 880-server sub-cluster. It’s a good test in that it stresses the network and yields a comparative performance metric. I’m not sure what Matt expected when he started the run, but the result he got just about knocked me off my chair when he sent it to me last Sunday. Matt’s experiment yielded a booming 41.82 TFlop Top-500 run.

 

For those of you as excited as I am and interested in the details from the Top-500 LINPACK run:

This is phenomenal performance for a pay-as-you-go EC2 instance. But what makes it much more impressive is that this result would place the EC2 Cluster Compute instance at #146 on the Top-500. It also appears to scale well, which is to say bigger numbers look feasible if more nodes were allocated to LINPACK testing. As fun as that would be, it is time to turn all these servers over to customers, so we won’t get another run, but it was fun while it lasted.
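
As a rough back-of-envelope on the efficiency of that run, the sketch below assumes the X5570’s 2.93GHz clock and 4 double-precision flops per core per cycle; those peak figures are my assumptions, not data published with the run.

```python
# Back-of-envelope HPL efficiency estimate for the 880-node run (assumed peak figures).
NODES = 880
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 4
GHZ = 2.93                      # assumed X5570 clock
FLOPS_PER_CORE_PER_CYCLE = 4    # assumed double-precision rate for Nehalem

peak_tflops = (NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
               * GHZ * FLOPS_PER_CORE_PER_CYCLE) / 1000.0
measured_tflops = 41.82

print(f"Estimated theoretical peak: {peak_tflops:.1f} TFlops")          # ~82.5 TFlops
print(f"Implied HPL efficiency: {measured_tflops / peak_tflops:.0%}")   # ~51%
```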

 

You can now have one of the biggest supercomputers in the world for your own private use for $1.60 per instance per hour. I love what’s possible these days.

 

Welcome to the cloud, HPC!

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Tuesday, July 13, 2010 4:34:30 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
Services
 Friday, July 09, 2010

Hierarchical storage management (HSM), also called tiered storage management, is back, but in a different form. HSM exploits the access pattern skew across data sets by placing cold, seldom-accessed data on slow, cheap media and frequently accessed data on fast, near media. In the old days, HSM typically referred to systems mixing robotically managed tape libraries with hard disk drive staging areas. HSM never actually went away; it’s just a very old technique to exploit data access pattern skew to reduce storage costs. Here’s an old unit from FermiLab.

 

Hot data, or data currently being accessed, is stored on disk, and old data that has not been recently accessed is stored on tape. It’s a great way to drive costs down below an all-disk solution while avoiding the people costs of tape library management and (slightly) reducing the latency of accessing data on tape.

 

The basic concept shows up anywhere there is data access locality or skew in the access patterns, where some data is rarely accessed and some data is frequently accessed. Since evenly distributed, non-skewed access patterns only show up in benchmarks, this concept works on a wide variety of workloads. Processors have cache hierarchies where the top of the hierarchy is very expensive register files and there are multiple layers of increasingly large caches between the register file and memory. Database management systems have large in-memory buffer pools insulating access to slow disk. Many very high scale services like Facebook have mammoth in-memory caches insulating access to slow database systems. In the Facebook example, they have 2TB of Memcached in front of their vast MySQL fleet: Velocity 2010.

 

Flash memory again opens up the opportunity to apply HSM concepts to storage. Rather than using slow tape and fast disk, we use (relatively) slow disk and fast NAND flash. There are many approaches to implementing HSM over flash memory and hard disk drives. Facebook implemented Flashcache, which tracks access patterns at the logical volume layer (below the filesystem) in Linux, with hot pages written to flash and cold pages to disk. LSI’s CacheCade product is a good example of an implementation done at the disk controller level. Others have done it in application-specific logic, where hot indexing structures are put on flash and cold data pages are written to disk. Yet another approach, which showed up around 3 years ago, is the hybrid disk drive.
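
To make the pattern concrete, here is a toy sketch of the flash-over-disk idea: a small, LRU-managed fast tier standing in for flash in front of an unbounded slow tier standing in for disk. It illustrates the concept only; it is not how Flashcache, CacheCade, or any of the products above are actually implemented, and real implementations add far more sophisticated admission, write-back, and wear-leveling policies.

```python
from collections import OrderedDict

class TieredStore:
    """Toy HSM: keep the hottest blocks in a small fast tier ("flash"),
    demoting least-recently-used blocks to the slow tier ("disk")."""

    def __init__(self, fast_capacity):
        self.fast = OrderedDict()   # block_id -> data, maintained in LRU order
        self.slow = {}              # unbounded slow tier
        self.fast_capacity = fast_capacity

    def read(self, block_id):
        if block_id in self.fast:            # hot hit: served from the fast tier
            self.fast.move_to_end(block_id)
            return self.fast[block_id]
        data = self.slow[block_id]           # cold miss: served from the slow tier...
        self._promote(block_id, data)        # ...then promoted to the fast tier
        return data

    def write(self, block_id, data):
        self._promote(block_id, data)        # new writes land in the fast tier

    def _promote(self, block_id, data):
        self.fast[block_id] = data
        self.fast.move_to_end(block_id)
        while len(self.fast) > self.fast_capacity:
            cold_id, cold_data = self.fast.popitem(last=False)
            self.slow[cold_id] = cold_data   # demote the LRU block to the slow tier
```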

 

Hybrid drives combine large, persistent flash memory caches with disk drives in a single package. When they were first announced, I was excited by them, but I got over the excitement once I started benchmarking. It looked like a good idea, but the performance really was unimpressive.

 

Hybrid drives still look like a good idea, but now we actually have respectable performance with the Seagate Momentus XT. This part is actually aimed at the laptop market, but I’m always watching client progress to understand what can be applied to servers. This finally looks like it’s heading in the right direction. See the AnandTech article on this drive for more performance data: Seagate’s Momentus XT Reviewed, Finally a Good Hybrid HDD. I still slightly prefer the Facebook Flashcache approach, but these hybrid drives are worth watching.

 

Thanks to Greg Linden for sending this one my way.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Friday, July 09, 2010 8:30:14 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Hardware
 Saturday, July 03, 2010

I didn’t attend the Hadoop Summit this year or last but was at the inaugural event back in 2008 and it was excellent. This year, Hadoop Summit 2010 was held June 29, again in Santa Clara. The agenda for the 2010 event is at: Hadoop Summit 2010 Agenda. Since I wasn’t able to be there, Adam Gray of the Amazon AWS team was kind enough to pass on his notes and let me use them here:

 

Key Takeaways

·         Yahoo and Facebook operate the world’s largest Hadoop clusters, at 4,000 and 2,300 nodes with 70 and 40 petabytes respectively. They run full cluster replicas to ensure availability and data durability. 

·         Yahoo released Hadoop security features with Kerberos integration, which is most useful for long-running, multitenant Hadoop clusters. 

·         Cloudera released a paid enterprise version of Hadoop with cluster management tools and several database connectors, and announced support for Hadoop security.

·         Amazon Elastic MapReduce announced expand/shrink cluster functionality and paid support.

·         Many Hadoop users use the service in conjunction with NoSQL DBs like HBase or Cassandra.

 

Keynotes

Yahoo had the opening keynote with talks by Blake Irving, Chief Products Officer, Shelton Shugar, SVP of Cloud Computing, and Eric Baldeschwieler, VP of Hadoop. They talked about Yahoo’s scale, including 38k Hadoop servers, 70 PB of storage, and more than 1 million monthly jobs, with half of those jobs written in Apache Pig. Further, their agility is improving significantly despite this massive scale: within 7 minutes of a homepage click, they have a completely reconfigured preference model for that user and an updated homepage. This would not be possible without Hadoop. Yahoo believes that Hadoop is ready for enterprise use at massive scale and that their use case proves it. Further, a recent study found that 50% of enterprise companies are strongly considering Hadoop, with the most commonly cited reason being agility. Initiatives over the last year include: further investment and improvement in Hadoop 0.20, integration of Hadoop with Kerberos, and the Oozie workflow engine.

 

Next, Peter Sirota gave a keynote for Amazon Elastic MapReduce that focused on how the service makes combining the massive scalability of MapReduce with the web-scale infrastructure of AWS more accessible, particularly to enterprise customers. He also announced several new features including expanding and shrinking the cluster size of running job flows, support for spot instances, and premium support for Elastic MapReduce. Further, he discussed Elastic MapReduce’s involvement in the ecosystem including integration with Karmasphere and Datameer. Finally, Scott Capiello, Senior Director of Products at Microstrategy, came on stage to discuss their integration with Elastic MapReduce.

 

Cloudera followed with talks by Doug Cutting, the creator of Hadoop, and Charles Zedlewski, Senior Director of Product Management. They announced Cloudera Enterprise, a version of their software that includes production support and additional management tools. These tools include improved data integration and authorization management that leverages Hadoop’s security updates. They also demoed a web UI for these management tools.

 

The final keynote was given by Mike Schroepfer, VP of Engineering at Facebook. He talked about Facebook’s scale, with 36 PB of uncompressed storage, 2,250 machines with 23k processors, and 80-90 TB of growth per day. Their biggest challenge is in getting all that data into Hadoop clusters. Once the data is there, 95% of their jobs are Hive-based. In order to ensure reliability, they replicate critical clusters in their entirety. As far as traffic goes, the average user spends more time on Facebook than on the next 6 web pages combined. In order to improve user experience, Facebook is continually improving the response time of their Hadoop jobs. Currently updates can occur within 5 minutes; however, they see this eventually moving below 1 minute. As this is often an acceptable wait time for changes to occur on a webpage, this will open up a whole new class of applications.

 

Discussion Tracks

After lunch the conference broke into three distinct discussion tracks: Developers, Applications, and Research. These tracks had several interesting talks including one by Jay Kreps, Principal Engineer at LinkedIn, who discussed LinkedIn’s data applications and infrastructure. He believes that their business data is ideally suited for Hadoop due to its massive scale but relatively static nature. This supports large amounts of  computation being done offline. Further, he talked about their use of machine learning to predict relationships between users. This requires scoring 120 billion relationships each day using 82 Hadoop jobs. Lastly, he talked about LinkedIn’s in-house developed workflow management tool, Azkaban, an alternative to Oozie.

 

Eric Sammer, Solutions Architect at Cloudera, discussed some best practices for dealing with massive data in Hadoop. Particularly, he discussed the value of using workflows for complex jobs, incremental merges to reduce data transfer, and the use of Sqoop (SQL to Hadoop) for bulk relational database imports and exports. Yahoo’s Amit Phadke discussed using Hadoop to optimize online content. His recommendations included leveraging Pig to abstract out the details of MapReduce for complex jobs and taking advantage of the parallelism of HBase for storage. There was also significant interest in the challenges of using Hadoop for graph algorithms including a talk that was so full that they were unable to let additional people in.

 

Elastic MapReduce Customer Panel

The final session was a customer panel of current Amazon Elastic MapReduce users chaired by Deepak Singh. Participants included: Netflix,  Razorfish, Coldlight Solutions, and Spiral Genetics. Highlights include:

 

·         Razorfish discussed a case study in which a combination of Elastic MapReduce and Cascading allowed them to take a customer to market in half the time with a 500% return on ad spend. They discussed how using EMR has given them much better visibility into their costs, allowing them to pass this transparency on to customers.

 

·         Netflix discussed their use of Elastic MapReduce to set up a Hive-based data warehousing infrastructure. They keep a persistent cluster with data backups in S3 to ensure durability. Further, they also reduce the amount of data transfer through pre-aggregation and preprocessing of data.

 

·         Spiral Genetics talked about how they had to leverage AWS to reduce capital expenditures. By using Amazon Elastic MapReduce they were able to set up a running job in 3 hours. They are also excited to see spot instance integration.

 

·         Coldlight Solutions said that buying $1/2M in infrastructure wasn’t even an option when they started. Now it is, but they would rather focus on their strength: machine learning. Amazon Elastic MapReduce allows them to do that.

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Saturday, July 03, 2010 5:39:36 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Thursday, July 01, 2010

I did a talk at Velocity 2010 last week. The slides are posted at Datacenter Infrastructure Innovation and the video is available at Velocity 2010 Keynote. Urs Hölzle, Google Senior VP of Infrastructure, also did a Velocity keynote. It was an excellent talk and is posted at Urs Holzle at Velocity 2010. Jonathan Heiliger, Facebook VP of Technical Operations, spoke at Velocity as well. A talk summary is up at: Managing Epic Growth in Real Time. Tim O’Reilly did a talk: O’Reilly Radar. Velocity really is a great conference.

 

Last week I posted two quick notes on Facebook: Facebook Software Use and 60,000 Servers at Facebook. Continuing on that theme, here are a few other Facebook data points that I have been collecting of late:

 

From QCon 2010 in Beijing (April 2010): memcached@facebook:

·         How big is Facebook:

o   400m active users

o   60m status updates per day

o   3b photo uploads per month

o   5b pieces of content shared each week

o   50b friend graph edges

§  130 friends per user on average

o   Each user clicks on 9 pieces of content each month

·         Thousands of servers in two regions [jrh: 60,000]

·         Memcached scale:

o   400m gets/second

o   28m sets/second

o   2T cached items

o   Over 200 TB

o   Networking scale:

§  Peak rx: 530m pkts/second (60GB/s)

§  Peak tx: 500m pkts/second (120GB/s)

·         Each memcached server:

o   Rx: 90k pkts/sec (9.7MB/s)

o   Tx 94k pkts/sec (19 MB/s)

o   80k gets/second

o   2k sets/s

o   200m items

·         Phatty Phatty Multiget

o   PHP is single-threaded and synchronous, so requests need to fetch multiple objects in a single round trip to be efficient and fast (see the sketch after this list)

·         Cache segregation:

o   Different objects have different lifetimes, so separate them out

·         Incast problem:

o   The use of multiget increased performance but led to an incast problem

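The multiget point is worth a quick sketch. With a synchronous, single-threaded client, N individual gets cost N network round trips, while one multiget amortizes that to a single round trip. The example below uses Python’s pymemcache client purely as an illustration (the key names and server address are made up); Facebook’s PHP client follows the same pattern.

```python
from pymemcache.client.base import Client

# Hypothetical local memcached instance, for illustration only.
client = Client(("localhost", 11211))

keys = [f"user:{i}:profile" for i in range(100)]

# Naive approach: 100 synchronous round trips; the latencies add up
# in a single-threaded, synchronous runtime.
profiles = {k: client.get(k) for k in keys}

# Multiget: one request/response pair carries all 100 keys.
profiles = client.get_many(keys)
```

The flip side, noted in the incast bullet above, is that a large multiget fans out to many memcached servers whose responses all converge on the requester at once.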
 

The talk is full of good data and worth a read.

 

From Hadoopblog, Facebook has the world’s Largest Hadoop Cluster:

  • 21 PB of storage in a single HDFS cluster
  • 2000 machines
  • 12 TB per machine (a few machines have 24 TB each)
  • 1200 machines with 8 cores each + 800 machines with 16 cores each
  • 32 GB of RAM per machine
  • 15 map-reduce tasks per machine

The Yahoo Hadoop cluster is reported to have twice the node count of the Facebook cluster at 4,000 nodes: Scaling Hadoop to 4000 nodes at Yahoo!. But it does have less disk:

·         4000 nodes

·         2 quad core Xeons @ 2.5ghz per node

·         4x1TB SATA disks per node

·         8G RAM per node

·         1 gigabit ethernet on each node

·         40 nodes per rack

·         4 gigabit ethernet uplinks from each rack to the core

·         Red Hat Enterprise Linux AS release 4 (Nahant Update 5)

·         Sun Java JDK 1.6.0_05-b13

·         Over 30,000 cores with nearly 16PB of raw disk!

 

http://www.bluedavy.com/iarch/facebook/facebook_architecture.pdf

http://www.qconbeijing.com/download/marc-facebook.pdf

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Thursday, July 01, 2010 7:36:43 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Services

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.
