Monday, November 30, 2009

Very low-power scale-out servers -- it’s an idea whose time has come. A few weeks ago Intel announced it was doing Microslice servers: Intel Seeks new ‘microserver’ standard. Rackable Systems (I may never manage to start calling them ‘SGI’ – remember the old MIPS-based workstation company?) was on this idea even earlier: Microslice Servers. The Dell Data Center Solutions team has been on a similar path: Server under 30W.

 

Rackable has been talking about very low power servers as physicalization: When less is more: the basics of physicalization. Essentially they are arguing that rather than buying, more-expensive scale-up servers and then virtualizing the workload onto those fewer servers, buy many smaller servers. This saves the virtualization tax which can run 15% to 50% in I/O intensive applications and smaller and low-scale servers can produce more work done per joule and better work done per dollar. I’ve been a believer in this approach for years and wrote it up for the Conference on Innovative Data Research last year in The Case for Low-Cost, Low-Power Servers.

 

I’ve recently been very interested in the application of ARM processors to web-server workloads:

·         Linux/Apache on ARM Processors

·         ARM Cortex-A9 SMP Design Announced

 

ARMs are an even more radical application of the Microslice approach.

 

Scale-down servers easily win on many workloads when looking at work done per dollar and work done per joule and I claim, if you are looking at single dimensional metrics, like performance, you aren’t looking hard enough. However, there are workloads where scale-up wins. They are absolutely required when the workload won’t partition and scale near linearly. Database workloads are classic examples of partition-resistant workloads that really do often run better on more-expensive, scale-up servers.

 

The other limit is administration. Non-automated IT shops believe they are better off with fewer, more-expensive servers although they often achieve this goal by running many operating system images on a single server.  Given that the bulk of administration is spent on the software stack, it’s not clear that this approach of running the same number of O/S images and software stacks on a single server is a substantial savings. However, I do agree that administration costs are important at low-scale. If, at high-scale, admin costs are over 10% of overall operational costs, go fix it rather than buying bigger, more expensive servers.

 

When do scale-up servers win economically? 1) very low-scale workloads where administration costs dominate, and 2) workloads that partition poorly and suffer highly-sub-linear scale-out.  Simple web workloads and other partition-tolerant applications should look to scale-down severs. Make sure your admin costs are sub-10% and don’t scale with server count. Then use work done per dollar and work done per joule and you’ll be amazed to see scale-down gets more done at lower cost and lower power consumption.

 

2010 is the year of the low-cost, scale-down server.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Monday, November 30, 2009 7:04:17 AM (Pacific Standard Time, UTC-08:00)  #    Comments [10] - Trackback
Hardware
 Sunday, November 22, 2009

Sometime back I whined that Power Usage Efficiency (PUE) is a seriously abused term: PUE and Total Power Usage Efficiency.  But I continue to use it because it gives us a rough way to compare the efficiency of different data centers.  It’s a simple metric that takes the total power delivered to a facility (total power) and divides it by the amount of power delivered to the servers (critical power or IT load).  A PUE of 1.35 is very good today. Some datacenter owners have claimed to be as good as 1.2.  Conventionally designed data centers operated conservatively are in the 1.6 to 1.7 range.  Unfortunately most of the industry has a PUE of over 2.0, some are as bad as 3.0, and the EPA reports the industry average is 2.0 (Report to Congress on Server Data Center Efficiency). A PUE of 2.0 means that for each watt delivered to the IT load (servers, net gear, and storage), one watt is lost in cooling and in power distribution.

 

Whenever a metric becomes important, managers ask about it and marketing people use it.  Eventually we start seeing data points that are impossibly good. The recent Red Sky installation is one of these events. Sandia National Lab’s Red Sky supercomputer is reported to be delivering a PUE of 1.035 in a system without waste heat recovery. In Red Sky at Night, Sandia’s New Computer Might it is reported “The power usage effectiveness of Red Sky is an almost unheard-of 1.035”. The video referenced below also reports Red Sky at a 1.035 PUE. in response to the claimed PUE of 1.035, Rich Miller of Data Center Knowledge astutely asked “How’s this possible?” (see Red Sky: Supercomputing and Efficiency Meet).  

 

The data center knowledge article links to a blog posting Building Red Sky by Marc Hamilton which includes a wonderful time lapse video showing the building of Red Sky: http://www.youtube.com/watch?v=mNW9cYY4tqc. You should watch the 4 min and 51 second video and I’ll include my notes and observations from the video below. But, before we get to the video, let’s look more closely at the widely reported 1.035 PUE and what it would mean.

 

A PUE of 1.035 implies that for each 1 watt delivered to the servers, 0.035 is lost in power distribution and mechanical systems. For a facility of this size, I suspect they will get delivered high voltage in the 115kV range. In a conventional power distribution design, they will take 115kV and transform it to mid-voltage (13kV range), then to 480V 3p, then to 208V to be delivered to the servers. In addition to all these conversions, there is some loss in the conductors themselves. And there is considerable loss in even the very best uninterruptable power supply (UPS) systems.  In fact, a UPS alone with 3.5% loss is excellent. Excellent power distribution designs will avoid 1 or perhaps 2 of the conversions above and will use a full bypass UPS. But, getting these excellent power distribution designs to even within a factor of 2 of the reported 3.5% loss is incredibly difficult and I’m very skeptical that they are going to get much below 6% to 7%. In fact, if anyone knows how to get down below 6% loss in the power distribution system measured fully, I’m super interested and would love to see what you have done, buy you lunch, and do a datacenter a tour.

 

A 6% loss in power distribution would limit the PUE to nothing lower than 1.06. But, we still have the cooling system to account for. Air is an expensive fluid to move long distances. Consequently, Red Sky brings the water to the server racks using Sun Cooling Door Systems (similar to the IBM iDataPlex Rear Door Cooling system).

 

The Sun Cooling Door System is a nice designs that will significantly improve PUE over more conventional CRAC-based mechanical designs. Generally, bringing water close to the heat load in systems that use water (rather than aggressive free-air only designs) is a good approach. The Sun advertising material credibly reports that “A highly efficient datacenter utilizing a holistic design for closely coupled cooling using Sun Cooling Door Systems can reach a PUE of 1.3”.

 

I know of no way to circulate air through a heat exchanger, pump water to the outside of the building, and then cool the water using any of the many technologies available that can be done at only a 3.5% loss.  Which is to say that a PUE of 1.035 can’t be done with the Red Sky mechanical system design even if power distribution losses were ignored completely. I like Red Sky but suspect we’re looking at a 1.35 PUE system rather than the reported 1.035.  But, that’s OK, 1.35 is quite good and, for a top 10 super computer, it’s GREAT.   

 

Note that a PUE of 1.035 is technically possible with waste heat recovery and, in fact, even less than 1.0 can be achieved with waste heat recovery. See the “PUE less than 1.0” section of PUE and Total Power Usage Efficiency for more data on waste heat recovery.  Remember this is “technically possible” rather than achieved in production today. It’s certainly possible to do today but doing it cost effectively is the challenge.  I have seen it applied to related domains that also have large quantities of low grade heat. For example, a city in Norway is experimenting with waste heat recovery from Sewage: Flush the loo, warm your house.

 

My notes from the Red Sky Video follow:

·         47,232 cores of Intel EM64T Xeon X55xx (Nehalem-EP) 2930 MHz (11.72 GFlops)

o   553 Teraflops

·         Infiniband QDR interconnect

o   1,440 cables totally 9.1 miles

·         Operating System: CentOS

·         Main Memory: 22,104 GB

·         266 VA [jrh: this is clearly incorrect unless they are talking about each server]

o   Each reach is 32kW

·         96 JBOD enclosures

o   2,304 1TB disks

·         12 GB RAM/note & 70TB total

·         PUE 1.035 [jrh: I strongly suspect they meant 1.35]

·         328 tons cooling

·         7.3million gallons of water per year

 

The video is worth watching although if you play with cross referencing the numbers above, there appear to be many mistakes: Red Sky time Lapse.  Thanks to Jeff Bar for sending this one my way.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, November 22, 2009 8:30:19 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Hardware
 Friday, November 20, 2009

I’m on the program committee for the ACM Symposium on Cloud Computing. The conference will be held June 10th and 11th 2010 in Indianapolis Indiana. SOCC brings together database and operating systems researchers and practitioners interested in cloud computing. It is jointly sponsored by the ACM Special Interest Group on Management of Data (SIGMOD) and the ACM Special Interest Group on Operating Systems (SIGOPS). The conference will be held in conjunction with ACM SIGMOD in 2010 and with SOSP in 2011 continuing to alternate between SIGMOD and SOSP in subsequent years.

 

Joe Hellerstein is the SOCC General Chair and Surajit Chaudhuri and Mendel Rosenblum are Program Chairs. The rest of the SOCC organizers are at: http://research.microsoft.com/en-us/um/redmond/events/socc2010/organizers.htm.  If you are interested in cloud computing in general and especially if you are interested in systems or database issues and their application to cloud computing, consider submitting a paper (copied below). The paper submission deadline for SOCC is January 15, 2010. Get writing!

 

                                                                --jrh

 

The ACM Symposium on Cloud Computing 2010 (ACM SOCC 2010) is the first in a new series of symposia with the aim of bringing together researchers, developers, users, and practitioners interested in cloud computing. This series is co-sponsored by the ACM Special Interest Groups on Management of Data (ACM SIGMOD) and on Operating Systems (ACM SIGOPS). ACM SOCC will be held in conjunction with ACM SIGMOD and ACM SOSP Conferences in alternate years, starting with ACM SIGMOD in 2010.

The scope of SOCC Symposia will be broad and will encompass diverse systems topics such as software as a service, virtualization, and scalable cloud data services. Many facets of systems and data management issues will need to be revisited in the context of cloud computing. Suggested topics for paper submissions include but are not limited to:

 

  Administration and Manageability

  Data Privacy

  Data Services Architectures

  Distributed and Parallel Query Processing

  Energy Management

  Geographic Distribution

  Grid Computing

  High Availability and Reliability

  Infrastructure Technologies

  Large Scale Cloud Applications

  Multi-tenancy

  Provisioning and Metering

  Resource management and Performance

  Scientific Data Management

  Security of Services

  Service Level Agreements

  Storage Architectures

  Transactional Models

  Virtualization Technologies

 

Organizers


General Chair:
Joseph M. Hellerstein, U. C. Berkeley

Program Chairs:
Surajit Chaudhuri, Microsoft Research
Mendel Rosenblum, Stanford University

Treasurer:
Brian Cooper, Yahoo! Research

Publicity Chair:
Aman Kansal, Microsoft Research

Steering Committee
Phil Bernstein, Microsoft Research
Ken Birman, Cornell University
Joseph M. Hellerstein, U. C. Berkeley
John Ousterhout, Stanford University
Raghu Ramakrishnan, Yahoo! Research
Doug Terry, Microsoft Research
John Wilkes, Google

Technical Program Committee:
Anastasia Ailamaki, EPFL
Brian Bershad, Google
Michael Carey, UC Irvine
Felipe Cabrera, Amazon
Jeff Chase, Duke
Dilma M da Silva, IBM
David Dewitt, Microsoft
Shel Finkelstein, SAP
Armando Fox, UC Berkeley
Tal Garfinkel, Stanford
Alon Halevy, Google
James Hamilton, Amazon
Jeff Hammerbacher, Cloudera
Joe Hellerstein, UC Berkeley
Alfons Kemper, Technische Universität München
Donald Kossman, ETH
Orran Krieger, Vmware
Jeffrey Naughton, University of Wisconsin, Madison
Hamid Pirahesh, IBM
Raghu Ramakrishnan, Yahoo!
Krithi Ramamritham, Indian Institute of Technology, Bombay
Donovan Schneider, Salesforce.com
Andy Warfield, University of British Columbia
Hakim Weatherspoon, Cornell

Paper Submission

Authors are invited to submit original papers that are not being considered for publication in any other forum. Manuscripts should be submitted in PDF format and formatted using the ACM camera-ready templates available at http://www.acm.org/sigs/pubs/proceed/template.html. See the Paper Submission page for details on the submission procedure.

A submission to the symposium may be one of the following three types:
(a) Research papers: We seek papers on original research work in the broad area of cloud computing. The length of research papers is limited to twelve pages.
(b) Industrial papers: The symposium will also be a forum for high quality industrial presentations on innovative cloud computing platforms, applications and experiences on deployed systems. Submissions for industrial presentations can either be an extended abstract (1-2 pages) or an industrial paper up to 6 pages long.
(c) Position papers: The purpose of a position paper is to expose a new problem or advocate a new approach to an old idea. Participants will be invited based on the submission's originality, technical merit, topical relevance, and likelihood of leading to insightful technical discussions at the symposium. A position paper can be no more than 6 pages long.

Important Dates

Paper Submission: Jan 15, 2010 (11:59pm, PST)
Notification: Feb 22, 2010
Camera-Ready: Mar 22, 2010

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Friday, November 20, 2009 6:24:30 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Saturday, November 14, 2009

HPTS has always been one of my favorite workshops over the years. Margo Seltzer was the program chair this year and she and the program committee brought together one of the best programs ever.  Earlier I posted my notes from Andy Bectolsheim’s session Andy Bechtolsheim at HPTS 2009 and his slides Technologies for Data Intensive Computing.

 

Two other sessions were particularly interesting and worth summarizing here. The first is a great talk on high-scale services lessons learned from Randy Shoup and a talk by John Ousterhout on RAMCloud a research project to completely eliminate the storage hierarchy and store everything in DRAM.

 

Randy Shoup

My notes from Randy’s talk follow and his slides are at: eBay’s Challenges and Lessons from Growing an eCommerce Platform to Planet Scale.

·         eBay Manages

1.       Over 89 million active users worldwide

2.       190 million items for sale in 50,000 categories

3.       Over 8 billion URL requests per day

4.       Roughly 10% of the items are listed or ended each day

5.       70B read/write operations/day

·         Architectural Lessons

1.       Partition Everything

2.       Asynchrony Everywhere

3.       Automate Everything

4.       Remember Everything Fails

5.       Embrace Inconsistency

6.       Expect Service Evolution

7.       Dependencies Matter

8.       Know which databases are Authoritativeand which are caches

9.       Never enough data (save everything)

10.   Invest in custom infrastructure

 

John Ousterhout

My notes from John’s talk follow and his slides are at: RAMCloud: Scalable Data Center Storage Entirely in DRAM. I really enjoyed this talk despite the fact that I saw the same talk presented at the Stanford Clean Slate CTO Summit. This talk is sufficiently thought provoking to be just as interesting the second time through. My notes from John’s talk:

·         Storage entirely in DRAM spread over 10s to 10s of thousands of servers

·         Focus of project:

o   Low latency and very large scale

·         ~64GB server each supporting:

o   1M ops/second

o   5 to 10 us RPC

·         Today commodity servers can stretch easily to 64GB. Expect to see 1TB in commodity servers out 5 to 10 years

·         Current cost is roughly $60/GB. Expect this to fall to $4/GB in 5 to 10 years

·         Motivation for RAMCloud project:

o   Databases don’t scale

·         Disk access rates not keeping up with capacity so disks must become archival:

o   See Jim Gray’s excellent Disk is Tape

·         Aggressive goal of achieving 5 to 10 u sec RPC

·         Points out that very low latency applications are not built upon relational databases and argues that very low data access latency removes the need for optimization of access plans and concludes the relational model will disappear.

o   I see value in low latency but don’t agree that the relational model will disappear. See One Size Does Not Fit All.

·         John makes an interesting observation “the cost of consistency increases with transaction over-lap”

o   Let:

§  0 = # overlapping transactions

§  R = arrival rate for new transactions

§  D = duration of each transaction

o   Then:

§  0 is proportional to R * D

§  R increases with system scale and, eventually, strong consistency becomes unaffordable

§  But, D decreases with lower latency

o   The interesting question: can we afford higher levels of consistency with lower latency?

·         John argues perhaps with very low latency, one size might fit all (a single data storage system could handle all workloads).

o   The counter argument to this one is that capital cost and power cost of an all memory solution appears prohibitively expensive for cold sequential workloads. Its perfect for OLTP but I don’t yet see the “one size can fit again” prediction.

 

If you are interested in digging deeper, the slides for all sessions are posted at: http://www.hpts.ws/agenda.html.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Saturday, November 14, 2009 8:24:25 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services | Software
 Wednesday, November 11, 2009

In an earlier post Andy Bechtolsheim at HPTS 2009 I put my notes up on Andy Bechtolsheim's excellent talk at HPTS 2009. His slides from that talk are now available: Technologies for Data Intensive Computing. Strongly recommended.

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

Wednesday, November 11, 2009 4:45:37 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Hardware
 Saturday, November 07, 2009

Just about exactly one year ago, I posted a summary and the slides from an excellent Butler Lampson talk: The Uses of Computers: What’s Past is Merely Prologue. Its time for another installment. Butler was at SOSP 2009 a few weeks back and Marvin Theimer caught up with him for a wide ranging discussion on distributed systems.

 

With Butler's permission, what follows are Marvin’s notes from the discussion.

 

Contrast cars with airplanes: when the fancy electronic systems fail you (most-of-the-time) can pull a car over to the side of the road and safely get out whereas an airplane will crash-and-burn.

 

Systems that behave like cars vs. airplanes:

·         It’s like a car if you can reboot it

·         What’s the scope of the damage if you reboot it

 

Ensuring that critical sub-systems keep functioning:

·         Layered approach with lower layers being simpler and able to cut off the higher layers and still keep functioning

·         Bottom layers need to be simple enough to reason about or perhaps even formally verify

·         Be skeptical about designing systems that gracefully degrade/approach their “melting points”.  Nice in theory, but not likely to be feasible in practice in most cases.

·         Have “firewalls” that partition your system into independent modules so that damage is contained.

·         Firewalls have “blast doors” that automatically come down in case of alarms going off.  Under normal circumstances the blast doors are up and you have efficient, optimized interaction between modules.  When alarms go off the blast doors go down.  The system must be able to work in degraded mode with the blast doors down.

·         You need to continually test your system for how it behaves with the blast doors down to ensure that “critical functioning” is still achievable despite system evolution and environment evolution.  Problem is that testing is expensive, so there is a trade-off between extensive testing and cost.  Essentially you can’t test everything.  This is part of the reason why the lowest levels of the system need to be simple enough to formally reason about their behavior.

o   Dave Gifford’s story about bank that had diesel backup generators for when power utility failed.  They religiously tested firing up the backup generators.  However, when a prolonged power actually occurred they discovered that the generators failed after half an hour because their lubricating oil failed.  No one had thought to test running on backup power for more than a few minutes.

 

Low-level multicast is bad because you can’t reason about the consequences of its use.  Better to have application-level multicast where you can explicitly control what’s going on.

 

RPC conundrum:

·         People have moved back from RPC to async messages because of the performance issues of sync RPC.

·         By doing so they are reintroducing concurrency issues into their programs.

A possible solution:

·         Constrain your system (if you can) to enable the definition of a small number of interaction patterns that hide the concurrency and asynchrony.

·         Your experts employ async messages to implement those higher-level interaction patterns.

·         The app developers only use the simper, higher-level abstractions.

·         Be happy with 80% solution – which you might achieve – and don’t expect to be able to handle all interactions this way.

 

Partitioned, primary-key scale-out approach is essentially mimicking the OLTP model of transactional DBs.  You are giving up certain kinds of join operators in exchange for scale-out and the app developer is essentially still programming the simple ACID DB model.

·         Need appropriate patterns/framework for updating multiple objects in a non-transactional manner.

·         Standard approach: update one object and transactionally write a message in a message queue for the other object.  Transactional update to other object is done asynchronously to the first object update.  Need compensation code for when things go wrong.

·         An interesting approach for simplifying the compensation problem: try to turn it into a garbage collection problem.  Background tasks look for to-do messages that haven’t been executed and figure out how to bring the system back into “compliance”.  You need this code for your fsck case anyway.

 

WARNING: don’t over-engineer your system.  Lots of interesting ideas here; you’ll be tempted to over-generalize and make things too complicated.  “Ordinary designers get it wrong 99% of the time; really smart designers get it wrong 50% of the time.”

 

Thanks to Butler Lampson and Marvin Theimer for making this summary available.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdriona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Saturday, November 07, 2009 10:45:57 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Tuesday, November 03, 2009

Last week AWS announced the Amazon Relational Database Service (Amazon RDS) and I blogged that it was big step forward for the cloud storage world: Amazon RDS, More Memory, and Lower Prices. This really is an important step forward in that a huge percentage of commercial applications are written to depend upon Relational Databases.  But, I was a bit surprised to get a couple of notes asking about the status of Simple DB and whether the new service was a replacement. These questions were perhaps best characterized by the forum thread The End is Nigh for SimpleDB. I can understand why some might conclude that just having a relational database would be sufficient but the world of structured storage extends far beyond relational systems. In essence, one size does not fit all and both SimpleDB and RDS are important components in addressing the needs of the broader database market.

 

Relational databases have become so ubiquitous that the term “database” is often treated as synonymous with relational databases like Oracle, SQL Server, MySQL, or DB2. However, the term preceded the invention and implementation of the relational model and non-relational data stores remain important today.

 

Relational databases are incredibly rich and able to support a very broad class of applications but with incredible breadth comes significant complexity. Many applications don’t need the rich programming model of relational systems and some applications are better serviced by lighter-weight, easier-to-administer, and easier-to-scale solutions. Both relational and non-relational structured storage systems are important and no single solution is appropriate for all applications. I’ll refer to this broader, beyond-relational database market as “structured storage” to differentiate it from file stores and blob stores.

 

There are a near infinite number of different taxonomies for the structured storage market, but one I find useful is a simple one based upon customer intent: 1) features-first, 2) scale-first, 3) simple structure storage, and 4) purpose-optimized stores. In the discussion that follows, I assume that no database would ever be considered as viable that wasn’t secure and didn’t maintain data integrity.  These are base requirements of any reasonable solutions.

 

Feature-First

The feature-first segment is perhaps the simplest to talk about in that there is near universal agreement. After 35 to 40 years, depending upon how you count, Relational Database Management Systems (RDBMSs) are the structured storage system of choice when a feature-rich solution is needed. Common Feature-First workloads are enterprise financial systems, human resources systems, and customer relationship management systems. In even very large enterprises, a single database instance can often support the entire workload and nearly all of these workloads are hosted on non-sharded relational database management systems.

 

Examples of products that meet this objective well include Oracle, SQL Server, DB2, MySQL, PostgreSQL amongst others. And the Amazon Relational Database Service announced last week is a good example of a cloud-based solution. Generally, the feature-first segment use RDBMSs.

 

Scale-First

The Scale-first segment is considerably less clear and the source of much more debate. Scale-first applications are those that absolutely must scale without bound and being able to do this without restriction is much more important than more features. These applications are exemplified by very high scale web sites such as Facebook, MySpace, Gmail, Yahoo, and Amazon.com. Some of these sites actually do make use of relational databases but many do not. The common theme across all of these services is that scale is more important than features and none of them could possibly run on a single RDBMS. As soon as a single RDBMS instance won’t handle the workload, there are two broad possibilities: 1) shard the application data over a large number of RDBMS systems, or 2) use a highly scalable key-value store.

 

Looking first at sharding over multiple RDBMS instances, this model requires that the programming model be significantly constrained to not expect cross-database instance joins, aggregations, globally unique secondary indexes, global stored procedures, and all the other relational database features that are incredibly hard to scale. Effectively, in this first usage mode, an RDBMS is being used as the implementation but the full relational model is not being exposed to the developer since the full model is incredibly difficult to scale. In this approach, the data is sharded over 10s or even 100s of independent database instances. The Windows Live Messenger group store is an excellent example of the Sharded RDBMS model of Scale-First.

 

There may be some that will jump in and say that DB2 Parallel Edition (DB2 PE, now part of the DB2 Enterprise Edition) and Oracle Real Application Clusters (Oracle RAC) actually do scale the full relational model. I was lucky enough to work closely with the DB2 PE team when I was Lead Architect on DB2 so I know it well. There is no question that both DB2 and RAC are great products but, as good as they are, very high scale sites still typically chose to either 1) shard over multiple instances or 2) use a high-scale, key-value store.

 

This first option, that of using an RDBMS as an implementation component, and sharding data over many instances is a perfectly reasonable and rational approach and one that is frequently used. The second option is to use a scalable key-value store. Some key-value store product examples include Project Voldemort, Ringo, Scalaris, Kai, Dynomite, MemcacheDB, ThruDB, CouchDB, Cassandra, HBase and Hypertable (see Key Value Stores).  Amazon SimpleDB is a good example of a cloud-based offering.

 

Simple Structured Storage

There are many applications that have a structured storage requirement but they really don’t need the features, cost, or complexity of an RDBMS. Nor are they focused on the scale required by the scale-first structured storage segment. They just need a simple key value store. A file system or BLOB-store is not sufficiently rich in that simple query and index access is needed but nothing even close to the full set of RDBMS features is needed. Simple, cheap, fast, and low operational burden are the most important requirements of this segment of the market.

 

Uses of Simple Structured Storage at unremarkable and, as a consequence, there are less visible examples at the low-end of the scale spectrum to reference. Towards the high-end, we have email inbox search at Facebook (using Cassandra), Last.fm reports they will be using Project-Voldemort (using Project-Voldemort), and Amazon uses Dynamo for the retail shopping cart (using Dynamo). Perhaps the widest used example of this class of storage system is Berkeley DB.  On the cloud-side, SimpleDB again is a good example (AdaptiveBlue, Livemocha, and Alexa).

 

Purpose-Optimized Stores

Recently Mike Stonebraker wrote an influential paper titled One Size Fits All: An Idea Whose Time Has Come and Gone. In this paper, Mike argued that the existing commercial RDBMS offerings do not meet the needs of many important market segments. In a presentation with the same title, Stonebraker argues that StreamBase special purpose stream processing system  beat the RDBMS solutions in benchmarks by 27x, that Vertica, a special purpose data warehousing product beat the RDBMS incumbents by never less than 30x, and H-Store (now VoltDB), a special purpose transaction processing system, beat the standard RDBMS offerings by a full 82x.

 

Many other Purpose-Optimized stores have emerged (for example, Aster Data, Netezza, and Greenplum) and this category continues to grow quickly. Clearly there is space and customer need for more than a single solution.

 

Where do SimpleDB and RDS Fit in?

The Amazon RDS service is aimed squarely at the first category above, Feature-First. This is a segment that needs features and mostly uses RDBMS databases. And RDS is amongst the easiest ways to bring up one or more databases quickly and efficiently without needing to hire a database administrator.

 

Amazon SimpleDB is a good solution for the third category, Simple Structured Storage. SimpleDB is there when you need it, is incredibly easy to use, and is inexpensive.  The SimpleDB team will continue to focus on 1) very high availability, 2) supporting scale without bound, 3) simplicity and ease of use, and 4) lowest possible cost and this service will continue to evolve.

 

The second category, scale-first, is served by both SimpleDB and RDS.  Solutions based upon RDS will shard the data over multiple, independent RDS database instances. Solutions based upon SimpleDB will either use the service directly or shard the data over multiple SimpleDB Domains. Of the two approaches, SimpleDB is the easiest to use and more directly targets this usage segment.

 

The SimpleDB team is incredibly busy right now getting ready for several big announcements over the next 6 to 9 months. Expect to see SimpleDB continue to get easier to use while approaching the goal of scaling without bound. The team is working hard and I’m looking forward to the new features being released.

 

The AWS solution for the final important category, purpose optimized storage, is based upon the Elastic Compute Cloud (EC2) and the Elastic Block Store (EBS). EC2 provides the capability to host specialized data engines and EBS provides virtualized storage for the data engine hosted in EC2. This combination is sufficiently rich to support Purpose-Optimized Stores such as Aster Data, Vertica, or Greenplum or any of the commonly used RDBMS offerings such as Oracle, SQL Server, DB2, MySQL, PostgreSQL.

 

The Amazon Web Services plan is to continue to invest deeply in both SimpleDB and RDS as direct structured storage solutions and to continue to rapidly enhance EC2 and EBS to ensure that broadly-used database solutions as well as purpose-built stores run extremely well in the cloud. This year has been a busy one in AWS storage and I’m looking forward to the same pace next year.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdriona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Tuesday, November 03, 2009 6:11:16 AM (Pacific Standard Time, UTC-08:00)  #    Comments [12] - Trackback
Services
 Saturday, October 31, 2009

Recently I came across Steve Souder's Velocity 2009 presentation: High Performance Web Sites: 14 Rules for Faster Loading Pages. Steve is an excellent speaker and the author of two important web performance books:

·         High Performance Web Sites

·         Even Faster Web Sites

 

The reason this presentation caught my interest is it focused on 1) why web sites are slow, 2) what to do about it, and 3) the economics of why you should care. Looking first at the economic argument for faster web sites, many companies are obsessed with site performance but few publish data the economic impact of decreasing web site latency.  The earliest data point I recall coming across from a major web site on the price of latency was from the Marissa Mayer 2008 keynote at Google IO:  Rough Notes from Marissa Mayer Keynote. In an example of Google’s use of A/B testing she reported:

[they surveyed users] would you like 10, 20, or 30 results. Users unanimously wanted 30.

·         But 10 did way better in A/B testing (30 was 20% worse) due to lower latency of 10 results

·         30 is about twice the latency of 10

 

Greg Linden had more detail on this from a similar talk Marissa gave at Web2.0: Marissa Mayer at Web 2.0 where he reported:

Marissa ran an experiment where Google increased the number of search results to thirty. Traffic and revenue from Google searchers in the experimental group dropped by 20%.

 

Ouch. Why? Why, when users had asked for this, did they seem to hate it?

 

After a bit of looking, Marissa explained that they found an uncontrolled variable. The page with 10 results took .4 seconds to generate. The page with 30 results took .9 seconds.

 

Half a second delay caused a 20% drop in traffic. Half a second delay killed user satisfaction.

 

Greg reported he found similar results when working at Amazon:

This conclusion may be surprising -- people notice a half second delay? -- but we had a similar experience at Amazon.com. In A/B tests, we tried delaying the page in increments of 100 milliseconds and found that even very small delays would result in substantial and costly drops in revenue.

 

Being fast really matters. As Marissa said in her talk, "Users really respond to speed."

 

The O’Reilly Velocity 2009 Conference organizers managed to convince some of the big players to present data on the cost of web latency. From a blog posting by Souders Velocity and the Bottom Line

·         Eric Schurman (Bing) and Jake Brutlag (Google Search) co-presented results from latency experiments conducted independently on each site. Bing found that a 2 second slowdown changed queries/user by -1.8% and revenue/user by -4.3%. Google Search found that a 400 millisecond delay resulted in a -0.59% change in searches/user. What's more, even after the delay was removed, these users still had -0.21% fewer searches, indicating that a slower user experience affects long term behavior. (video, slides)

·         Dave Artz from AOL presented several performance suggestions. He concluded with statistics that show page views drop off as page load times increase. Users in the top decile of page load times view ~7.5 pages/visit. This drops to ~6 pages/visit in the 3rd decile, and bottoms out at ~5 pages/visit for users with the slowest page load times. (slides)

·         Marissa Mayer shared several performance case studies from Google. One experiment increased the number of search results per page from 10 to 30, with a corresponding increase in page load times from 400 milliseconds to 900 milliseconds. This resulted in a 25% dropoff in first result page searches. Adding the checkout icon (a shopping cart) to search results made the page 2% slower with a corresponding 2% drop in searches/user. (Watch the video to see the clever workaround they found.) Image optimizations in Google Maps made the page 2-3x faster, with significant increase in user interaction with the site. (video, slides)

·         Phil Dixon, from Shopzilla, had the most takeaway statistics about the impact of performance on the bottom line. A year-long performance redesign resulted in a 5 second speed up (from ~7 seconds to ~2 seconds). This resulted in a 25% increase in page views, a 7-12% increase in revenue, and a 50% reduction in hardware. This last point shows the win-win of performance improvements, increasing revenue while driving down operating costs. (video, slides)

 

Souders presentation included many of the cost of latency data points above and included data from the Alexa Top 10 list to show that the bulk of web page latency is actually front end time rather than server latency:

  

Steve’s 14 rules from his book High Performance Web Sites:

1.       Make fewer HTTP requests

2.       Use a CDN

3.       Add an Expires header

4.       Gzip components

5.       Put stylesheets at the top

6.       Put scripts at the bottom

7.       Avoid CSS expressions

8.       Make JS and CSS external

9.       Reduce DNS lookups

10.   Minify JS

11.   Avoid redirects

12.   Remove duplicate scripts

13.   Configure ETags

14.   Make AJAX cacheable

 

I’ve always believed that speed was an undervalued and under-discussed asset on the web.  Google appears to be one of the early high-traffic sites to focus on low latency as a feature but, until recently, the big players haven’t talked much about the impact of latency. The data from Steve’s talk and his blog entry above is wonderful in that it underlines why low latency really is a feature rather than the result of less features. The rest of his presentation goes into detail into how to achieve low latency web pages. It’s a great talk.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdriona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

Saturday, October 31, 2009 7:39:37 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
Services
 Tuesday, October 27, 2009

I’ve worked on our around relational database systems for more than 20 years. And, although I freely admit that perfectly good applications can, and often are, written without using a relational database system, it’s simply amazing how many of world’s commercial applications depend upon them. Relational database offerings continue to be the dominant storage choice for applications with a need for structured storage.

 

There are many alternatives, some of which are very good. ISAMs like Berkeley DB. Simple key value stores.  Distributed Hash Tables.  There are many excellent alternatives and, for many workloads, they are very good choices. There is even a movement called Nosql aimed at advancing non-relational choices. And yet, after 35 to 40 years depending upon how you count, relational systems remain the dominant structured storage choice for new applications.

 

Understanding the importance of relational DBs and believing a big part of the server-side computing world is going to end up in the cloud, I’m excited to see the announcement last night of the Amazon Relational Database Service. From the RDS details page:


Amazon RDS is designed for developers or businesses who require the full features and capabilities of a relational database, or who wish to migrate existing applications and tools that utilize a relational database. It gives you access to the full capabilities of a MySQL 5.1 database running on your own Amazon RDS database instance.

To use Amazon RDS, you simply:

·         Launch a database instance (DB Instance), selecting the DB Instance class and storage capacity that best meets your needs.

·         Select the desired retention period (in number of days) for your automated database backups. Amazon RDS will automatically back up your database during your predefined backup window. For typical workloads, this allows you to restore to any point in time within your retention period, up to the last five minutes. You can also restore from a DB Snapshot, a user-initiated backup that can be run at any time with a simple API call.

·         Connect to your DB Instance using your favorite database tool or programming language. Since you have direct access to a full-featured MySQL database, any tool designed for the MySQL engine will work unmodified with Amazon RDS.

·         Monitor the compute and storage resource utilization of your DB Instance, for no additional charge, via Amazon CloudWatch. If at any point you need additional capacity, you can scale the compute and storage resources associated with your DB Instance with a simple API call.

·         Pay only for the resources you actually consume, based on your DB Instance hours consumed, database storage, backup storage, and data transfer.


AWS also announced last night the EC2 High Memory Instance, with over 64GB of memory. This instance type is ideal for large main memory workloads and will be particularly useful for high-scale database work. Databases just love memory.

 

I’ve been excited about cloud computing for years because computing really is less expensive at very high scale. There are substantial cost advantages that come with scale and, at the same time, infrastructure innovations and Moore’s law further contribute to dropping costs. Clearly, industry trends come and go. Those that have lasting impact are the big changes that really do change what can be done and/or allow it to be done at a fundamentally lower cost.   I think it’s great to be working in our industry as we go through one of these fairly dramatic transitions.

 

Consistent with this observation, the AWS EC2 On-Demand instance prices were reduced by up to 15%. From the Amazon announcement:

 

Effective November 1, 2009, Amazon EC2 will be lowering prices up to 15% for all On-Demand instance families and sizes. For example, a Small Standard Linux-based instance will cost just 8.5 cents per hour of computing, compared to the current price of 10 cents per hour.

 

Lower prices, more memory, and a fully managed, easy to use relational database offering.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdriona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Tuesday, October 27, 2009 6:38:49 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Monday, October 26, 2009

I’ve attached below my rough notes from Andy Bechtolsheim’s talk this morning at High Performance Transactions Systems 2009. The agenda for HPTS 2009 is up at: http://www.hpts.ws/agenda.html.

 

Andy is a founder of Sun Microsystems and of Arista Networks and is an incredibly gifted hardware designer. He’s consistently able to innovate and push the design limits while delivering reliable systems that really do work.  It was an excellent talk. Unfortunately, he’s not planning to post the slides so you’ll have to make do with my rough notes below.

 

·        Memory Technologies for Data Intensive Computing

·         Speaker: Andy Bechtolsheim, Sun Microsystems

·         Flash density is increasing at faster than Moore’s law and this is expected to continue

o   Expect 100x improvement over the next 10 or 12 years

·         Emerging technologies are coming

o   Carbon Nano-Tube, Phase-change, Nano-ionic, …

o   But new technologies take time so flash for now

·         Expect: in 2022

o   64x cores but only 500W

o   Would need 2.5 TB/s to memory and 250 GB/s to memory

·         We would have been at 10GHz at 2022 but power density limits makes this impractical

o   Power = clock * capacitance * Vdd^2

·         Most saving will be packaging innovations: Multi-Chip 3D packaging (stacking cpu and many memory chips)

o   More bandwidth through more channels without having to drive more pins (power issue)

·         Expect no more memory per core than today and it could be worse

o   Expect deeper multi-tier memories

·         10Gbps shipping today but expect 25GB in 2012

·         Disks are SOOOOO slow in this context

o   Forget disk for all but sequential and archival storage

·         Sun Flash DIMM

o   30,000 Read IOPS, 20,000 writes

o   Oracle did 7,717,510 tpmC using 24 sun flash devices

·         Not easy to get 10^6 IOPS

o   Limit is disk interface

o   Answer is to go direct to PCI-X PCIe bus [jrh: this is what FusionIO does]

·         Flash very different from DRAM:  100usec to read flash. About 1000x slower than DRAM.

·         Enterprise flash coming:

o   Rather than power optimized 33 Mhz transfers, run 133 Mhz or better

·         Flash Summary:

o   Expect the price of flash to ½ each year and the density to double each year

o   Access times will fall by 50% per year

o   Throughput will double each year

o   Controllers are rapidly improving

o   Interface moving from SATA to PCI-X PCIe

·         Most promising new technologies are stacked chips (thu-Si via stacking) and flash

·         Expect optic volumes to go up and price to go down driven by client side volumes:

o   Intel Light Peak announced $5/client with on board chips

 

Generally Andy is incredibly positive on the continuation of Moors expects this pace of advancement to continue for at least another 10 years. He argues that disk is only useful for cold and sequential workloads and that flash is the future.  Phase Change Memory and other new technologies may eventually replace flash but he points out these changes always take longer than predicted. 

 

Expect flash to stay strong and relevant for the near term and expect it to be PCI-X PCIe connected rather than SATA attached.

 

James Hamilton

e: jrh@mvdriona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Monday, October 26, 2009 1:09:08 PM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Hardware
 Saturday, October 24, 2009

I attended the Stanford Clean Slate CTO Summit last week. It was a great event organized by Guru Parulkar. Here’s the agenda:

 

  • 12:00: State of Clean Slate -- Nick McKeown, Stanford
  • 12:30:00pm: Software defined data center networking -- Martin Casado, Nicira
  • 1:00: Role of OpenFlow in data center networking -- Stephen Stuart, Google
  • 2:30: Data center networks are in my way -- James Hamilton, Amazon
  • 3:00: Virtualization and Data Center Networking -- Simon Crosby, Citrix
  • 3:30:RAMCloud: Scalable Datacenter Storage Entirely in DRAM  -- John Ousterhout, Stanford
  • 4:00: L2.5:  Scalable and reliable packet delivery in data centers -- Balaji Prabhakar, Stanford
  • 4:45: Panel: Challenges of Future Data Center Networking--Panelists, James Hamilton, Stephen Stuart, Andrew Lambeth (VMWare), Marc Kwiatkowski (Facebook)

 

I presented Networks are in my Way. My basic premise is that networks are both expensive and poor power/performers. But, much more important, they are in the way of other even more important optimizations. Specifically, because most networks are heavily oversubscribed, the server workload placement problem ends up being seriously over-constrained.  Server workloads need to be near storage, near app tiers, distant from redundant instances, near customer, and often on a specific subnet  due to load balancer or VM migration restrictions. Getting networks out of the way so that servers can be even slightly better utilized will have considerably more impact than many direct gains achieved by optimizing networks.

 

Providing cheap 10Gibps to the host gets networks out of the way by enabling the hosting of many data intensive workloads such as data warehousing, analytics, commercial HPC, and MapReduce workloads. Simply providing more and cheaper bandwidth could potentially have more impact than many direct networking innovations.

 

Networking power/performance is unquestionably poor. I often refer to net gear as the SUV of the data center. However, the biggest gain in power efficiency that networks could enable isn’t in reducing networking power but in getting out of the way and allowing better server utilization. Networking is under 4% of the power consumption in a typical high-scale data center whereas severs are responsible for 44%. I’m arguing that the best networking power innovations are the ones that help make the use of servers more efficient.

 

Looking at networking cost, we see we actually do have a direct problem there. At scale, networking gear represents a full 18% of the cost of all infrastructure (shell, power, power distribution, mechanical systems, servers,&  networking). For every $2.5 spend on servers, roughly $1 is spent on networking. Over time, the ratio of networking gear to servers continues to worsen. I look at this phenomena in more detail in It's the  Eco System Stupid where the commodity server ecosystem is compared to the to the current networking equipment ecosystem.  In my view, the industry needs an competitive, multi-source networking hardware and software stack.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdriona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Saturday, October 24, 2009 11:56:21 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Hardware
 Saturday, October 17, 2009

Jeff Dean of Google did an excellent keynote talk at LADIS 2009.  Jeff’s talk is up at: http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf and my notes follow:

·         A data center wide storage hierarchy:

o   Server:

§  DRAM: 16GB, 100ns, 20GB/s

§  Disk: 2TB, 10ms, 200MB/s

o   Rack:

§  DRAM: 1TB, 300us, 100MB/s

§  Disk: 160TB, 11ms, 100MB/s

o   Aggregate Switch:

§  DRAM: 30TB, 500us, 10MB/s

§  Disk: 4.8PB, 12ms, 10MB/s

·         Failure Inevitable:

o   Disk MTBF: 1 to 5%

o   Servers: 2 to 4%

·         Excellent set of distributed systems rules of thumb:

o   L1 cache reference 0.5 ns

o   Branch mispredict 5 ns

o   L2 cache reference 7 ns

o   Mutex lock/unlock 25 ns

o   Main memory reference 100 ns

o   Compress 1K bytes with Zippy 3,000 ns

o   Send 2K bytes over 1 Gbps network 20,000 ns

o   Read 1 MB sequentially from memory 250,000 ns

o   Round trip within same datacenter 500,000 ns

o   Disk seek 10,000,000 ns

o   Read 1 MB sequentially from disk 20,000,000 ns

o   Send packet CA->Netherlands->CA 150,000,000 ns

·         Typical first year for a new cluster:

o   ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)

o   ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)

o   ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)

o   ~1 network rewiring (rolling ~5% of machines down over 2-day span)

o   ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)

o   ~5 racks go wonky (40-80 machines see 50% packetloss)

o   ~8 network maintenances (4 might cause ~30-minute random connectivity losses)

o   ~12 router reloads (takes out DNS and external vips for a couple minutes)

o   ~3 router failures (have to immediately pull traffic for an hour)

o   ~dozens of minor 30-second blips for dns

o   ~1000 individual machine failures

o   ~thousands of hard drive failures

o   slow disks, bad memory, misconfigured machines, flaky machines, etc.

·         GFS Usage at Google:

o   200+ clusters

o   Many clusters of over 1000 machines

o   4+ PB clients

o   40 GB/s read/write laod

·         Map Reduce Usage at Google: 3.5m jobs/year averaging 488 machines each & taking ~8 min

·         Big Table Usage at Google: 500 clusters with largest having 70PB, 30+ GB/s I/O

·         Working on next generation GFS system called Colossus

·         Metadata management for Colossus in BigTable

·         Working on next generation Big Table system called Spanner

o   Similar to BigTable in that Spanner has tables, families, groups, coprocessors, etc.

o   But has hierarchical directories rather than rows, fine-grained replication (ad directory level), ACLs

o   Supports both weak and strong data consistency across data centers

o   Strong consistency implemented using Paxos across replicas

o   Supports distributed transactions across directories/machines

o   Much more automated operation

§  Auto data movement and replicas on basis of computation, usage patterns, and failures

o   Spanner design goals: 10^6 to 10^7 machines, 10^13 directories, 10^18 storage, 10^3 to 10^4 locations over long distances

o   Users specify require latency and replication factor and location

 

I did the keynote at last year’s LADIS: http://perspectives.mvdirona.com/2008/09/16/InternetScaleServiceEfficiency.aspx

 

                                                                --jrh

James Hamilton

e: jrh@mvdriona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Saturday, October 17, 2009 6:06:24 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Services
 Wednesday, October 14, 2009

I love Solid State Disks and have written about them extensively:

And, being a lover of SSD, I know they are a win in many situations including power savings.  However, try as I might (and I have tried hard), I can’t find a way to justify using SSDs on power savings alone. In When SSDs Make Sense in Server Applications I walked through where flash SSDs are a clear win. They are great for replacing disks in VERY high IOPS workloads. They are great for workloads that can’t go do disk and must entirely be held by main-memory caches.  In this usage mode, SSDs can allow the data that was previously had to be memory resident to be moved to SSD. This can be a huge win in that memory is very power intensive and, as much as memory prices have fallen, it’s still expensive relative to disk and flash.

 

In this recently released article, MySpace Replaces all Server Hard Disks with Flash Drives, it was announced that MySpace has completely stopped using hard disks. The article said “MySpace said the solid state storage uses less than 1% of the power and cooling costs that their previous hard drive-based server infrastructure had and that they were able to remove all of their server racks because the ioDrives are embedded directly into even its smallest servers.” Presumably they meant “remove all of the storage racks” rather than “remove all the server racks.”

 

I totally believe that MySpace can and should move some content that currently must live in memory and shift it out to SSDs and I like the savings that will come from doing this. No debate.  I fully expect MySpace have some workloads that drive incredibly high IOPS rates and these can be replaced by SSDs.  But in every company I’ve ever worked or visited, the vast majority of the persistent disk resident data is cold.  Security and audit logs, backups, boot disks, archival copies, debugging information, rarely accessed large objects. Putting cold data without extremely aggressive access time SLAs on flash just doesn’t make sense.  These data are capacity bound rather than IOPS bound and flash is an expensive way to get capacity.

 

The argument made in the article is that power savings of flash over SSD justify the cost per GB delta. I can’t make that math even get close to working. In Annual Fully Burdened Cost of Power we looked at the cost of fully provisioned power and found that if we include power distribution costs, cooling costs, and power, power costs $2.12 per watt per year. Given that a commodity disk (where cold data belongs) consumes 6 to 10w disk each and can store well over a TB, this  math simply can’t be made to work.  I have come across folks that think the power savings justify the technology change even for cold data but I’ve never seen a case where the assertion stood the test of scrutiny.

 

SSDs are wonderful for many applications and I would certainly replace some disks with SSDs but replacing ALL disks doesn’t make sense in the workload mixes found in most data centers.   In this case, I suspect it was a communication error and MySpace has not really replaced all of their disk with SSDs.

 

Thanks to Tom Klienpeter who sent this one my way.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdriona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Wednesday, October 14, 2009 11:30:12 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Hardware
 Wednesday, October 07, 2009

In past posts such as Web Search Using Small Cores I’ve said “Atom is a wonderful processor but current memory managers on Atom boards don’t support Error Correcting Codes (ECC) nor greater than 4 gigabytes of memory. I would love to use Atom in server designs but all the data I’ve gathered argues strongly that no server workload should be run without ECC.”  And, in Linux/Apache on ARM Processors I said “unlike Intel Atom based servers, this ARM-based solution has the full ECC Memory support we want in server applications (actually you really want ECC in all applications from embedded through client to servers

 

An excellent paper was just released that puts hard data behind this point and shows conclusively that ECC is absolutely needed. In DRAM Errors in the Wild: A Large Scale Field Study, Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber show conclusively that you really do need ECC memory in server applications. Wolf was also an author of the excellent Power Provisioning in a Warehouse-Sized Computer that I mentioned in my blog post Slides From Conference on Innovative Data Systems Research where the authors described a technique to over-sub subscribe data center power.

 

I continue to believe that client systems should also be running ECC and strongly suspect that a great many kernel and device driver failures are actually the result of memory fault. We don’t have the data to prove it conclusively from a client population but I’ve long suspected that the single most effective way for Windows to reduce their blue screen rate would be to require ECC as a required feature for Windows Hardware Certification.

 

Returning to servers, in Kathy Yelick’s ISCA 2009 keynote, she showed a graph that showed ECC recovery rates (very common) and noted that the recovery times are substantial and the increased latency of correction is substantially slowing the computation (ISCA 2009 Keynote I: How to Waste a Parallel Computer – Kathy Yelick).


This more recent data further supports Kathy’s point, includes wonderfully detailed analysis and concludes with:

 

·         Conclusion 1: We found the incidence of memory errors and the range of error rates across different DIMMs to be much higher than previously reported. About a third of machines and over 8% of DIMMs in our fleet saw at least one correctable error per year. Our per-DIMM rates of correctable errors translate to an average of 25,000–75,000 FIT (failures in time per billion hours of operation) per Mbit and a median FIT range of 778 –25,000 per Mbit (median for DIMMs with errors), while previous studies report 200-5,000 FIT per Mbit. The number of correctable errors per DIMM is highly variable, with some DIMMs experiencing a huge number of errors, compared to others. The annual incidence of uncorrectable errors was 1.3% per machine and 0.22% per DIMM. The conclusion we draw is that error correcting codes are crucial for reducing the large number of memory errors to a manageable number of uncorrectable errors. In fact, we found that platforms with more powerful error codes (chipkill versus SECDED) were able to reduce uncorrectable error rates by a factor of 4–10 over the less powerful codes. Nonetheless, the remaining incidence of 0.22% per DIMM per year makes a crash-tolerant application layer indispensable for large-scale server farms.

·         Conclusion 2: Memory errors are strongly correlated. We observe strong correlations among correctable errors within the same DIMM. A DIMM that sees a correctable error is 13–228 times more likely to see another correctable error in the same month, compared to a DIMM that has not seen errors. There are also correlations between errors at time scales longer than a month. The autocorrelation function of the number of correctable errors per month shows significant levels of correlation up to 7 months. We also observe strong correlations between correctable errors and uncorrectable errors. In 70-80% of the cases an uncorrectable error is preceded by a correctable error in the same month or the previous month, and the presence of a correctable error increases the probability of an uncorrectable error by factors between 9–400. Still, the absolute probabilities of observing an uncorrectable error following a correctable error are relatively small, between 0.1–2.3% per month, so replacing a DIMM solely based on the presence of correctable errors would be attractive only in environments where the cost of downtime is high enough to outweigh the cost of the expected high rate of false positives.

·         Conclusion 3: The incidence of CEs increases with age, while the incidence of UEs decreases with age (due to replacements). Given that DRAM DIMMs are devices without any mechanical components, unlike for example hard drives, we see a surprisingly strong and early effect of age on error rates. For all DIMM types we studied, aging in the form of increased CE rates sets in after only 10–18 months in the field. On the other hand, the rate of incidence of uncorrectable errors continuously declines starting at an early age, most likely because DIMMs with UEs are replaced (survival of the fittest).

·         Conclusion 4: There is no evidence that newer generation DIMMs have worse error behavior. There has been much concern that advancing densities in DRAM technology will lead to higher rates of memory errors in future generations of DIMMs. We study DIMMs in six different platforms, which were introduced over a period of several years, and observe no evidence that CE rates increase with newer generations. In fact, the DIMMs used in the three most recent platforms exhibit lower CE rates, than the two older platforms, despite generally higher DIMM capacities. This indicates that improvements in technology are able to keep up with adversarial trends in DIMM scaling.

·         Conclusion 5: Within the range of temperatures our production systems experience in the field, temperature has a surprisingly low effect on memory errors. Temperature is well known to increase error rates. In fact, artificially increasing the temperature is a commonly used tool for accelerating error rates in lab studies. Interestingly, we find that differences in temperature in the range they arise naturally in our fleet’s operation (a difference of around 20C between the 1st and 9th temperature decile) seem to have a marginal impact on the incidence of memory errors, when controlling for other factors, such as utilization.

·         Conclusion 6: Error rates are strongly correlated with utilization.

·         Conclusion 7: Error rates are unlikely to be dominated by soft errors. We observe that CE rates are highly correlated with system utilization, even when isolating utilization effects from the effects of temperature. In systems that do not use memory scrubbers this observation might simply reflect a higher detection rate of errors. In systems with  memory scrubbers, this observations leads us to the conclusion that a significant fraction of errors is likely due to mechanism other than soft errors, such as hard errors or errors induced on the data path. The reason is that in systems with memory scrubbers the reported rate of soft errors should not depend on utilization levels in the system. Each soft error will eventually be detected (either when the bit is accessed by an application or by the scrubber), corrected and reported. Another observation that supports Conclusion 7 is the strong correlation between errors in the same DIMM. Events that cause soft errors, such as cosmic radiation, are expected to happen randomly over time and not in correlation. Conclusion 7 is an interesting observation, since much previous work has assumed that soft errors are the dominating error mode in DRAM. Some earlier work estimates hard errors to be orders of magnitude less common than soft errors and to make up about 2% of all errors. Conclusion 7 might also explain the significantly higher rates of memory errors we observe compared to previous studies.

 

Based upon this data and others, I recommend against non-ECC servers. Read the full paper at: DRAM Errors in the Wild: A Large Scale Field Study. Thanks for Cary Roberts for pointing me to this paper.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdriona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Wednesday, October 07, 2009 5:33:50 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
Hardware
 Sunday, October 04, 2009

Data center networks are nowhere close to the biggest cost or the even the most significant power consumers in a data center (Cost of Power in Large Scale Data Centers) and yet substantial networking constraints loom large just below the surface. There are many reasons why we need innovation in data center networks but let’s look at a couple I find particularly interesting and look at the solution we offered in a recent SIGCOMM paper VL2: A Scalable and Flexible Data Center Network.

 

Server Utilization

By far the biggest infrastructure cost in a high-scale service is the servers themselves. The first and most important optimization of server resources is to increase server utilization. The best way to achieve higher server utilization is to run the servers as large homogeneous resource pool where workloads can be run on available servers without constraint. There are (at least) two challenges with this approach: 1) most virtual machine live migration techniques only work within a subnet (a layer 2 network) and 2) compute resources that communicate frequently and in high volume need to be “near” each other.

 

Layer 2 networks are difficult to scale to entire data centers so all but the smallest facilities are made up of many layer 2 subnets each of what might be as small as 20 servers or as large as 500. Scaling layer 2 networks much beyond order 10^3 servers is difficult and seldom done with good results and most are in the O(10^2) range. The restriction of not being able to live migrate workloads across layer 2 boundaries is a substantial limitation on hardware resource balancing and can lead to lower server utilization. Ironically, even though networking is typically only a small portion of the overall infrastructure cost, constraints brought by networking can waste the most valuable components, the servers themselves, through poor utilization.

 

The second impediment to transparent workload placement – the ability to run any workload on any server is driven by the inherent asymmetry typical of data center networks. Most data center networks are seriously over-subscribed. This means there is considerably more bandwidth between servers in the same rack than between racks. And, again, there is considerable more bandwidth between racks on the same aggregation switch than between racks on different aggregation switches through the core routers.  Oversubscription levels of 80 to 1 are common and as much as 240 to 1 can easily be found.  If two servers need to communicate extensively and in volume with each other, then they need to be placed near to each other with respect to the network.  These networking limitations make workload scheduling and placement considerably more difficult and drive reduced levels of server utilization.  Networking is, in effect, “in the way” and blocking the efficient optimization of the most valuable resources in the data center. Server under-utilization wastes much of the capital spent on servers and leaves expensive power distribution and cooling resources underutilized.

 

Data Intensive computing

In the section above, we talked about networking over-subscription levels of 80:1 and higher being common. In the request/response workloads found in many internet services, these over-subscription levels can be tolerable and work adequately well. They are never ideal but they can be sufficient to support the workload. But, for workloads that move massive amounts of data between nodes rather than small amounts of data between the server and the user, oversubscription can be a disaster. Examples of these data intensive workloads are data analysis clusters, many high performance computing workloads, and the new poster child of this workload-type, MapReduce.  MapReduce clusters of hundreds of servers are common and there are many clusters are thousands of servers operating upon petabytes of data. It is quite common for a MapReduce job to transfer the entire multi-petabyte data set over the network during a single job run. This can tax the typically shared networking infrastructure incredibly and the network is often the limiting factor in job performance. Or, worded differently, all the servers and all the other resources in the cluster are being underutilized because of insufficient network capacity.

 

What Needs to Change?

Server utilization can continue to be improved without lifting the networking constraints but, when facing an over-constrained problem, it makes no sense to allow a lower cost component impose constraints on the optimization of a higher cost component. Essentially, the network is in the way. And, the same applies to data intensive computing. These workloads can be run on over-subscribed networks but they don’t run well. Any workload that is network constrained is saving money on the network at the expensive of underutilizing more valuable components such as the servers and storage.

 

The biggest part of the needed solution is lower cost networking gear. The reason why most data centers run highly over-subscribed networks is the expense of high-scale networking gear. Rack switches are relatively inexpensive and, as a consequence, they are seldom over-subscribed. Within the rack bandwidth is usually only limited by the server port speed. Aggregation routers connect rack switches. These implement layer 3 protocols but that’s not the most important differentiator. Many cheap top of rack switches also implement layer 3 protocols.  Aggregation switches are more expensive because they have larger memory, larger routing tables, and they support much higher port counts. Essentially they are networking equivalent of scale-up servers. And, just as with servers, scaling up networking gear drives costs exponentially. These expensive aggregation and core routers force, or strongly encourage, some degree of oversubscription in an effort to get the costs scaling closer to linearly as the network grows.

 

Low cost networking gear is a big part of the solution but it doesn’t address the need to scale the layer 2 network discussed above. The two approaches being looked at to solve this problem are to 1) implement a very large layer 2 network or 2) implement a layer 2 overlay network. Cisco and much of the industry is taking the approach of implementing very large layer 2 networks. Essentially changing and extending layer 2 with layer 3 functionality (see The Blurring of layer 2 and layer 3). You’ll variously see the efforts to scale layer 2 referred to as Data Center Ethernet (DCE) or IEEE Data Center Bridging (DCB).

 

The second approach is to leverage the industry investment in layer 3 networking protocols and implement an overlay network. This was the technique employed by Albert Greenberg and a team of researchers including myself in VL2: A Scalable and Flexible Data Center Network which was published at SIGCOMM 2009 earlier this year. The VL2 project is built using commodity 24-port, 1Gbps switch gear. Rather than using scale-up aggregation and core routers, these low cost, high-radix, commodity routers are cabled to form a Clos network that can reasonably scale to O(10^6) ports. This network topology brings many advantages including: 1) no oversubscription, 2) incredibly robust with many paths between any two ports, 3) inexpensive depending only upon high-volume, commodity components, and 4) able to support large data centers in a single, non-blocking network fabric.

 

The VL2 approach combines the following:

·         Overlay: VL2 is an overlay where all traffic is encapsulated at the source end point and decapsulated destination end point. VL2 separates Location Addresses (PA) from Application Addresses (AA). PAs are the standard hierarchically assigned IP addresses used in the underlying physical network fabric. AAs are the addresses used by the application and the AAs form a single, flat layer 2 address space. Virtual machines can be moved anywhere in the network and still have the same AA. To the application it looks like a single, very-large subnet but, the physical transport network is a conventional layer 3 network with hierarchically assigned IP addresses and subnets. VL2 implements a single flat address space without requiring layer 2 extensions not present in commodity routers and without requiring protocol changes in the application.

·         Central Directory: The directory implements Application Address to Location Address lookup and back in a central directory which keeps the implementation simple, avoid broadcast domain scaling issues, and supports O(10^6) port scaling.

·         Valiant Load Balancing: VLB is used to randomly spread flows over the multipath fabric. Entire flows are spread randomly rather than single packets in a fallow to ensure in-order delivery (all packets on a flow take the same path in the absence of link failure). The paper agrees that spreading packets rather than flows would yield more stable results in the presence of dissimilar flow sizes but experimental results suggest flow spreading may be an acceptable approximation.

 

If you are interested in digging deeper into the VL2 approach:

·         The VL2 Paper: VL2: A Scalable and Flexible Data Center Network

·         An excellent presentation both motivating and discussing VL2: Networking the Cloud

 

In my view, we are on the cusp of big changes in the networking world driven by the availability of high-radix, low-cost, commodity routers coupled with protocol innovations.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdriona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, October 04, 2009 4:18:26 PM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
Hardware
 Thursday, October 01, 2009

Microsoft’s Chicago data center was just reported to be online as of July 20th. Data Center Knowledge published an interesting and fairly detailed report in: Microsoft Unveils Its Container-Powered Cloud. 

 

Early industry rumors were that Rackable Systems (now SGI but mark me down as confused on how that brand change is ever going to help the company) had won the container contract for the lower floor of Chicago. It appears that the Dell Data Center Solutions team has now has the business and 10 of the containers are from DCS.

 

The facility is reported to be a ½ billion dollar facility of 700,000 square feet. The upper floor is a standard data center whereas the lower floor is the world’s largest containerized deployment. Each container holds 2,000 servers and ½MW of critical load. The entire lower floor when fully populated will house 112 containers and 224,000 servers.

 

Data Center Knowledge reports:

 

The raised-floor area is fed by a cooling loop filled with 47-degree chilled water, while the container area is supported by a separate chilled water loop running at 65 degrees. Of the facility’s total 30-megawatt power capacity, about 20 megawatts is dedicated to the container area, with about 10 megawatts for the raised floor pods. The power infrastructure also includes 11 power rooms and 11 diesel generators, each providing 2.8 megawatts of potential backup power that can be called upon in the event of a utility outage.

 

Unlike Dublin which uses a very nice air-side economization design, Chicago is all water cooled with water side economization but no free air cooling at all.

 

One of the challenges of container systems is container handling. These units can weight upwards of 50,000 lbs and are difficult to move and the risk of a small mistake by a crane operator is substantial not to mention the cost of gantry cranes to move them around. The Chicago facility takes a page from advanced material handling and slides the containers on air skates over the polished concrete floor. Just 4 people can move a 2 container stack into place. It’s a very nice approach.

 

The entire facility is reported to be 30MW total load but 112 containers would draw 56MW critical load. So we know the 30MW number is an incremental build-out point rather than the facility's fully built size. Once completed, I would estimate it will be closer to 80MW of critical load and around 110MW of total power (assuming 1.35 PUE).

 

                                                --jrh

 

James Hamilton

e: jrh@mvdriona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Thursday, October 01, 2009 6:01:09 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Hardware
 Sunday, September 27, 2009

I recently came across an interesting paper that is currently under review for ASPLOS. I liked it for two unrelated reasons: 1) the paper covers the Microsoft Bing Search engine architecture in more detail than I’ve seen previously released, and 2) it covers the problems with scaling workloads down to low-powered commodity cores clearly. I particularly like the combination of using important, real production workloads rather than workload models or simulations and using that base to investigate an important problem: when can we scale workloads down to low power processors and what are the limiting factors?

 

The paper: Web Search Using Small Cores: Quantifying the Price of Efficiency.

Low Power Project Team Site: Gargoyle: Software & Hardware for Energy Efficient Computing

 

I’ve been very interested in the application of commodity, low-power processors to produce service workloads for years and wrote up some of the work done during 2008 for the Conference on Innovative Data Systems Research in a paper CEMS: Low-Cost, Low-Power Servers for Internet-Scale Services and presentation. And several blog entries since that time:

This paper uses an Intel Atom as the low powered, commodity processor under investigation and compares it with Intel Harpertown. It would have been better to use Intel Nehalem as the server processor of comparison. Nehalem is a dramatic step forward in power/performance over Harpertown. But using Harpertown didn’t change any of the findings reported in the paper so it’s not a problem.

 

On the commodity, low-power end, Atom is a wonderful processor but current memory managers on Atom boards don’t support ECC nor greater than 4 gigabytes of memory. I would love to use Atom in server designs but all the data I’ve gathered argues strongly that no server workload should be run without ECC. Intel clearly has memory management units with the appropriate capabilities so it’s obviously not technical problems that leave Atom without ECC. The low-powered AMD part used in CEMS does include ECC as does the ARM I mentioned in the recent blog entry ARM Cortex-A9 SMP Design Announced.

 

Most “CPU bound” workloads are actually not CPU bound but limited by memory. The CPU will report busy but it is actually spending most of its time in memory wait states. How can you tell if your workload is actually memory bound or CPU bound?  Look at Cycles Per Instruction, the number of cycles that each instruction takes. Super scalar processors should be dispatching many instructions per cycle (CPI << 1.0) but memory wait state on most workloads tend limit CPIs to over 1.  Branch intensive workloads that touch large amounts of memory tend to have high CPI counts whereas cache resident workloads will be very low and potentially less than 1.  I’ve seen operating system code with the CPI more than 7 and I’ve seen database systems in the 2.4 range. More optimistic folks than I, tend to look at the reciprocal of CPI, instructions per cycle but it’s the same data. See my Rules of Thumb post for more discussion of CPI.

 

In figure 1, the paper shows the instructions per Cycle (IPC which is 1/CPI) of Apache, MySQL, JRockit, DBench, and Bing.  As I mentioned above, if you give server workloads sufficient disk and network resources, they typically become memory bound. A CPI of 2.0 or greater are typical of commercial server workloads and well over 3.0 is common. As we expected, all the public server workloads in Figure 1 are right around a CPI of 2.0 (IPC roughly equal to 0.5).  Bing is the exception with a IPC CPI of nearly 1.0. This means that Bing is almost twice as computationally intensive than typical server workloads. This is an impressively good CPI and makes this workload particularly hard to run on low-power, low-cost, commodity processors. The authors choice of this very difficult workload to study allows them to clearly see the problems of scaling down server workloads and makes the paper better. Effectively using a difficult workload draws out and make more obvious the challenges of scaling down workloads to low-power processors. We need to keep in mind that most workloads, in fact, nearly all server workloads are a factor of 2 less computationally intensive and therefore easier to host on low-powered servers.

 

The lessons I got from the paper are: Unsurprisingly Atom showed much better power/performance than Harpertown but offered considerably less performance head room. Conventional server processors are capable of very high-powered bursts of performance but typically operate in lower performance states. When you need to run a short computational intensive segment of code, the performance is there.  Low power processors operate in steady state nearer to their capabilities limits. The good news is they operate nearly an order of magnitude more efficiently than the high powered server processors but they don’t have the ability to deliver the computational bursts at the same throughput.

 

Given that low-powered processors are cheap, over-provisioning is the obvious first solution. Add more processors and run them at lower average utilization in order to have the headroom to be able to process computationally intensive code segments without slowdown. Over-provisioning helps with throughput and provides the headroom to handle computationally intensive code segments but doesn’t help with the latency problem.  More cores will help most server workloads but, on those with both very high computational intensity (CPI near 1 or lower) and needing very low latency, only fast cores can fully address the problem. Fortunately, these workloads are not the common case.

 

Another thing to keep in mind is, if you improve the price/performance and power/performance of processors greatly, other server components begin to dominate. I like to look at extremes to understand these factors.  What if the processor was free and consumed zero power?  The power consumption of memory and glue chips would dominate and the cost of all the other components would put a floor on the server cost. This argues for at least 4 server design principles: 1) memory is on track to be the biggest problem so we need low cost, power efficient memories, 2) very large core counts help amortize the cost of all the other server components and helps manage the peak performance problem, 3) as the cost of the server is scaled down, it makes sense to share some components such as power supplies, and 4) servers will never be fully balanced (all resources consumed equally) for all workloads so we’ll need the ability to take resources to low-power states or even to depower them.  Intel Nehalem does some of this later point and mobile phone processors like ARM are masters of it.

 

If you are interested in high scale search, the application of low-power commodity processors to service workloads, or both, this paper is a good read.

 

James Hamilton

e: jrh@mvdriona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, September 27, 2009 7:21:58 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Hardware
 Thursday, September 24, 2009

This is 100% the right answer: Microsoft’s Chiller-less Data Center. The Microsoft Dublin data center has three design features I love: 1) they are running evaporative cooling, 2) they are using free-air cooling (air-side economization), and 3) they run up to 95F and avoid the use of chillers entirely. All three of these techniques were covered in the best practices talk I gave at the Google Data Center Efficiency Conference  (presentation, video).

 

Other blog entries on high temperature data center operation:

·  Next Point of Server Differentiation: Efficiency at Very High Temperature

·  Costs of Higher Temperature Data Centers?

·  32C (90F) in the Data Center

 

Microsoft General Manager of Infrastructure Services Arne Josefsberg blog entry on the Dublin facility: http://blogs.technet.com/msdatacenters/archive/2009/09/24/dublin-data-center-celebrates-grand-opening.aspx.

 

In a secretive industry like ours, it’s good to see a public example of a high-scale data center running hot and without chillers. Good work Microsoft.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdriona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Thursday, September 24, 2009 10:37:27 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Hardware
 Monday, September 21, 2009

Here’s another innovative application of commodity hardware and innovative software to the high-scale storage problem. MaxiScale focuses on 1) scalable storage, 2) distributed namespace, and 3) commodity hardware.

 

Today's announcement: http://www.maxiscale.com/news/newsrelease/092109.

 

They sell software designed to run on commodity servers with direct attached storage. They run N-way redundancy with a default of 3-way across storage servers to be able to survive disk and server failure. The storage can be accessed via HTTP or via Linux or Windows (2003 and XP) file system calls. The later approach requires a kernel installed device driver and uses a proprietary protocol to communicate back with the filer cluster but has the advantage of directly support local O/S read/write operations. MaxiScale architectural block diagram:

Overall I like the approach of using commodity systems with direct attached storage as the building block for very high scale storage clusters but that is hardly unique. Many companies have head down this path and, generally, it’s the right approach. What caught my interest when I spoke to the MaxiScale team last week was: 1) distributed metadata, 2) MapReduce support, and 3) small file support. Let’s look at each of these major features:

 

Distributed Metadata

File systems need to maintain a namespace. We need to maintain the directory hierarchy and we need to know where to find the storage blocks that make up the file.  In addition, other attributes and security may need to be stored depending upon the implemented file system semantics. This metadata is often stored in large key/value store. The metadata requires at least some synchronization since, for example, you don’t want to create two different objects of the same name at roughly the same time.  At high scale, storage servers will be joining and leaving the cluster all the time, so having a central metadata service is a easy approach to the problem. But, as easy as it is to implement a central metadata systems, they bring scaling limits. Eventually the metadata gets too hot and needs to be partitioned. In fairness, it’s amazing how far central metadata can be scaled but eventually hot spots develop and it needs to be partitioned. For example, Google GFS just went down this path: GFS: Evolution on Fast-forward. Partitioning metadata is a fairly well understood problem. What makes it a bit of a challenge is making the metadata system adaptive and able to re-partition when hot spots develop.

 

MaxiScale took an interesting approach to scaling the metadata. They distributed the metadata servers over the same servers that store the data rather than implement a cluster of dedicated metadata servers. They do this by hashing on the parent directory to find what they call a Peer Set and then, in that particular Peer Set, they look up the object name in the metadata store, find the file block location, and then apply the operation to the blocks in that same Peer Set.

 

Distributing the metadata over the same Peer Set as the stored data means that each peer set is independent and self-describing. The downside of having a fixed hash over the peer sets is that it’s difficult to cool down an excessively hot peer set by moving objects since the hash is known by all clients.

 

MapReduce Support

I love the MaxiScale approach to multi-server administration. They need to provide customers the ability to easily maintain multi-server clusters. They could have implemented a separate control plane to manage the all the servers that make up the cluster but, instead, they just use Hadoop and run MapReduce jobs.

 

All administrative operations are written as simple MapReduce jobs which certainly made the implementation task easier but it’s also a nice, extensible interface to allow customers to write custom administrative operations. And, since MapReduce is available over the cluster, its super easy to write data mining and data analysis jobs. Supporting MapReduce over the storage system is a nice extension of normal filer semantics.

 

Small File Support

The standard file system access model is to probe the metadata to get the block list and then access the blocks to perform the file operation. In the common case, this will require two I/Os which is fine for large files but the two I/Os can be expensive for small file access. And, since most filers use fixed size blocks and, for efficiency these block size tend to be larger than the average small file, some space is wasted. The common approach to this two problems is to pull small files “up” and rather than store the list of storage blocks in the file metadata, just store the small file. This works fine for small files and avoids both the block fragmentation and the multiple I/O problem on small files. This is what MaxiScale has done as well and they claim single I/O for any small file stored in the system.

 

More data on the MaxiScale filer: Small Files, Big Headaches: Ensuring Peak Performance

 

I love solutions based upon low-cost, commodity H/W and application maintained redundancy and what MaxiScale is doing has many of the features I would like to see in a remote filer.

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdriona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

 

Monday, September 21, 2009 6:41:49 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Software
 Wednesday, September 16, 2009

ARM just announced a couple of 2-core SMP design based upon the Cortex-A9 application processor, one optimized for performance and the other for power consumption (http://www.arm.com/news/25922.html). Although the optimization points are different, both are incredibly low power consumers by server standards with the performance-optimized part dissipating only 1.9W at 2Ghz based upon the TSMC 40G process (40nm). This design is aimed at server applications and should be able to run many server workloads comfortably.

 

In Linux/Apache on ARM Processors I described an 8 server cluster of web servers running the Marvell MV78100. These are single core ARM design servers produced by Marvell. It’s a great demonstration system showing that web server workloads can be run cost effectively on ARM based servers. Toward the end of the blog entry, I observed:

 

The ARM is a clear win on work done per dollar and work done per joule for some workloads. If a 4-core, cache coherent version was available with a reasonable memory controller, we would have a very nice server processor with record breaking power consumption numbers.

 

I got a call from ARM soon after posting saying that I may get my wish sooner than I was guessing. Very cool. The Design that was announced earlier today includes a 2-core, performance optimized design that could form the building block of a very nice server. In the following block diagram, ARM  shows a pair of 2-core macros implementing a 4-way SMP:

Some earlier multi-core ARM designs such the Marvel MV78200 are not cache coherent which makes it difficult to support a single application utilizing both cores. As long as this design is coherent (and I believe it is), I love it.  

 

Technically it’s long been possible to build N-way SMP servers based upon the single core Cortex-A9 macros but it’s quite a bit of design work. The 2-way single macro makes it easy to deliver at least 2-core servers and this announcement shows that ARM is interested in and is investing in developing the ARM-based server market.

 

The ARM reported performance results:

 

In the ARM business model, the release of a design is the first and most important step towards parts becoming available from partners. However, it’s typically at least 12 months from design availability to first shipping silicone from partners so we won’t likely see components based upon this design until late 2010 at the earliest. I’m looking forward to it.

 

Our industry just keeps getting more interesting.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdriona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Wednesday, September 16, 2009 4:05:24 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
Hardware

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

Archive
<November 2009>
SunMonTueWedThuFriSat
25262728293031
1234567
891011121314
15161718192021
22232425262728
293012345

Categories
This Blog
Member Login
All Content © 2014, James Hamilton
Theme created by Christoph De Baene / Modified 2007.10.28 by James Hamilton