Saturday, March 05, 2011

Chris Page, Director of Climate & Energy Strategy at Yahoo!, spoke at the 2010 Data Center Efficiency Summit and presented the Yahoo! Compute Coop design.

 The primary attributes of the Yahoo! design are: 1) 100% free air cooling (no chillers), 2) slab concrete floor, 3) use of wind power to augment air handling units, and 4) pre-engineered building for construction speed.

 

Chris reports that the idea of orienting the building so that the prevailing wind pushes against one external wall, using this higher pressure to assist the air handling units, came from looking at farm buildings in the Buffalo, New York area. An example given was the use of natural cooling in chicken coops.

Chiller-free data centers are getting more common (Chillerless Data Center at 95F) but the design approach is still far from commonplace. The Yahoo! team reports they run the facility at 75F and use evaporative cooling to hold server approach temperatures below 80F. A common approach, and the one used by Microsoft in the chillerless datacenter at 95F example, is to install low-efficiency chillers for those rare days when they are needed, on the logic that low efficiency is cheap and perfectly acceptable for very rare use. The Yahoo! approach is to avoid the capital cost and power consumption of chillers entirely by allowing cold aisle temperatures to rise to 85F to 90F when they are unable to hold the temperature lower. They calculate they will only need to do this 34 hours a year, which is less than 0.4% of the year.
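To make the chiller-free control logic concrete, here is a minimal sketch of how such a cooling-mode decision might look. The temperature thresholds and the 5F evaporative approach are illustrative assumptions, not Yahoo!'s actual control system:

```python
def cooling_mode(outside_drybulb_f, outside_wetbulb_f,
                 target_cold_aisle_f=75.0, max_cold_aisle_f=90.0):
    """Illustrative chiller-free cooling decision (not Yahoo!'s actual logic).

    Returns the mode and the cold aisle temperature we expect to hold.
    """
    if outside_drybulb_f <= target_cold_aisle_f:
        # Outside air alone is cool enough: 100% air-side economization.
        return "free-air", target_cold_aisle_f
    # Evaporative cooling gets the supply air to roughly the wet-bulb
    # temperature plus an approach; assume a 5F approach for this sketch.
    evap_supply_f = outside_wetbulb_f + 5.0
    if evap_supply_f <= target_cold_aisle_f:
        return "evaporative-assist", target_cold_aisle_f
    # No chillers to fall back on: let the cold aisle rise, capped near 90F.
    return "ride-it-out", min(evap_supply_f, max_cold_aisle_f)

# Example: a warm, dry afternoon (made-up conditions)
print(cooling_mode(92.0, 68.0))   # ('evaporative-assist', 75.0)
```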

 

Reported results:

·         PUE of 1.08 with evaporative cooling, and they believe they could do better in colder climates

·         Saves ~36 million gallons of water/yr compared to their previous water-cooled chiller plant design without air-side economization

·         Saves ~8 million gallons of sewer discharge/yr (zero discharge design)

·         Lower construction cost (not quantified)

·         6 months from dirt to operating facility

 

Chris’ slides: http://dcee.svlg.org/images/stories/pdf/DCES10/airandwatercafe130.pdf (the middle slides of the 3 slide deck set)

 

Thanks to Vijay Rao for sending this my way.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Saturday, March 05, 2011 1:58:04 PM (Pacific Standard Time, UTC-08:00)  #    Comments [7] - Trackback
Hardware | Services
 Sunday, February 27, 2011

Datacenter temperatures have been ramping up rapidly over the last 5 years. In fact, leading operators have been pushing temperatures up so quickly that the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) recommendations have become a trailing indicator of what is being done rather than current guidance. ASHRAE responded in January of 2009 by raising the recommended limit from 77F to 80.6F (HVAC Group Says Datacenters Can be Warmer). This was a good move but many of us felt it was late and not nearly a big enough increment.  Earlier this month, ASHRAE announced they are again planning to take action and raise the recommended limit further but haven't yet announced by how much (ASHRAE: Data Centers Can be Even Warmer).

 

Many datacenters are operating reliably well in excess of even the newest ASHRAE recommended temperature of 81F. For example, back in 2009 Microsoft announced they were operating at least one facility without chillers at 95F server inlet temperatures.

 

As a measure of datacenter efficiency, we often use Power Usage Effectiveness. That's the ratio of the total power that arrives at the facility to the power actually delivered to the IT equipment (servers, storage, and networking). Clearly PUE says nothing about the efficiency of the servers themselves, which is even more important, but, just focusing on facility efficiency, we see that mechanical systems are the largest single power efficiency overhead in the datacenter.
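As a quick illustration of the metric (a sketch with made-up numbers, not measurements from any particular facility):

```python
def pue(total_facility_kw, it_load_kw):
    """Power Usage Effectiveness: total facility power / power delivered to IT gear."""
    return total_facility_kw / it_load_kw

# Hypothetical 10MW facility where 8.5MW actually reaches the IT equipment:
print(round(pue(10_000, 8_500), 2))   # 1.18 -- everything above 1.0 is overhead,
                                      # most of it mechanical (cooling) losses
```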

 

There are many innovative solutions for addressing the power lost to mechanical systems. Many of these innovations work well and others have great potential but none are as powerful as simply increasing the server inlet temperatures. Obviously less cooling is cheaper than more. And, the higher the target inlet temperatures, the higher the percentage of time that a facility can spend running on outside air (air-side economization) without process-based cooling.

 

The downsides of higher temperatures are 1) higher semiconductor leakage losses, 2) higher server fan speeds, which increase the losses to air moving, and 3) higher server mortality rates. I've measured the first and, although leakage losses are inarguably present and measurable, they have a very small impact at even quite high server inlet temperatures.  The negative impact of fan speed increases is real but can be mitigated via different server target temperatures and more efficient server cooling designs. If the servers are designed for higher inlet temperatures, the fans will be configured for these higher expected temperatures and won't run faster. This is simply a server design decision and good mechanical designs work well at higher server temperatures without increased power consumption. It's the third issue that remains the scary one: increased server mortality rates.

 

The net of these factors is that fear of increased server mortality is the prime factor slowing an even more rapid increase in datacenter temperatures. An often-quoted study reports that the failure rate of electronics doubles with every 10C increase in temperature (MIL-HDBK-217F). This data point is incredibly widely used by the military, the NASA space flight program, and in commercial electronic equipment design. I'm sure the work is excellent but it is a very old study, wasn't focused on a large datacenter environment, and the rule of thumb that has emerged from it is a simple, linear model of failure versus heat.

 

This linear failure model is helpful guidance but is clearly not correct across the entire possible temperature range. We know that at low temperatures, non-heat-related failure modes dominate. The failure rate of gear at 60F (16C) is not ½ the rate of gear at 79F (26C). As the temperature ramps up, we expect the failure rate to increase non-linearly. Really what we want to find is the knee of the curve: when does the cost of the increased failure rate start to approach the power savings achieved at higher inlet temperatures?  Knowing that the rule of thumb that a 10C increase doubles the failure rate is incorrect doesn't really help us. What is correct?

 

What is correct is a difficult data point to get for two reasons: 1) nobody wants to bear the cost of the experiment – the failure case could run several hundred million dollars, and 2) those that have explored higher temperatures and have the data aren’t usually interested in sharing these results since the data has substantial commercial value.

 

I was happy to see a recent Datacenter Knowledge article titled What's Next? Hotter Servers with 'Gas Pedals' (ignore the title – Rich just wants to catch your attention). This article includes several interesting tidbits on high temperature operation. Most notable was a quote from Subodh Bapat, the former VP of Energy Efficiency at Sun Microsystems, who says:

 

Take the data center in the desert. Subodh Bapat, the former VP of Energy Efficiency at Sun Microsystems, shared an anecdote about a data center user in the Middle East that wanted to test server failure rates if it operated its data center at 45 degrees Celsius – that’s 115 degrees Fahrenheit.

 

Testing projected an annual equipment failure rate of 2.45 percent at 25 degrees C (77 degrees F), and then an increase of 0.36 percent for every additional degree. Thus, 45C would likely result in annual failure rate of 11.45 percent. “Even if they replaced 11 percent of their servers each year, they would save so much on air conditioning that they decided to go ahead with the project,” said Bapat. “They’ll go up to 45C using full air economization in the Middle East.”

 

This study caught my interest for a variety of reasons. First and most importantly, they studied failure rates at 45C and decided that, although they were high, it still made sense for them to operate at these temperatures. It is reported they are happy to pay for the increased failure rate in order to save the cost of the mechanical gear. The second interesting data point from this work is that they found a far greater failure rate over the 20C of increased temperature than predicted by MIL-HDBK-217F. Consistent with 217F, they also report a linear relationship between temperature and failure rate. I would almost guarantee that the failure rate increase between 40C and 45C was much higher than the difference between 25C and 30C. I don't buy the linear relationship between failure rate and temperature based upon what we know of failure rates at lower temperatures. Many datacenters have raised temperatures between 20C (68F) and 25C (77F) and found no measurable increase in server failure rate. Some have raised their temperatures between 25C (77F) and 30C (86F) without finding the predicted 1.8% increase in failures, though the data remains sparse.
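For concreteness, here is a small sketch comparing the two rules of thumb in play: the study's linear 0.36 percentage points per degree and the MIL-HDBK-217F doubling-every-10C rule, both anchored at the reported 2.45% AFR at 25C. Neither model should be trusted across a wide temperature range; the code is only there to make the numbers easy to inspect:

```python
def afr_linear(temp_c, base_afr=2.45, base_temp=25.0, per_degree=0.36):
    """Linear model from the study: +0.36 percentage points of AFR per degree C.

    (The article's quoted 11.45% at 45C implies a slightly steeper slope.)
    """
    return base_afr + per_degree * (temp_c - base_temp)

def afr_doubling(temp_c, base_afr=2.45, base_temp=25.0):
    """MIL-HDBK-217F rule of thumb: failure rate doubles every 10C."""
    return base_afr * 2 ** ((temp_c - base_temp) / 10.0)

for t in (25, 30, 35, 40, 45):
    print(t, round(afr_linear(t), 2), round(afr_doubling(t), 2))
# 25 2.45 2.45 | 30 4.25 3.46 | 35 6.05 4.9 | 40 7.85 6.93 | 45 9.65 9.8
```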

 

The linear relationship is absolutely not there across a wide temperature range. But I still find the data super interesting in two respects: 1) they did see roughly a 9 percentage point increase in failure rate at 45C over the control case at 25C, and 2) even with the high failure rate, they made an economic decision to run at that temperature. The article doesn't say whether the failure rate is a lifetime failure rate or an Annual Failure Rate (AFR). Assuming it is an AFR, many workloads and customers could not live with an 11.45% AFR. Nonetheless, the data is interesting and it is good to see the first public report of an operator doing the study and reaching a conclusion that works for their workloads and customers.

 

The net of the whole discussion above: the current linear rules of thumb are wrong and don't usefully predict failure rates in the temperature ranges we care about, there is very little public data out there, industry groups like ASHRAE are behind, and it's quite possible that cooling specialists may not be the first to recommend we stop cooling datacenters aggressively :-). Generally, it's a sparse data environment out there but the potential economic and environmental gains are exciting, so progress will continue to be made. I'm looking forward to efficient and reliable operation at high temperature becoming a critical data point in server fleet purchasing decisions. Let's keep pushing.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, February 27, 2011 10:25:43 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
Hardware
 Sunday, February 20, 2011

Dileep Bhandarkar presented the keynote at the Server Design Summit last December.  I can never find the time to attend trade shows so I often end up reading slides instead. This one had lots of interesting tidbits so I'm posting a pointer to the talk and my rough notes here.

 

Dileep’s talk: Watt Matters in Energy Efficiency

 

My notes:

·         Microsoft Datacenter Capacities:

o   Quincy, WA: 550k sq ft, 27MW

o   San Antonio, TX: 477k sq ft, 27MW

o   Chicago, IL: 707k sq ft, 60MW

§  Containers on bottom floor with “medium reliability”  (no generators) and standard rooms on top floor with full power redundancy

o   Dublin, Ireland: 570k sq ft, 27MW

·         Speeds and feeds from Microsoft Consumer Cloud Services:

o   Windows Live: 500M IDs

o   Live Hotmail: 355M Active Accounts

o   Live Messenger: 303M users

o   Bing: 4B Queries/month

o   Xbox Live: 25M users    

o   adCenter: 14B Ads served/month

o   Exchange Hosted Services: 2 to 4B emails/day

·         Datacenter Construction Costs

o   Land: <2%

o   Shell: 5 to 9%

o   Architectural: 4 to 7%

o   Mechanical & Electrical: 70 to 85%

·         Summarizing the above list, we get 80% of the costs scaling with power consumption and 10 to 20% scaling with floor space. Reflect on that number and you’ll understand why I think the industry is nuts to be focusing on density. See Why Blade Servers Aren’t the Answer to All Questions for more detail on this point – I think it’s a particularly important one.

·         Reports that datacenter build costs are $10M to $20M per MW and server TCO is the biggest single category. I would use the low end of this spectrum for a cost estimator with inexpensive facilities in the $9M to $10M/MW range. See Overall Datacenter Costs.

·         PUE is a good metric for evaluating datacenter infrastructure efficiency but Microsoft uses best server performance per watt per TCO$

o   Optimize the server design and datacenter together

·         Cost-Reduction Strategies:

o   Server cost reduction:

§  Right size server: Low Power Processors often offer best performance/watt (see Very Low-Power Server Progress for more)

§  Eliminate unnecessary components (very small gain)

§  Use higher efficiency parts

§  Optimize for server performance/watt/$ (cheap and low power tend to win at scale)

o   Infrastructure cost reduction:

§  Operate at higher temperatures

§  Use free air cooling & Eliminate chillers (More detail at: Chillerless Datacenter at 95F)

§  Use advanced power management with power capping to support power over-subscription with peak protection (see the 2007 paper http://research.google.com/pubs/pub32980.html for the early work on this topic)

·         Custom Server Design:

o   2 socket, half-width server design (6.3”W x 16.7”L)

o   4x SATA HDD connectors

o   4x DIMM slots per CPU socket

o   2x 1GigE NIC

o   1x 16x PCIe slot

·         Custom Rack Design:

o   480 VAC 3p power directly to the rack (higher voltage over a given conductor size reduces losses in distribution)

o   Very tall 56 RU rack (over 98” overall height)

o   12VDC distribution within the rack from two combined power supplies with distributed UPS

o   Power Supplies (PSU)

§  Input is 480VAC 3p

§  Output: 12V DC

§  Servers are 12VDC only boards

§  Each PSU is 4.5kW

§  2 PSUs/rack so each rack is 9.0kW max

o   Distributed UPS

§  Each PSU includes a UPS made up of 4 groups of 40 13.2V batteries

§  Overall 160 discrete batteries per UPS

§  Technology not specified but based upon form factor and rough power estimate, I suspect they may be Li-ion 18650. See Commodity Parts for a note on these units.

o   By putting 2 PSUs per rack they avoid transporting low voltage (12VDC) further than 1/3 of a rack distance (under 1 yard) and only 4.5 KW is transported so moderately sized bus bars can be used.

o   Rack Networking:

§  Rack networking is interesting with 2 to 4 TOR switches per rack. We know the servers are 1GigE connected and there are up to 96 per rack, which yields two possibilities: 1) they are using 24-port GigE switches or 2) they are using 48-port GigE switches and a fat tree network topology (see VL2: A Scalable and Flexible Datacenter Network for more on fat trees). 24-port TORs are not particularly cost effective, so I would assume 48x1GigE TORs in a fat tree design, which is nice work.

o   Report that the CPUs can be lower power parts including Intel Atom, AMD Bobcat, and ARM (more on lower powered processors in server applications: http://perspectives.mvdirona.com/2011/01/16/NVIDIAProjectDenverARMPoweredServers.aspx)

·         Power proportionality: Shows a server at 0% load consuming 32% of peak power (this is an amazingly good server – 45% to 60% is much closer to the norm)
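A simple way to see what that idle floor costs: a common first-order model linearly interpolates between idle and peak power. The sketch below uses the 32% idle figure from the talk; the 300W peak and the utilization points are made up for illustration:

```python
def server_power_w(utilization, peak_w=300.0, idle_fraction=0.32):
    """First-order power model: idle floor plus a linear ramp to peak.

    utilization: 0.0 (idle) to 1.0 (fully loaded)
    """
    idle_w = peak_w * idle_fraction
    return idle_w + (peak_w - idle_w) * utilization

for u in (0.0, 0.1, 0.3, 1.0):
    print(u, round(server_power_w(u)))
# 0.0 96 | 0.1 116 | 0.3 157 | 1.0 300 -- at typical low utilization,
# most of the energy goes to just keeping the server on
```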

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, February 20, 2011 7:00:07 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
Services
 Friday, February 11, 2011

Ben Black always has interesting things on the go. He's now down in San Francisco working on his startup Fastip, which he describes as "an incredible platform for operating, exploring, and optimizing data networks." A couple of days ago Deepak Singh pointed me at a recent presentation of Ben's that I found interesting: Challenges and Trade-offs in Building a Web-scale Real-time Analytics System.

 

The problem described in this talk was "Collect, index, and query trillions of high dimensionality records with seconds of latency for ingestion and response." What Ben is doing is collecting per-flow networking data with TCP/IP 11-tuples (src_mac, dst_mac, src_IP, dest_IP, …) as the dimension data and, as metrics, he is tracking start usecs, end usecs, packets, octets, and UID. This data is interesting for two reasons: 1) networks are huge, massively shared resources and most companies haven't really a clue about the details of what traffic is clogging them and have only weak tools to understand what traffic is flowing – the data sets are so huge, the only hope is to sample them with solutions like Cisco's NetFlow. The second reason I find this data interesting is closely related: 2) it is simply vast and I love big data problems. So, it's an interesting problem in that it's both hard and very useful to solve.

 

Ben presented 3 possible solutions and why they don't work before offering a solution of his own. The failed approaches couldn't cope with the high dimensionality and the sheer volume of the dataset:

1.       HBase: Insert into HBase then retrieve all records in a time range and filter, aggregate, and sort

2.       Cassandra: Insert all records into Cassandra partitioned over a large cluster with each dimension indexed independently. Select qualifying records on each dimension, aggregate, and sort.

3.       Statistical Cassandra: Run a statistical sample over the data stored in Cassandra in the previous attempt.

 

The end solution proposed by the presenter is to treat it as an Online Analytic Processing (OLAP) problem. He describes OLAP as "A business intelligence (BI) approach to swiftly answer multi-dimensional analytics queries by structuring the data specifically to eliminate expensive processing at query time, even at a cost of enormous storage consumption." OLAP is essentially a mechanism to support fast query over high-dimensional data by pre-computing all interesting aggregations and storing the pre-computed results in a highly compressed form that is often kept memory resident. However, in this case, the data set is far too large to be practical for an in-memory approach.
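To make the indexing idea concrete, here is a minimal sketch of the bitmap-index flavor of this approach: one bitset per (dimension, value) pair, with queries answered by intersecting bitsets and aggregating only the qualifying rows. This is an illustration of the general technique with toy flow records, not Ben's implementation, which materializes low-dimensional cuboids over Cassandra:

```python
from collections import defaultdict

class BitmapIndex:
    """Toy bitmap index: one Python int used as a bitset per (dimension, value)."""
    def __init__(self):
        self.bitsets = defaultdict(int)   # (dimension, value) -> bitset of row ids
        self.rows = []                    # row id -> metrics dict

    def insert(self, dimensions, metrics):
        row_id = len(self.rows)
        self.rows.append(metrics)
        for dim, value in dimensions.items():
            self.bitsets[(dim, value)] |= 1 << row_id

    def query(self, metric, **predicates):
        """Sum `metric` over rows matching every dimension=value predicate."""
        matching = (1 << len(self.rows)) - 1          # start with all rows
        for dim, value in predicates.items():
            matching &= self.bitsets.get((dim, value), 0)
        return sum(m[metric] for i, m in enumerate(self.rows) if (matching >> i) & 1)

idx = BitmapIndex()
idx.insert({"src_ip": "10.0.0.1", "dst_port": 443}, {"octets": 1200})
idx.insert({"src_ip": "10.0.0.2", "dst_port": 443}, {"octets": 800})
idx.insert({"src_ip": "10.0.0.1", "dst_port": 80},  {"octets": 300})
print(idx.query("octets", src_ip="10.0.0.1"))                # 1500
print(idx.query("octets", src_ip="10.0.0.1", dst_port=443))  # 1200
```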

 

Ben draws from two papers to implement an OLAP based solution at this scale:

·         High-Dimensional OLAP: A Minimal Cubing Approach by Li, Han, and Gonzalez

·         Sorting improves word-aligned bitmap indexes by Lemire, Kaser, and Aouiche

 

The end solution is:

·         Insert records into Cassandra.

·         Materialize lower-dimensional cuboids using bitsets and then join as needed.

·         Perform all query steps directly in the database


Ben’s concluding advice:

·         Read the literature

·         Generic software is 90% wrong at scale, you just don’t know which 90%.

·         Iterate to discover and be prepared to start over

 

If you want to read more, the presentation is at: Challenges and Trade-offs in Building a Web-scale Real-time Analytics System.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Friday, February 11, 2011 5:41:36 AM (Pacific Standard Time, UTC-08:00)  #    Comments [12] - Trackback
Software
 Sunday, February 06, 2011

Last week, Sudipta Sengupta of Microsoft Research dropped by the Amazon Lake Union campus to give a talk on the flash memory work that he and the team at Microsoft Research have been doing over the past year.  It's super interesting work. You may recall Sudipta as one of the co-authors of the VL2 paper (VL2: A Scalable and Flexible Data Center Network) I mentioned last October.

 

Sudipta’s slides for the flash memory talk are posted at Speeding Up Cloud/Server Applications With Flash Memory and my rough notes follow:

·         Technology has been used in client devices for more than a decade

·         Server-side usage is more recent, and the differences between hard disk drive and flash characteristics bring some challenges that need to be managed in the on-device Flash Translation Layer (FTL) or in the operating system or application layers.

·         Server requirements are more aggressive across several dimensions including required random I/O rates and higher reliability and durability (data life) requirements.

·         Key flash characteristics:

·         10x more expensive than HDD

·         10x cheaper than RAM

·         Multi Level Cell (MLC): ~$1/GB

·         Single Level Cell (SLC): ~$3/GB

·         Laid out as a linear array of flash blocks where a block is often 128k and a page is 2k

·         Unfortunately the unit of erasure is a full block while the unit of read or write is 2k, and this makes the write-in-place technique used in disk drives unworkable.

·         Block erase is a fairly slow operation at 1500 usec whereas read or write is 10 to 100 usec.

·         Wear is an issue with SLC supporting O(100k) erases and MLC O(10k)     

·         The FTL is responsible for managing the mapping between logical pages and physical pages such that logical pages can be overwritten and hot page wear is spread relatively evenly over the device.

·         Roughly 1/3 the power consumption of a commodity disk and 1/6 the power of an enterprise disk

·         100x the ruggedness over disk drives when active

·         Research Project: FlashStore

·         Use flash memory as a cache between RAM and HDD

·         Essentially a flash-aware store where they implement a log-structured block store at the application layer (this is essentially what the FTL does in the device); a minimal sketch of this pattern follows after these notes

·         Changed pages are written through to flash sequentially and an in-memory index of pages is maintained so that pages can be found quickly on the flash device.

·         On failure the index structure can be recovered by reading the flash device

·         Recently unused pages are destaged asynchronously to disk

·         A key contribution of this work is a very compact form for the index into the flash cache

·         Performance results excellent and you can find them in the slides and the papers referenced below

·         Research Project: ChunkStash

·         A very high performance, high throughput key-value store

·         Tested on two production workloads:

·         Xbox Live Primetime online gaming

·         Storage deduplication

·         The storage deduplication test is a good one in that dedupe is most effective with a large universe of objects to run deduplication over. But a large universe requires a large index. The most interesting challenge of deduplication is to keep the index size small through aggressive compaction.

·         The slides include a summary of how dedupe works and show the performance and compression ratios they have achieved with ChunkStash
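As referenced in the FlashStore notes above, here is a minimal sketch of the log-structured pattern: writes are appended sequentially to a flash-resident log and a small in-memory index maps each key to its log offset, so overwrites never happen in place. It illustrates the general technique only; FlashStore's real index is a far more compact hash structure and cold pages are destaged to disk:

```python
import os

class TinyLogStore:
    """Toy log-structured key-value store: append-only log plus in-memory index."""
    def __init__(self, path):
        self.log = open(path, "ab+")
        self.index = {}                  # key -> (offset, length) in the log

    def put(self, key, value):
        record = key.encode() + b"\t" + value.encode() + b"\n"
        self.log.seek(0, os.SEEK_END)
        offset = self.log.tell()
        self.log.write(record)           # sequential append: flash friendly
        self.log.flush()
        self.index[key] = (offset, len(record))

    def get(self, key):
        offset, length = self.index[key]
        self.log.seek(offset)
        _, value = self.log.read(length).rstrip(b"\n").split(b"\t", 1)
        return value.decode()

store = TinyLogStore("/tmp/tiny_flash.log")
store.put("page42", "hello")
store.put("page42", "hello v2")          # overwrite appends; index points at newest copy
print(store.get("page42"))               # hello v2
```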

 

For those interested in digging deeper, the VLDB and USENIX papers are the right next stops:

·         http://research.microsoft.com/apps/pubs/default.aspx?id=141508 (FlashStore paper, VLDB 2010)

·         http://research.microsoft.com/apps/pubs/default.aspx?id=131571 (ChunkStash paper, USENIX ATC 2010)

·         Slides: http://mvdirona.com/jrh/talksandpapers/flash-amazon-sudipta-sengupta.pdf

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, February 06, 2011 1:48:10 PM (Pacific Standard Time, UTC-08:00)  #    Comments [10] - Trackback
Hardware | Software
 Sunday, January 16, 2011

NVIDIA has been an ARM licensee for quite some time now.  Back in 2008 they announced Tegra, an embedded client processor including an ARM core and NVIDIA graphics aimed at smartphones and mobile handsets. 10 days ago, they announced Project Denver, where they are building high-performance ARM-based CPUs designed to power systems ranging from "personal computers and servers to workstations and supercomputers". This is interesting for a variety of reasons: first, they are entering the server CPU market; second, NVIDIA is joining Marvell and Calxeda (previously Smooth-Stone) in taking the ARM architecture and targeting server-side computing.

 

ARM is an interesting company in that they produce designs and these designs get adapted by licensees including Texas Instruments, Samsung, Qualcomm, and even unlikely players such as Microsoft. These licensees may fab their own parts or be fab-less design shops depending upon others for volume production. Unlike Intel, ARM doesn't really produce chips, focusing instead just on design.

 

ARM has become an incredibly important instruction set architecture powering smartphones, low-end network routers, printers, copiers, tablets, and other embedded applications. But things are changing: ARM is now producing designs appropriate for server-side computing at the same time that power consumption is becoming a key measure of server-side computing cost. The ARM design team are masters of low-power designs and generations of ARM parts have focused on power management. ARM has an impressively efficient design.

 

Linux powers many of the ARM-based devices mentioned above and consequently Linux runs well on ARM processors. Completing the picture, Microsoft has announced that the next version of windows will also support ARM devices: Microsoft Announced Support of System on a Chip Architectures from Intel, AMD, and ARM for next Version of Windows.

 

Other articles on ARM and micro-slice computing:

·         CEMS: Low-Cost, Low-power Servers for Internet Scale Services

·         A Fast Array of Wimpy Nodes

·         Very Low-Cost, Low-Power Servers

·         Linux/Apache on ARM Processors

·         ARM Cortex-A9 SMP Design Announced

Other articles on the NVIDIA announcement:

·         Nvidia Developing ARM Processors for Servers

·         Nvidia Turns to ARM for Server Chips to Kill Intel

A closely related article from Barry Evans and Karl Freund of the ARM-powered server startup Calxeda: http://gigaom.com/cloud/look-out-your-data-center-is-about-to-change-again/.

 

We are on track for renewed competition in the server-side computing market segment and intense competition on power efficiency at the same time as internet-scale service operators are willing to run whatever processor is least expensive and most power efficient. With competition comes innovation and I see a good year coming.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, January 16, 2011 6:06:38 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Thursday, January 13, 2011

If you have experience in database core engine development, either professionally, on open source, or at university, send me your resume. When I joined the DB world 20 years ago, the industry was young and the improvements were coming ridiculously fast.  In a single release we improved DB2 TPC-A performance by a factor of 10x. Things were changing quickly industry-wide.  These days single-server DBs are respectably good. It's a fairly well understood space. Each year more features are added and a few percent performance improvement may happen but the code bases are monumentally large, many of the development teams are over 1,000 engineers, and things are happening anything but quickly.

 

If you are an excellent engineer, have done systems or DB work in the past, and are interested in working on the next decade’s problems in database, drop me a note (james@amazon.com).

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

Thursday, January 13, 2011 8:12:54 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Sunday, January 09, 2011

Megastore is the data engine supporting Google App Engine. It's a scalable structured data store providing full ACID semantics within partitions but lower consistency guarantees across partitions.

 

I wrote up some notes on it back in 2008 (Under the Covers of the App Engine Datastore) and posted Phil Bernstein's excellent notes from a 2008 SIGMOD talk: Google Megastore. But there has been remarkably little written about this datastore over the intervening couple of years, until this year's CIDR conference papers were posted. CIDR 2011 includes Megastore: Providing Scalable, Highly Available Storage for Interactive Services.

 

My rough notes from the paper:

·         Megastore is built upon BigTable

·         Bigtable supports fault-tolerant storage within a single datacenter

·         Synchronous replication based upon Paxos and optimized for long distance inter-datacenter links

·         Partitioned into a vast space of small databases each with its own replicated log

·         Each log stored across a Paxos cluster

·         Because they are so aggressively partitioned, each Paxos group only has to accept logs for operations on a small partition. However, the design does serialize updates on each partition

·         3 billion writes and 20 billion read transactions per day

·         Support for consistency is unusual for a NoSQL database but is driven by (what I believe to be) the correct belief that inconsistent updates make many applications difficult to write (see I Love Eventual Consistency but …)

·         Data Model:

·         The data model is declared in a strongly typed schema

·         There are potentially many tables per schema

·         There are potentially many entities per table

·         There are potentially many strongly typed properties per entity

·         Repeating properties are allowed

·         Tables can be arranged hierarchically where child tables point to root tables

·         Megastore tables are either entity group root tables or child tables

·         The root table and all child tables are stored in the same entity group

·         Secondary indexes are supported

·         Local secondary indexes index a specific entity group and are maintained consistently

·         Global secondary indexes index across entity groups, are asynchronously updated, and are eventually consistent

·         Repeated indexes: supports indexing repeated values (e.g. photo tags)

·         Inline indexes provide a way to denormalize data from source entities into a related target entity as a virtual repeated column.

·         There are physical storage hints:

·         “IN TABLE” directs Megastore to store two tables in the same underlying BigTable

·         “SCATTER” attribute prepends a 2 byte hash to each key to cool hot spots on tables with monotonically increasing values like dates (e.g. a history table).

·         “STORING” clause on an index supports index-only-access by redundantly storing additional data in an index. This avoids the double access often required of doing a secondary index lookup to find the correct entity and then selecting the correct properties from that entity through a second table access. By pulling values up into the secondary index, the base table doesn’t need to be accessed to obtain these properties.

·         3 levels of read consistency:

·         Current: Last committed value

·         Snapshot: Value as of start of the read transaction

·         Inconsistent reads: used for cross entity group reads

·         Update path:

·         A transaction writes its mutations to the entity group's write-ahead log and then applies the mutations to the data (write-ahead logging).

·         A write transaction always begins with a current read to determine the next available log position. The commit operation gathers mutations into a log entry, assigns an increasing timestamp, and appends it to the log, which is maintained using Paxos (a minimal sketch of this pattern follows these notes).

·         Update rates within an entity group are seriously limited by:

·         When there is log contention, one wins and the rest fail and must be retried.

·         Paxos only accepts a very limited update rate (order 10^2 updates per second).

·         Paper reports that “limiting updates within an entity group to a few writes per second per entity group yields insignificant write conflicts”

·         Implication: programmers must shard aggressively to get even moderate update rates, and consistent updates across shards are only supported using two-phase commit, which is not recommended.

·         Cross entity group updates are supported by:

·         Two-phase commit with the fragility that it brings

·         Queueing and asynchronously applying the changes

·         Excellent support for backup and redundancy:

·         Synchronous replication to protect against media failure

·         Snapshots and incremental log backups
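As referenced in the update-path notes above, here is a minimal sketch of the optimistic, log-position-based commit pattern: read the current log head, build a log entry, try to claim the next position, and retry the whole transaction on contention. The in-memory log and single-writer check below are stand-ins for what Megastore actually does with Bigtable and Paxos:

```python
import random, time

class EntityGroupLog:
    """Toy stand-in for one entity group's replicated write-ahead log."""
    def __init__(self):
        self.entries = []                 # committed log entries, in order

    def next_position(self):
        return len(self.entries)

    def append_at(self, position, entry):
        """Claim `position`; fails if another writer got there first."""
        if position != len(self.entries):
            return False                  # log contention: one writer wins, others retry
        self.entries.append(entry)
        return True

def commit(log, mutations, max_retries=5):
    for attempt in range(max_retries):
        position = log.next_position()    # the "current read" of the log head
        entry = {"pos": position, "ts": time.time(), "mutations": mutations}
        if log.append_at(position, entry):
            return position               # committed; mutations are then applied to the data
        time.sleep(random.uniform(0, 0.01 * (attempt + 1)))  # back off and retry
    raise RuntimeError("too much write contention in this entity group")

log = EntityGroupLog()
print(commit(log, [("user/42", "name", "Ada")]))    # 0
print(commit(log, [("user/42", "email", "a@b")]))   # 1
```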

 

Overall, an excellent paper with lots of detail on a nicely executed storage system. Supporting consistent reads and full ACID update semantics is impressive, although not being able to update an entity group at more than a "few writes per second" is constraining.

 

The paper: http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf.

 

Thanks to Zhu Han, Reto Kramer, and Chris Newcombe for all sending this paper my way.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, January 09, 2011 10:38:39 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Services
 Thursday, December 16, 2010

Years ago I incorrectly believed special-purpose hardware was a bad idea. What was actually a bad idea was high-markup, special-purpose devices sold at low volume through expensive channels. Hardware implementations are often the best value measured in work done per dollar and work done per joule. The newest breed of commodity networking parts from Broadcom, Fulcrum, Dune (now Broadcom), and others is a beautiful example of Application Specific Integrated Circuits being the right answer for extremely hot code kernels that change rarely.

 

I've long been interested in highly parallel systems and in heterogeneous processing. General Purpose Graphics Processors are firmly hitting the mainstream with 17 of the Top 500 now using GPGPUs (Top 500: Chinese Supercomputer Reigns). You can now rent GPGPU clusters from EC2 at $2.10/server/hour where each server has dual NVIDIA Tesla M2050 GPUs delivering a TeraFLOP per node. For more on GPGPUs, see HPC in the Cloud with GPGPUs and GPU Clusters in 10 minutes.

 

Some time ago Zach Hill sent me a paper writing up Radix sort using GPGPUs. The paper shows how to achieve a better than 3x speedup on NVIDIA GT200-based systems. For most of us, sort isn't the most important software kernel we run, but I did find the detail behind the GPGPU-specific optimizations interesting. The paper is at http://www.mvdirona.com/jrh/TalksAndPapers/RadixSortTRv2.pdf and the abstract is below.

 

Abstract.

 

This paper presents efficient strategies for sorting large sequences of fixed-length keys (and values) using GPGPU stream processors. Compared to the state-of-the-art, our radix sorting methods exhibit speedup of at least 2x for all generations of NVIDIA GPGPUs, and up to 3.7x for current GT200-based models. Our implementations demonstrate sorting rates of 482 million key-value pairs per second, and 550 million keys per second (32-bit). For this domain of sorting problems, we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture. These results motivate a different breed of parallel primitives for GPGPU stream architectures that can better exploit the memory and computational resources while maintaining the flexibility of a reusable component. Our sorting performance is derived from a parallel scan stream primitive that has been generalized in two ways: (1) with local interfaces for producer/consumer operations (visiting logic), and (2) with interfaces for performing multiple related, concurrent prefix scans (multi-scan).

 

As part of this work, we demonstrate a method for encoding multiple compaction problems into a single, composite parallel scan. This technique yields a 2.5x speedup over bitonic sorting networks for small problem instances, i.e., sequences that can be entirely sorted within the shared memory local to a single GPU core.

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Thursday, December 16, 2010 9:06:43 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Hardware
 Monday, December 06, 2010

Even working in Amazon Web Services, I’m finding the frequency of new product announcements and updates a bit dizzying. It’s amazing how fast the cloud is taking shape and the feature set is filling out. Utility computing has really been on fire over the last 9 months. I’ve never seen an entire new industry created and come fully to life this fast. Fun times.


Before joining AWS, I used to say that I had an inside line on what AWS was working on and what new features were coming in the near future.  My trick? I went to AWS customer meetings and just listened. AWS delivers what customers are asking for with such regularity that it's really not all that hard to predict new product features soon to be delivered. This trend continues with today's announcement. Customers have been consistently asking for a Domain Name Service and, today, AWS is announcing the availability of Route 53, a scalable, highly redundant, reliable, global DNS service.

 

The Domain Name System is essentially a global, distributed database that allows various pieces of information to be associated with a domain name.  In the most common case, DNS is used to look up the numeric IP address for a domain name. So, for example, I just looked up Amazon.com and found that one of the addresses being used to host Amazon.com is 207.171.166.252. And, when your browser accessed this blog (assuming you came here directly rather than using RSS) it would have looked up perspectives.mvdirona.com to get an IP address. This mapping is stored in a DNS "A" (address) record. Other popular DNS records are CNAME (canonical name), MX (mail exchange), and SPF (Sender Policy Framework). A full list of DNS record types is at: http://en.wikipedia.org/wiki/List_of_DNS_record_types. A quick lookup example follows the list of record types below. Route 53 currently supports:

                    A (address record)

                    AAAA (IPv6 address record)

                    CNAME (canonical name record)

                    MX (mail exchange record)

                    NS (name server record)

                    PTR (pointer record)

                    SOA (start of authority record)

                    SPF (sender policy framework)

                    SRV (service locator)

                    TXT (text record)
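As a quick illustration of the A-record lookup described above, the sketch below resolves names using the Python standard library and whatever resolver the local host is configured with; the hostnames are just examples:

```python
import socket

def lookup_a_records(hostname):
    """Return the IPv4 addresses (DNS A records) the resolver reports for a hostname."""
    results = socket.getaddrinfo(hostname, None, socket.AF_INET)
    return sorted({sockaddr[0] for _, _, _, _, sockaddr in results})

print(lookup_a_records("amazon.com"))
print(lookup_a_records("perspectives.mvdirona.com"))
```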

 

DNS, on the surface, is fairly simple and is easy to understand. What is difficult with DNS is providing absolute rock-solid stability at scales ranging from a request per day on some domains to billions on others. Running DNS rock-solid, low-latency, and highly reliable is hard.  And it’s just the kind of problem that loves scale. Scale allows more investment in the underlying service and supports a wide, many-datacenter footprint.

 

The AWS Route 53 Service is hosted in a global network of edge locations including the following 16 facilities:

·         United States

                    Ashburn, VA

                    Dallas/Fort Worth, TX

                    Los Angeles, CA

                    Miami, FL

                    New York, NY

                    Newark, NJ

                    Palo Alto, CA

                    Seattle, WA

                    St. Louis, MO

·         Europe

                    Amsterdam

                    Dublin

                    Frankfurt

                    London

·         Asia

                    Hong Kong

                    Tokyo

                    Singapore

 

Many DNS lookups are resolved in local caches but, when there is a cache miss, the request needs to be routed back to an authoritative name server.  The right approach to answering these requests with low latency is to route to the nearest datacenter hosting an appropriate DNS server.  In Route 53 this is done using anycast. Anycast is a cool routing trick where the same IP address range is advertised to be at many different locations. Using this technique, the same IP address range is advertised as being in each of the world-wide fleet of datacenters. This results in the request being routed to the nearest facility from a network perspective.

 

Route 53 routes to the nearest datacenter to deliver low-latency, reliable results. This is good but Route 53 is not the only DNS service that is well implemented over a globally distributed fleet of datacenters. What makes Route 53 unique is that it's a cloud service. Cloud means the price is advertised rather than negotiated.  Cloud means you make an API call rather than talking to a sales representative. Cloud means it's a simple API and you don't need professional services or a customer support contact. And cloud means it's running NOW rather than tomorrow morning when the administration team comes in. Offering a rock-solid service is half the battle but it's the cloud aspects of Route 53 that are most interesting.

 

Route 53 pricing is advertised and available to all:

·         Hosted Zones: $1 per hosted zone per month

·         Requests: $0.50 per million queries for the first billion queries per month and $0.25 per million queries over 1B per month
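A quick sketch of what that pricing works out to for a hypothetical customer (the zone count and query volume below are invented; check the Route 53 page for current rates):

```python
def route53_monthly_cost(hosted_zones, queries_per_month):
    """Estimate monthly cost from the posted pricing: $1/zone plus
    $0.50 per million queries for the first billion and $0.25 per million after that."""
    first_tier = min(queries_per_month, 1_000_000_000)
    overflow = max(queries_per_month - 1_000_000_000, 0)
    return hosted_zones * 1.00 + first_tier / 1e6 * 0.50 + overflow / 1e6 * 0.25

# A hypothetical site with 3 hosted zones serving 250M queries/month:
print(route53_monthly_cost(3, 250_000_000))   # 128.0
```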

 

You can have it running in less time than it took to read this posting. Go to: ROUTE 53 Details. You don’t need to talk to anyone, negotiate a volume discount, hire a professional service team, call the customer support group, or wait until tomorrow. Make the API calls to set it up and, on average, 60 seconds later you are fully operating.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Monday, December 06, 2010 5:37:15 AM (Pacific Standard Time, UTC-08:00)  #    Comments [7] - Trackback
Services
 Monday, November 29, 2010

I love high-scale systems and, more than anything, I love data from real systems. I’ve learned over the years that no environment is crueler, less forgiving, or harder to satisfy than real production workloads. Synthetic tests at scale are instructive but nothing catches my attention like data from real, high-scale, production systems. Consequently, I really liked the disk population studies from Google and CMU at FAST2007 (Failure Trends in a Large Disk Population, Disk Failures in the Real World: What does a MTBF of 100,000 hours mean to you).  These two papers presented actual results from independent production disk populations of 100,000 each. My quick summary of these 2 papers is basically “all you have learned about disks so far is probably wrong.”

 

Disks are often the #1 or #2 failing component in a storage system, usually just ahead of memory. Occasionally fan failures lead disks but that isn't the common case. We now have publicly available data on disk failures but not much has been published on other component failure rates and even less on overall storage stack failure rates. Cloud storage systems are multi-tiered, distributed systems involving 100s to even 10s of thousands of servers and huge quantities of software. Modeling the failure rates of discrete components in the stack is difficult but, with the large amount of component failure data available to large fleet operators, it can be done. What's much more difficult to model are correlated failures.

 

Essentially, there are two challenges encountered when attempting to model overall storage system reliability: 1) availability of component failure data and 2) correlated failures. The former is available to very large fleet owners but is often unavailable publicly. Two notable exceptions are the disk reliability data from the two FAST'07 conference papers mentioned above. Other than these two data points, there is little credible component failure data publicly available.  Admittedly, component manufacturers do publish MTBF data but these data are often owned by the marketing rather than engineering teams and they range between optimistic and epic works of fiction.

 

Even with good quality component failure data, modeling storage system failure modes and data durability remains incredibly difficult. What makes this hard is the second issue above: correlated failure. Failures don’t always happen alone, many are correlated, and certain types of rare failures can take down the entire fleet or large parts of it. Just about every model assumes failure independence and then works out data durability to many decimal points. It makes for impressive models with long strings of nines but the challenge is the model is only as good as the input. And one of the most important model inputs is the assumption of component failure independence which is violated by every real-world system of any complexity. Basically, these failure models are good at telling you when your design is not good enough but they can never tell you how good your design actually is nor whether it is good enough.
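To see why the independence assumption matters so much, here is a small Monte Carlo sketch comparing the annual loss probability of a 3-way replicated object under purely independent node failures versus the same setup with an added correlated failure mode (say, a shared power or software event). The failure rates and the correlated-event probability are invented purely to illustrate the effect, and repair is ignored:

```python
import random

def p_all_replicas_lost(afr=0.05, replicas=3, p_correlated=0.0, trials=200_000):
    """Monte Carlo estimate of the probability that all replicas fail in a year.

    afr          -- assumed independent annual failure probability per node
    p_correlated -- probability of an event that takes out all replicas at once
    """
    losses = 0
    for _ in range(trials):
        correlated_event = random.random() < p_correlated
        all_independent = all(random.random() < afr for _ in range(replicas))
        if correlated_event or all_independent:
            losses += 1
    return losses / trials

random.seed(1)
print(p_all_replicas_lost())                      # ~1e-4: afr^3, the "many nines" story
print(p_all_replicas_lost(p_correlated=0.002))    # ~2e-3: the correlated mode dominates
```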

 

Where the models break down is in modeling rare events and non-independent failures. The best way to understand common correlated failure modes is to study storage systems at scale over longer periods of time. This won't help us understand the impact of very rare events. For example, two thousand years of history would not have helped us model or predict that an airplane would be flown into the World Trade Center.  And certainly the odds of it happening again 16 minutes and 20 seconds later would have seemed close to impossible. Studying historical storage system failure data will not help us understand the potential negative impact of very rare black swan events but it does help greatly in understanding the more common failure modes, including correlated or non-independent failures.

 

Murray Stokely recently sent me Availability in Globally Distributed Storage Systems, which is the work of a team from Google and Columbia University. They look at a high-scale storage system at Google that includes multiple clusters of Bigtable, which is layered over GFS, which is in turn implemented as a user-mode application over the Linux file system. You might remember Stokely from a post I did back in March titled Using a Market Economy. In this more recent paper, the authors study 10s of Google storage cells, each of which comprises between 1,000 and 7,000 servers, over a 1-year period. The storage cells studied are from multiple datacenters in different regions being used by different projects within Google.

 

I like the paper because it is full of data on a high-scale production system and it reinforces many key distributed storage system design lessons including:

·         Replicating data across multiple datacenters greatly improves availability because it protects against correlated failures.

o    Conclusion: Two way redundancy in two different datacenters is considerably more durable than 4 way redundancy in a single datacenter.

·         Correlation among node failures dwarfs all other contributions to unavailability in production environments.

·         Disk failures can result in permanent data loss but transitory node failures account for the majority of unavailability.

 

To read more: http://research.google.com/pubs/pub36737.html

 

The abstract of the paper:

Highly available cloud storage is often implemented with complex, multi-tiered distributed systems built on top of clusters of commodity servers and disk drives. Sophisticated management, load balancing and recovery techniques are needed to achieve high performance and availability amidst an abundance of failure sources that include software, hardware, network connectivity, and power issues. While there is a relative wealth of failure studies of individual components of storage systems, such as disk drives, relatively little has been reported so far on the overall availability behavior of large cloud-based storage services. We characterize the availability properties of cloud storage systems based on an extensive one year study of Google's main storage infrastructure and present statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies. With these models we compare data availability under a variety of system parameters given the real patterns of failures observed in our fleet.

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Monday, November 29, 2010 5:55:51 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Sunday, November 21, 2010

Achieving a PUE of 1.10 is a challenge under any circumstances, and the vast majority of facilities that do approach this mark are using air-side economization, essentially using outside air to cool the facility. Air-side economization brings some complexities such as requiring particulate filters and being less effective in climates that are both hot and humid. Nonetheless, even with the challenges, air-side economization is one of the best techniques, if not the best, for improving datacenter efficiency.

 

As a heat transport, water is both effective and efficient. The challenges of using water in open-circuit datacenter cooling designs are largely social and regulatory. Most of the planet is enshrouded by two fluids, air and water. It's a bit ironic that emitting heat into one of the two is considered uninteresting while emitting heat into the other brings much more scrutiny. Dumping heat into air concerns few and is the cooling technique of choice for nearly all data centers (Google Saint-Ghislain Belgium being a notable exception). However, if the same heat in the same quantities is released in water, it causes considerably more environmental concern even though it's the same amount of heat being released to the environment. Heat pollution is the primary concern.

 

The obvious technique to avoid some of the impact of thermal pollution is massive dilution. Use seawater or very high volumes of river water such that the local change is immeasurably small. Seawater has been used in industrial and building cooling applications. Seawater brings challenges in that it is incredibly corrosive which drives up build and maintenance costs but it has been used in a wide range of high-scale applications with success. Freshwater cooling relieves some of the corrosion concerns and has been used effectively for many large-scale cooling requirements including nuclear power plants. I’ve noticed there is often excellent fishing downstream of these facilities so there clearly is substantial environmental impact caused by these thermal emissions but this need  not be the case. There exist water cooling techniques with far less environmental impact.

 

For example, the cities of Zurich and Geneva and the universities ETH and Cornell use water for some of their heating and cooling requirements. This technique is effective and its impact on the environment can be made arbitrarily small. In a slightly different approach, the city of Toronto employs deep lake water cooling to cool buildings in its downtown core. In this design, the water intake is 3.1 miles off shore at a depth of 272'. The city of Toronto avoids any concerns about thermal pollution by using the exhaust water from the cooling system as their utility water intake so the slightly warmer water is not directly released back into the environment.

 

Given the advantages of water over air in cooling applications and given that the environmental concerns can be mitigated, why not use the technique more broadly in datacenters? One of the prime reasons is that water is not always available. Another is that regulatory concerns bring more scrutiny and even excellent designs without measurable environmental impact will still take longer to get approved than conventional air-cooled approaches. However it can be done and it does produce a very power-efficient facility. The Deepgreen datacenter project in Switzerland is perhaps the best example I've seen so far.

Before looking at the innovative mechanical systems used in Deepgreen, the summary statistics look excellent:

·         46MW with 129k sq ft of raised floor (with upgrade to 70MW possible)

·         Estimated PUE of 1.1

·         Hydro and nuclear sourced power

·         356 W/sq ft average

·         5,200 racks with average rack power of 8.8kW and maximum rack power of 20kW

·         Power cost: $0.094/kWh (compares well across the EU)

·         28 mid-voltage 2.5 MW generators with 48 hours of onsite diesel

 

The 46MW facility is located in the heart of Switzerland on Lake Walensee:

 


Google Maps: http://maps.google.com/maps?f=q&source=s_q&hl=de&geocode=&q=Werkhof+Bi%C3%A4sche,+Mollis,+Schweiz&sll=37.0625,-95.677068&sspn=45.601981,84.990234&ie=UTF8&hq=&hnear=Werkhof,+8872+Mollis,+Glarus,+Schweiz&ll=47.129016,9.09462&spn=0.307857,0.663986&t=p&z=11

The overall datacenter is well engineered but it is the mechanical systems that are most worthy of note. Following the diagram below, this facility is cooled using 43F source water drawn from 197' below the surface. The source water is brought in through dual redundant intake pipes to the pumping station, which has 6 high-capacity pumps in a 2*(N+1) configuration. The pumps move 668,000 gallons per hour at full cooling load.
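A useful back-of-envelope check on those numbers: the temperature rise across the heat exchange is the heat load divided by the mass flow times the specific heat of water (ΔT = P / (ṁ·cp)). The sketch below is only that first-order physics; it assumes US gallons, puts the entire load into the lake water, and ignores exchanger approach temperatures:

```python
GALLON_LITERS = 3.785          # US gallon
WATER_SPECIFIC_HEAT = 4186.0   # J/(kg*K)

def lake_water_delta_t(load_mw, gallons_per_hour):
    """First-order temperature rise of the cooling water for a given heat load."""
    kg_per_s = gallons_per_hour * GALLON_LITERS / 3600.0   # ~1 kg per liter
    return load_mw * 1e6 / (kg_per_s * WATER_SPECIFIC_HEAT)

# Full pump flow from the figures above, at full build-out and at partial load:
print(round(lake_water_delta_t(46, 668_000), 1))   # ~15.6C rise at the full 46MW
print(round(lake_water_delta_t(10, 668_000), 1))   # ~3.4C at a 10MW partial load
```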

 

The fairly clean lake water is run through a heat exchanger (not shown in the diagram below) to cool the closed-system chilled water loop used in the datacenter. The use of a heat exchanger avoids bringing impurities or life forms into the datacenter chilled water loop. The chilled water loop forms part of a conventional within-the-datacenter cooling system design. The difference is they have completely eliminated process-based cooling (air conditioning) and water towers, avoiding both the purchase cost and the power this equipment would have consumed. In the diagrams below you'll see the Deepgreen design followed by a conventional datacenter cooling system for comparison purposes:

 

Conventional datacenter mechanical system for comparison:

 

The Deepgreen facility mitigates the impact of thermal pollution through a combination of dilution, low deltaT, and deep water release.

 

I've been conversing with Andre Oppermann, the CTO of Deepgreen, for nearly a year on this project. Early on, I was skeptical they would be able to work through the environmental concerns in any reasonable time frame. I wasn't worried about the design – it's well engineered. My concerns were primarily centered around slowdowns in permitting and environmental impact studies. They have done a good job of working through those issues and I really like the resultant design. Thanks to Andre for sending all this detail my way. It's a super interesting project and I'm glad we can now talk about it publicly.

 

If you are interested in a state of the art facility in Switzerland, I recommend you contact Andre, the CTO of Deepgreen, at: oppermann@deepgreen.ch.

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, November 21, 2010 9:19:42 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Saturday, November 20, 2010

I’m interested in low-cost, low-power servers and have been watching the emerging market for these systems since 2008 when I wrote CEMS: Low-Cost, Low-Power Servers for Internet Scale Services (paper, talk).  ZT Systems just announced the R1081e, a new ARM-based server with the following specs:

·         STMicroelectronics SPEAr 1310 with dual ARM® Cortex™-A9 cores

·         1 GB of 1333MHz DDR3 ECC memory embedded

·         1 GB of NAND Flash

·         Ethernet connectivity

·         USB

·         SATA 3.0

 It’s a shared infrastructure design where each 1RU module has 8 of the above servers. Each module includes:

·         8 “System on Modules“ (SOMs)

·         ZT-designed backplane for power and connectivity

·         One 80GB SSD per SOM

·         IPMI system management

·         Two Realtek 4+1 1Gb Ethernet switches on board with external uplinks

·         Standard 1U rack mount form factor

·         Ubuntu Server OS

·         250W 80+ Bronze Power Supply

 

Each module is under 80W, so a rack with 40 compute modules would draw only 3.2kW for 320 low-power servers and a total of 640 cores/rack. Weaknesses of this approach are: only 2 cores per server, only 1GB of memory per 2-core server, and the cores appear to be only 600MHz (http://www.32bitmicro.com/manufacturers/65-st/552-spear-1310-dual-cortex-a9-for-embedded-applications-). Four-core ARM parts and larger physical memory support are both under development.
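The rack-level math works out as follows (my arithmetic, using the per-module figures from the spec list above):

# Rack totals for the R1081e configuration described above.
modules_per_rack = 40
servers_per_module = 8
cores_per_server = 2
watts_per_module = 80        # "under 80W", so treat this as an upper bound
gb_per_server = 1.0

servers = modules_per_rack * servers_per_module       # 320 servers per rack
cores = servers * cores_per_server                    # 640 cores per rack
print(modules_per_rack * watts_per_module / 1000)     # <= 3.2 kW per rack
print(cores)                                          # 640
print(gb_per_server / cores_per_server)               # 0.5 GB of memory per core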

 

Competitors include SeaMicro with an Atom-based design (SeaMicro Releases Innovative Intel Atom Server) and the recently renamed Calxeda (previously Smooth-Stone), which has an ARM-based product under development.

 

Other notes on low-cost, low-powered servers:

·         SeaMicro Releases Innovative Intel Atom Server

·         When Very Low-Power, Low-Cost Servers don’t make sense

·         Very Low-Power Server Progress

·         The Case for Low-Cost, Low-Power Servers

·         2010 the Year of the Microslice Servers

·         Linux/Apache on ARM processors

·         ARM Cortex-A9 SMP Design Announced

  

From Datacenter Knowledge: New ARM-Based Server from ZT systems

 

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Saturday, November 20, 2010 8:06:34 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Hardware
 Wednesday, November 17, 2010

Earlier this week Clay Magouyrk sent me a pointer to some very interesting work: A Couple More Nails in the Coffin of the Private Compute Cluster: Benchmarks for the Brand New Cluster GPU Instance on Amazon EC2.

 

This article presents detailed benchmark results from runs on the new Cluster GPU Instance type and leads in with:

During the past few years it has been no secret that EC2 has been best cloud provider for massive scale, but loosely connected scientific computing environments. Thankfully, many workflows we have encountered have performed well within the EC2 boundaries. Specifically, those that take advantage of pleasantly parallel, high-throughput computing workflows. Still, the AWS approach to virtualization and available hardware has made it difficult to run workloads which required high bandwidth or low latency communication within a collection of distinct worker nodes. Many of the AWS machines used CPU technology that, while respectable, was not up to par with the current generation of chip architectures. The result? Certain use cases simply were not a good fit for EC2 and were easily beaten by in-house clusters in benchmarking that we conducted within the course of our research. All of that changed when Amazon released their Cluster Compute offering.

 

The author goes on to run the Scalable HeterOgeneous Computing (SHOC) benchmark suite, compares EC2 with native performance, and concludes:

 

With this new AWS offering, the line between internal hardware and virtualized, cloud-based hardware for high performance computing using GPUs has indeed been blurred.

 

Finally a run with a Cycle Computing customer workload:

Based on the positive results of our SHOC benchmarking, we approached a Fortune 500 Life Science and a Finance/Insurance clients who develop and use their own GPU-accelerated software, to run their applications on the GPU-enabled Cluster Compute nodes. For both applications, the applications perform a large number of Monte Carlo simulations for given set of initial data, all pleasantly parallel. The results, similar to the SHOC result, were that the EC2 GPU-enabled Cluster Compute nodes performed as well as, or better than, the in-house hardware maintained by our clients.

 

Even if you only have a second, give the results a scan: http://blog.cyclecomputing.com/2010/11/a-couple-more-nails-in-the-coffin-of-the-private-compute-cluster-gpu-on-cloud.html.

 

                                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Wednesday, November 17, 2010 6:15:32 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Monday, November 15, 2010

A year and a half ago, I did a blog post titled Heterogeneous Computing using GPGPUs and FPGAs. In that note I defined heterogeneous processing as the application of processors with different instruction set architectures (ISAs) under direct application programmer control and pointed out that this really isn’t all that new a concept. We have had multiple ISAs in systems for years: IBM mainframes had I/O processors (Channel I/O Processors) with a different ISA than the general CPUs, many client systems have dedicated graphics coprocessors, and floating point units used to be independent of the CPU instruction set before that functionality was pulled up onto the chip. The concept isn’t new.

 

What is fairly new is 1) the practicality of implementing high-use software kernels in hardware and 2) the availability of commodity priced parts capable of vector processing. Looking first at moving custom software kernels into hardware, Field Programmable Gate Arrays (FPGA) are now targeted by some specialized high level programming languages. You can now write in a subset of C++ and directly implement commonly used software kernels in hardware. This is still the very early days of this technology but some financial institutions have been experimenting with moving computationally expensive financial calculations into FPGAs to save power and cost. See Platform-Based Electronic System-Level (ESL) Synthesis for more on this.

 

The second major category of heterogeneous computing is much further evolved and is now beginning to hit the mainstream. Graphics Processing Units (GPUs) are essentially vector processing units in disguise. They originally evolved as graphics accelerators but it turns out a general purpose graphics processor can form the basis of an incredible Single Instruction Multiple Data (SIMD) processing engine. Commodity GPUs have most of the capabilities of the vector units in early supercomputers. What has held them back is that they have been somewhat difficult to program: the pace of innovation is high and each model of GPU has differences in architecture and programming model. It’s almost impossible to write code that will run directly across a wide variety of different devices. And the large memories in these graphics processors typically are not ECC protected. An occasional pixel or two wrong doesn’t really matter in graphical output but you really do need ECC memory for server-side computing.

 

Essentially we have commodity vector processing units that are hard to program and lack ECC. What to do? Add ECC memory and an abstraction layer that hides many of the device-to-device differences. With those two changes, we have amazingly powerful vector units available at commodity prices. One abstraction layer that is getting fairly broad pickup is the Compute Unified Device Architecture (CUDA) developed by NVIDIA. There are now CUDA runtime support libraries and bindings for many programming languages including C, FORTRAN, Python, Java, Ruby, and Perl.
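As a concrete illustration of what the abstraction layer buys you, here is a minimal sketch using the third-party PyCUDA bindings. The choice of PyCUDA and the specific calls are my assumption for illustration only; any of the CUDA language bindings mentioned above would serve the same purpose.

# Elementwise vector math on the GPU via PyCUDA, with no device-specific code.
import numpy as np
import pycuda.autoinit              # creates a CUDA context on the default device
import pycuda.gpuarray as gpuarray

a = np.random.randn(1000000).astype(np.float32)
b = np.random.randn(1000000).astype(np.float32)

a_gpu = gpuarray.to_gpu(a)          # copy host arrays to GPU memory
b_gpu = gpuarray.to_gpu(b)
c_gpu = a_gpu * b_gpu + 2.0         # SIMD-style elementwise work runs on the GPU

c = c_gpu.get()                     # copy the result back to host memory

The same program runs unchanged as device generations change, which is exactly the portability point above.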

 

Current generation GPUs are amazingly capable devices. I’ve covered the speeds and feeds of a couple in past postings: the ATI RV770 and the NVIDIA GT200.

 

Bringing it all together, we have commodity vector units with ECC and an abstraction layer that makes it easier to program them and allows programs to run unchanged as devices are upgraded. Using GPUs to host general compute kernels is generically referred to as General Purpose Computation on Graphics Processing Units (GPGPU). So what is missing at this point? The pay-as-you-go economics of cloud computing.

 

You may recall I was excited last July when Amazon Web Services announced the Cluster Compute Instance: High Performance Computing Hits the Cloud. The EC2 Cluster Compute Instance is capable but lacks a GPU:

·         23GB memory with ECC

·         64GB/s main memory bandwidth

·         2 x Intel Xeon X5570 (quad-core Nehalem)

·         2 x 845GB HDDs

·         10Gbps Ethernet Network Interface

 

What Amazon Web Services just announced is a new instance type with the same core instance specifications as the cluster compute instance above, the same high-performance network, but with the addition of two NVIDIA Tesla M2050 GPUs in each server. See supercomputing at 1/10th the Cost. Each of these GPGPUs is capable of over 512 GFLOPs and so, with two of these units per server, there is a booming teraFLOP per node.

 

Each server in the cluster is equipped with a 10Gbps network interface card connected to a constant bisection bandwidth networking fabric. Any node can communicate with any other node at full line rate. It’s a thing of beauty and a forerunner of what every network should look like.

 

There are two full GPUs in each Cluster GPU instance each of which dissipates 225W TDP. This felt high to me when I first saw it but, looking at work done per watt, it’s actually incredibly good for workloads amenable to vector processing.  The key to the power efficiency is the performance. At over 10x the performance of a quad core x86, the package is both power efficient and cost efficient.
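Working the performance-per-watt arithmetic with some assumed peak numbers (roughly 515 double-precision GFLOPS at 225W for an M2050 and roughly 47 GFLOPS at 95W for a quad-core X5570 are my assumptions; real application throughput will be lower on both sides):

# Back-of-envelope GFLOPS/W comparison using assumed peak numbers.
gpu_gflops, gpu_watts = 515.0, 225.0      # Tesla M2050, double-precision peak (assumed)
cpu_gflops, cpu_watts = 47.0, 95.0        # quad-core Xeon X5570 peak (assumed)

print(gpu_gflops / gpu_watts)             # ~2.3 GFLOPS/W
print(cpu_gflops / cpu_watts)             # ~0.5 GFLOPS/W
print(gpu_gflops / cpu_gflops)            # ~11x raw throughput -- the "over 10x" above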

 

The new cg1.4xlarge EC2 instance type:

·         2 x  NVIDIA Tesla M2050 GPUs

·         22GB memory with ECC

·         2 x Intel Xeon X5570 (quad-core Nehalem)

·         2 x 845GB HDDs

·         10Gbps Ethernet Network Interface

 

With this most recent announcement, AWS now has dual quad-core servers, each with dual GPGPUs, connected by a 10Gbps full-bisection-bandwidth network for $2.10 per node hour. That’s $2.10 per teraFLOP hour. Wow.

·         The Amazon Cluster GPU Instance type announcement: Announcing Cluster GPU Instances for EC2

·         More information on the EC2 Cluster GPU instances: http://aws.amazon.com/ec2/hpc-applications/

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

 

Monday, November 15, 2010 4:57:46 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Sunday, November 14, 2010

The Top 500 Supercomputer Sites list was just updated and the AWS Compute Cluster is now officially in the top ½ of the list. That means when you line up all the fastest, multi-million-dollar, government-lab-sponsored supercomputers from #1 through to #500, the AWS Cluster Compute Instance is at #231.

 

·         Rank: 231

·         Site: Amazon Web Services, United States

·         System: Amazon EC2 Cluster Compute Instances – Amazon EC2 Cluster, Xeon X5570 2.95 GHz, 10G Ethernet / 2010 (Self-made)

·         Cores: 7,040

·         Rmax: 41.82 TFlop/s

·         Rpeak: 82.51 TFlop/s
One of the fastest supercomputers in the world for $1.60/node hour. Cloud computing changes the economics in a pretty fundamental way.
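Working out what renting this entry actually costs (assuming 8 cores per Cluster Compute node, consistent with the dual quad-core spec):

# Hourly cost of the #231 Top500 entry at the quoted $1.60/node-hour.
cores = 7040
cores_per_node = 8                   # dual quad-core Xeon X5570 per instance
nodes = cores // cores_per_node      # 880 instances
print(nodes * 1.60)                  # ~$1,408/hour for the whole cluster
print(41.82 / 82.51)                 # ~51% Linpack efficiency (Rmax / Rpeak)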

More details at: High Performance Computing Hits the Cloud.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

Sunday, November 14, 2010 11:21:19 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Friday, November 12, 2010

Kushagra Vaid presented Datacenter Power Efficiency at HotPower ’10. The URL to the slides is below and my notes follow. Interesting across the board but most notable for all the detail on the four major Microsoft datacenters:

 

·         Services speeds and feeds:

o   Windows LiveID: More than 1B authentications/day

o   Exchange Hosted Services: 2 to 4B email messages per day

o   MSN: 550M unique visitors monthly

o   Hotmail: 400M active users

o   Messenger: 320M active users

o   Bing: >2B queries per month

·         Microsoft Datacenters: 141MW total, 2.25M sq ft

o   Quincy:

§  27MW

§  500k sq ft total

§  54W/sq ft

o   Chicago:

§  60MW

§  707k sq ft total

§  85W/sq ft

§  Over $500M invested

§  $8.30/W (assuming $500M cost)

§  3,400 tons of steel

§  190 miles of conduit

§  2,400 tons of copper

§  26,000 yards of concrete

§  7.5 miles of chilled water piping

§  1.5M man hours of labor

§  Two floors of IT equipment:

·         Lower floor: medium reliability container bay (standard ISO containers supplied by Dell)

·         Upper floor: high reliability traditional colo facility

§  Note the picture of the upper floor shows colo cages suggesting that Microsoft may not be using this facility for their internal workloads.

§  PUE: 1.2 to 1.5

§  3,000 construction related jobs

o   Dublin:

§  27MW

§  570k sq ft total

§  47W/sq ft

o   San Antonio:

§  27MW

§  477k sq ft total

§  57W/sq ft

·         The power densities range between 45 and 85W/sq ft, which is incredibly low for relatively new builds. I would have expected something in the 200W/sq ft to 225W/sq ft range and perhaps higher. I suspect the floor space numbers are gross floor space numbers including mechanical, electrical, and office space rather than raised floor (a quick check of these numbers follows this list).

·         Kushagra reports typical build costs range from $10 to $15/W

o   Based upon the Chicago data point, the Microsoft build costs are around $8/W, and perhaps a bit more at the other facilities in that Chicago isn’t fully redundant on the lower floor whereas the generator counts at the other facilities suggest they are.

·         Gen 4 design

o   Modular design, but the use of ISO standard containers has been discontinued. However, the modules are prefabricated and delivered by flatbed truck

o   No mechanical cooling

§  Very low water consumption

o   30% to 50% lower costs

o   PUE of 1.05 to 1.1

·         Analysis of AC vs DC power distribution concluding that DC is more efficient at low loads and AC at high loads

o   Overall, the best designs are within 1 to 2% of each other

·         Recommends higher datacenter temperatures

·         Key observation:

o   The best price/performance and power/performance is often found in lower-power processors

·         Kushagra found substantial power savings using C6 and showed server idle power can be dropped to 31% of full load power

o   Good servers typically run in the 45 to 50% range and poor designs can be as bad as 80%
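A quick check on the density and cost numbers above (my arithmetic; the square footage figures are the gross numbers from the slides):

# W/sq ft for each facility and the implied Chicago build cost per watt.
sites = {
    "Quincy":      (27e6, 500e3),
    "Chicago":     (60e6, 707e3),
    "Dublin":      (27e6, 570e3),
    "San Antonio": (27e6, 477e3),
}
for name, (watts, sqft) in sites.items():
    print(name, round(watts / sqft), "W/sq ft")   # 54, 85, 47, 57 -- matching the slides

print(500e6 / 60e6)   # ~$8.3/W for Chicago, assuming the ~$500M figure is all-in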

 

The slides: http://mvdirona.com/jrh/TalksAndPapers/KushagraVaidMicrosoftDataCenters.pdf

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Friday, November 12, 2010 12:02:55 PM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
Services
 Sunday, October 31, 2010

I did a talk earlier this week on the sea change currently taking place in datacenter networks. In Datacenter Networks are in my Way I start with an overview of where the costs are in a high-scale datacenter. With that backdrop, we note that networks are fairly low power consumers relative to total facility consumption and not even close to the dominant cost. Are they actually a problem? The rest of the talk argues that networks are in fact a huge problem across the board, including cost and power. Overall, networking gear lags behind the rest of the high-scale infrastructure world, blocks many key innovations, and is both a cost and a power problem when we look deeper.

 

The overall talk agenda:

·         Datacenter Economics

·         Is Net Gear Really the Problem?

·         Workload Placement Restrictions

·         Hierarchical & Over-Subscribed

·         Net Gear: SUV of the Data Center

·         Mainframe Business Model

·         Manually Configured & Fragile at Scale

·         New Architecture for Networking

 

In a classic network design, there is more bandwidth within a rack and more within an aggregation router than across the core. This is because the network is over-subscribed. Consequently, instances of a workload often need to be placed topologically near other instances of the workload, near storage, near app tiers, or on the same subnet. All these placement restrictions make the already over-constrained workload placement problem even more difficult. The result is that either the constraints are not met, which yields poor workload performance, or the constraints are met but overall server utilization is lower due to accepting these constraints. What we want is all points in the datacenter equidistant and no constraints on workload placement.
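To make the over-subscription point concrete, here is a small illustrative example. The per-rack port count and uplink speeds are my assumptions, not numbers from the talk.

# First-hop over-subscription at a hypothetical top-of-rack switch.
servers_per_rack = 40
server_nic_gbps = 1                    # each server connects at 1Gbps
uplink_count, uplink_gbps = 2, 10      # 2 x 10Gbps uplinks toward aggregation

offered = servers_per_rack * server_nic_gbps   # 40Gbps of potential demand
available = uplink_count * uplink_gbps         # 20Gbps northbound
print(offered / available)                     # 2:1 over-subscribed at the first hop

These ratios compound at each tier, so traffic crossing the core can get far less than its fair share, which is what drives the placement constraints above.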

 

Continuing on the over-subscription problem mentioned above, data-intensive workloads like MapReduce and high performance computing workloads run poorly on oversubscribed networks. It’s not at all uncommon for a MapReduce workload to transport the entire data set at least once over the network during a job run. The costs of providing a flat, all-points-equidistant network are so high that most just accept the constraint and either run MapReduce poorly or run it only in narrow parts of the network (accepting workload placement constraints).

 

Net gear doesn’t consume much power relative to total datacenter power consumption – other gear in the data center is, in aggregate, much worse. However, network equipment power is absolutely massive today and it is trending up fast. A fully configured Cisco Nexus 7000 requires 8 circuits of 5kW each. Admittedly some of that power is for redundancy, but how can 120 ports possibly require as much power provisioned as 4 average-sized full racks of servers? Net gear is the SUV of the datacenter.
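The arithmetic behind that comparison (the ~10kW average rack is my assumption, consistent with the four-rack comparison above; the 8 circuits of 5kW are from the talk):

# Provisioned power for a fully configured Nexus 7000 versus server racks.
provisioned_kw = 8 * 5                 # 8 circuits of 5kW each
ports = 120
print(provisioned_kw * 1000 / ports)   # ~333W provisioned per port
print(provisioned_kw / 10)             # equivalent to ~4 racks at ~10kW each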

 

The network equipment business model is broken. We love the server business model where we have competition at the CPU level, more competition at the server level, and an open source solution for control software.  In the networking world, it’s a vertically integrated stack and this slows innovation and artificially holds margins high. It’s a mainframe business model.

New solutions are now possible with competing merchant silicon from Broadcom, Marvell, and Fulcrum, and competing switch designs built on all three. We don’t yet have the open source software stack but there are some interesting possibilities on the near-term horizon, with OpenFlow being perhaps the most interesting enabler. More on the business model and why I’m interested in OpenFlow at: Networking: The Last Bastion of Mainframe Computing.

 

Talk slides: http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_POA20101026_External.pdf

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, October 31, 2010 11:30:28 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Hardware | Services
 Thursday, October 21, 2010

What happens when you really, really focus on efficient infrastructure and driving down costs while delivering a highly available, high performance service? Well, it works. Costs really do fall and the savings can be passed on to customers.

 

AWS prices have been falling for years but this is different. It’s now possible to offer an EC2 Micro Instance for free. You can now have 750 hours of EC2 usage per month without charge.

 

The details:

 

·         750 hours of Amazon EC2 Linux Micro Instance usage (613 MB of memory and 32-bit and 64-bit platform support) – enough hours to run continuously each month*

·         750 hours of an Elastic Load Balancer plus 15 GB data processing*

·         10 GB of Amazon Elastic Block Storage, plus 1 million I/Os, 1 GB of snapshot storage, 10,000 snapshot Get Requests and 1,000 snapshot Put Requests*

·         5 GB of Amazon S3 storage, 20,000 Get Requests, and 2,000 Put Requests*

·         30 GB per month of internet data transfer (15 GB of data transfer “in” and 15 GB of data transfer “out” across all services except Amazon CloudFront)*

·         25 Amazon SimpleDB Machine Hours and 1 GB of Storage**

·         100,000 Requests of Amazon Simple Queue Service**

·         100,000 Requests, 100,000 HTTP notifications and 1,000 email notifications for Amazon Simple Notification Service**

 

More information: http://aws.amazon.com/free/

 

                                      --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

 

Thursday, October 21, 2010 4:35:44 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Services
 Tuesday, October 19, 2010

Long-time Amazon Web Services alum Jeff Barr has written a book on AWS. Jeff’s been with AWS since the very early days and he knows the services well. The new book, Host Your Web Site in the Cloud: Amazon Web Services Made Easy, covers each of the major AWS services and how to write code against them, with code examples in PHP. It covers S3, EC2, SQS, EC2 Monitoring, Auto Scaling, Elastic Load Balancing, and SimpleDB.

 

 Recommended if you are interested in Cloud Computing and AWS: http://www.amazon.com/Host-Your-Web-Site-Cloud/dp/0980576830.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Tuesday, October 19, 2010 7:21:29 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.
