Tuesday, October 25, 2011

One of the talks that I particularly enjoyed yesterday at HPTS 2011 was Storage Infrastructure Behind Facebook Messages by Kannan Muthukkaruppan. In this talk, Kannan described the Facebook store for chats, email, SMS, and messages.


This high scale storage system is based upon HBase and Haystack. HBase is a non-relational, distributed database very similar to Google’s Bigtable. Haystack is a simple file system designed by Facebook for efficient photo storage and delivery. More on Haystack at: Facebook Needle in a Haystack.


In this Facebook Message store, Haystack is used to store attachments and large messages.  HBase is used for message metadata, search indexes, and small messages (avoiding the second I/O to Haystack for small messages like most SMS).
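The split described above can be illustrated with a small sketch. This is not Facebook's code: the two dictionaries stand in for HBase and Haystack, and the `SMALL_MESSAGE_LIMIT` threshold and the `MessageStore` class are hypothetical names chosen only to show why keeping small bodies inline saves the second I/O.

```python
from dataclasses import dataclass, field

# Hypothetical cut-off; the talk notes do not give the real threshold.
SMALL_MESSAGE_LIMIT = 1024  # bytes


@dataclass
class MessageStore:
    hbase: dict = field(default_factory=dict)     # stand-in for HBase rows
    haystack: dict = field(default_factory=dict)  # stand-in for Haystack blobs

    def put(self, msg_id: str, body: bytes) -> None:
        if len(body) <= SMALL_MESSAGE_LIMIT:
            # Small message: metadata and body live together in one row.
            self.hbase[msg_id] = {"size": len(body), "body": body}
        else:
            # Large message: body goes to the blob store, metadata stays in the row.
            self.haystack[msg_id] = body
            self.hbase[msg_id] = {"size": len(body), "body": None}

    def get(self, msg_id: str):
        """Return (body, number_of_store_reads)."""
        row = self.hbase[msg_id]
        if row["body"] is not None:
            return row["body"], 1
        return self.haystack[msg_id], 2


store = MessageStore()
store.put("sms-1", b"running late, see you at 6")
store.put("mail-1", b"x" * 5000)
```

Reading `sms-1` touches only the metadata store, while `mail-1` needs a second read against the blob store.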


Facebook Messages takes 6B+ messages a day. Summarizing HBase traffic:

·         75B+ R+W ops/day with 1.5M ops/sec at peak

·         The average write operation inserts 16 records across multiple column families

·         2PB+ of cooked online data in HBase. Over 6PB including replication but not backups

·         All data is LZO compressed

·         Growing at 250TB/month
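A quick back-of-envelope calculation helps put these figures in context. The inputs below are the numbers from the list above treated as exact values (they are all stated as lower bounds), so the results are rough estimates only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

ops_per_day = 75e9          # 75B+ read and write ops per day
peak_ops_per_sec = 1.5e6    # 1.5M ops/sec at peak
messages_per_day = 6e9      # 6B+ messages per day
cooked_data_tb = 2000       # 2PB+ of cooked online data
growth_tb_per_month = 250   # 250TB/month

# Average rate implied by the daily total.
avg_ops_per_sec = ops_per_day / SECONDS_PER_DAY

# How much higher the peak is than the daily average.
peak_to_average = peak_ops_per_sec / avg_ops_per_sec

# HBase operations performed per message handled.
ops_per_message = ops_per_day / messages_per_day

# Monthly growth relative to the current cooked data size.
monthly_growth_pct = growth_tb_per_month / cooked_data_tb * 100

print(f"average ops/sec:  {avg_ops_per_sec:,.0f}")
print(f"peak/average:     {peak_to_average:.2f}")
print(f"ops per message:  {ops_per_message:.1f}")
print(f"monthly growth:   {monthly_growth_pct:.1f}%")
```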


The Facebook Messages project timeline:

·         2009/12: Project started

·         2010/11: Initial rollout began

·         2011/07: Rollout completed with 1B+ accounts migrated to new store

·         Production changes:

o   2 schema changes

o   Upgraded to Hfile 2.0


They implemented a very nice approach to testing where, prior to release, they shadowed the production workload to the test servers.

After going into production, they continued the practice of shadowing the real production workload into the test cluster so that changes could be tested before being deployed to production.


The list of scares and scars from Kannan:

·         Not without our share of scares and incidents:

o   s/w bugs. (e.g., deadlocks, incompatible LZO used for bulk imported data, etc.)

§  found an edge case bug in log recovery as recently as last week!

·         performance spikes every 6 hours (even off-peak)!

o   cleanup of HDFS’s Recycle bin was sub-optimal! Needed code and config fix.

·         transient rack switch failures

·         Zookeeper leader election took more than 10 minutes when one member of the quorum died. Fixed in more recent version of ZK.

·         HDFS Namenode – SPOF

·         flapping servers (repeated failures)

·         Sometimes, tried things which hadn’t been tested in dark launch!

o   Added a rack of servers to help with performance issue

§  Pegged top of the rack network bandwidth!

§  Had to add the servers at much slower pace. Very manual.

§  Intelligent load balancing needed to make this more automated.

·         A high % of issues caught in shadow/stress testing

·         Lots of alerting mechanisms in place to detect failure cases

o   Automate recovery for a lot of common ones

o   Treat alerts on shadow cluster as hi-pri too!

·         Sharding service across multiple HBase cells also paid off


Kannan’s slides are posted at: http://mvdirona.com/jrh/TalksAndPapers/KannanMuthukkaruppan_StorageInfraBehindMessages.pdf




James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com


Tuesday, October 25, 2011 1:03:10 PM (Pacific Standard Time, UTC-08:00)

Rough notes from a talk at HPTS 2011 on COSMOS, Microsoft’s internal MapReduce system. This is the service Microsoft uses internally to run MapReduce jobs. Interestingly, Microsoft plans to use Hadoop in the external Azure service even though COSMOS looks quite good: Microsoft Announces Open Source Based Cloud Service. Rough notes below:


Talk: COSMOS: Big Data and Big Challenges

Speaker: Ed Harris

·         Petabyte storage and computation systems

·         Used primarily by search and advertising inside Microsoft

·         Operated as a service with just over 4 9s of availability

·         Massively parallel processing based upon Dryad

o   Dryad is very similar to MapReduce

·         Use SCOPE (Structured Computations Optimized for Parallel Execution) over Dryad

o   A SQL-like language with an optimizer implemented over Dryad

·         They run hundreds of virtual clusters. In this model, internal Microsoft teams buy servers and give them to COSMOS and are subsequently assured at least these resources

o   Average 85% CPU over the cluster

·         Ingest 1 to 2 PB/day

·         Roughly 30% of the Search fleet is running COSMOS

·         Architecture:

o   Store Layer

§  Many extent nodes store and compress streams

§  Streams are sequences of extents

§  CSM: Cosmos Store Layer handles names, streams, and replication

·         First level compression is light. Data kept for more than a week is then compressed more aggressively, on the assumption that data that lives a week will likely live longer

o   Execution Layer:

§  Jobs queue up on virtual clusters and are then executed

o   SCOPE Layer

§  Compiler and optimizer for SCOPE

§  Ed said that the optimizer is a branch of the SQL Server optimizer

·         They have 60+ PhD internships each year and hire ~30 a year
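The store-layer compression policy in the notes above (light compression first, heavier compression once data has survived a week) can be sketched roughly as follows. This is only an illustration: COSMOS's actual codecs were not described in the talk notes, so `zlib` and the two compression levels are assumptions, and `compression_level` and `compress_extent` are hypothetical helper names.

```python
import zlib
from datetime import timedelta

RECOMPRESS_AGE = timedelta(days=7)  # "kept more than a week" from the notes
LIGHT_LEVEL = 1                     # assumed: fast, light compression on ingest
HEAVY_LEVEL = 9                     # assumed: slower, tighter compression later


def compression_level(age: timedelta) -> int:
    """Pick a compression level based on how long the data has lived."""
    return HEAVY_LEVEL if age > RECOMPRESS_AGE else LIGHT_LEVEL


def compress_extent(raw: bytes, age: timedelta) -> bytes:
    """Compress one extent with the level appropriate for its age."""
    return zlib.compress(raw, compression_level(age))


sample = b"query=weather seattle;clicks=3;" * 1000
fresh = compress_extent(sample, timedelta(days=1))
aged = compress_extent(sample, timedelta(days=30))
```

The design bet is that CPU spent on heavy compression is only worthwhile for data that has already shown it will stick around.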




Tuesday, October 25, 2011 8:37:20 AM (Pacific Standard Time, UTC-08:00)
 Sunday, October 23, 2011

From the Last Bastion of Mainframe Computing Perspectives post:


The networking equipment world looks just like the mainframe computing ecosystem did 40 years ago. A small number of players produce vertically integrated solutions where the ASICs (the central processing unit responsible for high speed data packet switching), the hardware design, the hardware manufacture, and the entire software stack are single sourced and vertically integrated.  Just as you couldn’t run IBM MVS on a Burroughs computer, you can’t run Cisco IOS on Juniper equipment. When networking gear is purchased, it’s packaged as a single sourced, vertically integrated stack. In contrast, in the commodity server world, starting at the most basic component, CPUs are multi-sourced. We can get CPUs from AMD and Intel. Compatible servers built from either Intel or AMD CPUs are available from HP, Dell, IBM, SGI, ZT Systems, Silicon Mechanics, and many others.  Any of these servers can support both proprietary and open source operating systems. The commodity server world is open and multi-sourced at every layer in the stack.


Last week the Open Network Summit was hosted at Stanford University.  This conference focused on Software Defined Networks in general and OpenFlow specifically. Software defined networking separates out the router control plane, responsible for what is in the routing table, from the data plane that makes network packet routing decisions on the basis of what is actually in the routing table.  Historically, both operations have been implemented monolithically in each router. SDN separates these functions, allowing networking equipment to compete on how efficiently it routes packets on the basis of instructions from a separate SDN control plane.
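The control plane and data plane split can be shown with a toy model. This is not the OpenFlow protocol or any vendor API: the `Controller` and `Switch` classes, the port numbers, and the prefixes are all made up for illustration. The point is only that the switch does nothing but look up a table it was handed, while a separate component decides what the table contains.

```python
import ipaddress


class Switch:
    """Data plane: forwards using whatever table the controller installed."""

    def __init__(self, name):
        self.name = name
        self.table = []  # list of (network, out_port), longest prefix first

    def install(self, rules):
        # Keep the most specific prefixes first so the first match wins.
        self.table = sorted(rules, key=lambda r: r[0].prefixlen, reverse=True)

    def forward(self, dst):
        addr = ipaddress.ip_address(dst)
        for network, port in self.table:
            if addr in network:
                return port
        return None  # no matching rule: drop


class Controller:
    """Control plane: decides what belongs in each switch's table."""

    def __init__(self):
        self.policy = {}  # switch name -> {prefix string: out_port}

    def set_route(self, switch_name, prefix, port):
        self.policy.setdefault(switch_name, {})[prefix] = port

    def push(self, switch):
        routes = self.policy.get(switch.name, {})
        switch.install(
            [(ipaddress.ip_network(prefix), port) for prefix, port in routes.items()]
        )


controller = Controller()
tor = Switch("tor-1")
controller.set_route("tor-1", "10.0.0.0/8", 1)
controller.set_route("tor-1", "10.1.0.0/16", 2)
controller.push(tor)
```

Because the switch only exposes `install` and `forward`, any hardware that implements those two operations efficiently could sit under the same controller, which is the supply chain argument made above.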


In the words of OpenFlow founder Nick McKeown, Software Defined Networks (SDN) will: 1) empower network owners/operators, 2) increase the pace of network innovation, 3) diversify the supply chain, and 4) build a robust foundation for future networking innovation.


This conference was a bit of a coming of age for software defined networking for a couple of reasons. First, an excellent measure of relevance is who showed up to speak at the conference. From academia, attendees included Scott Shenker (Berkeley), Nick McKeown (Stanford), and Jennifer Rexford (Princeton).  From industry most major networking companies were represented by senior attendees including Dave Ward (Juniper), Dave Meyer (Cisco), Ken Duda (Arista), Mallik Tatipamula (Ericsson), Geng Lin (Dell), Samrat Ganguly (NEC),  and Charles Clark (HP). And some of the speakers from major networking user companies included: Stephen Stuart (Google), Albert Greenberg (Microsoft), Stuart Elby (Verizon), Rainer Weidmann (Deutsche Telekom), and Igor Gashinsky (Yahoo!). The full speaker list is up at: http://opennetsummit.org/speakers.html.


The second data point in support of SDN really coming of age was Dave Meyer, Cisco Distinguished Engineer, saying during his talk that Cisco was “doing OpenFlow”. I’ve always joked that Cisco would rather go bankrupt than support OpenFlow so this one definitely caught my interest. Since I wasn’t in attendance myself during Dave’s talk, I checked in with him personally. He clarified that it wasn’t a product announcement. They have OpenFlow running on Cisco gear but “no product plans have been announced at this time”. Still, it’s exciting progress and hats off to Cisco for taking the first step. Good to see.


If you want a good summary of what Software Defined Networking is, perhaps the best description is the set of slides that Nick presented at the conference: http://mvdirona.com/jrh/TalksAndPapers/NickMckeown_ON%20Summit%20NickM%2010%202011.pdf.


If you are interested in what Cisco’s Dave Meyer presented at the summit, I’ve posted his slides here: http://mvdirona.com/jrh/TalksAndPapers/DavidMeyer_openflow_and_sdn_for_enterprises.pdf.


Other related postings I’ve made:

·         Datacenter Networks are in my Way

·         Stanford Clean Slate CTO Summit

·         Changes in Networking Systems

·         Software Load Balancing Using Software Defined Networking


Congratulations to the Stanford team for hosting a great conference and for helping to drive software defined networking from a great academic idea to what is rapidly becoming a supported option industry-wide.






Sunday, October 23, 2011 7:57:07 AM (Pacific Standard Time, UTC-08:00)
 Thursday, October 20, 2011

Last night EMC Chief Executive Joe Tucci laid out his view of where the information processing world is going over the next decade and where EMC will focus.  His primary point was cloud computing is the future and big data is the killer app for the cloud. He laid out the history of big transitions in our industry and argued the big discontinuities were always driven by a killer application. He sees the cloud as the next big and important transition for our industry.


This talk was presented as part of the University of Washington Distinguished Lecturer Series. With six TV cameras covering the action, there were nearly as many cameras as at some University of Washington Huskies games, and the talk was well attended. The next talk in the series will be Bill Gates on October 27 presenting The Opportunity Ahead: A Conversation with Bill Gates. I’ll be presenting Internet Scale Storage on November 1st.


If you are interested in any of the talks in the series, all are open to the public and the upcoming schedule is posted at: http://www.cs.washington.edu/news/newdlshome.html.


The most notable statistic from the Joe Tucci talk was the massive investment that EMC is making in mergers and acquisitions. He said over the next 5 years, EMC will spend $10.5B in R&D – this number alone is amazingly large -- but what I found really startling was they expect to spend even more purchasing companies. They expect to spend $14.0B on M&A during this same period. That’s nearly $3B/year from just a single company. Amazing.


With many large companies increasingly looking to the startup community for new ideas and innovation, there is incredible opportunity for startups.  Joe emphasized the opportunity, saying that Washington in general and especially the University of Washington will likely be the source of many of these new companies. As large companies lean more on the startup community for new ideas, products, and services, it’s a good time to be starting a company.


My rough notes from the talk:


·         IDC reports:

o   This decade WW information content will grow 44x (0.9 zettabytes to 35.2)

o   90% unstructured

·         Big data has arrived

o   Mobile sensors

o   Social media

o   Video surveillance

o   Smart grids

o   Gene sequencing

o   Medical imaging

o   Geophysical exploration

·         73% maintaining existing infrastructure (true for 10 years)

o   JRH: I’ve heard this statistic before but it seems like it nearly has to be the case that most companies are spending at least 3/4 of their investment on continuing to run the business and around ¼ on new applications. The statistic is usually presented as a problem but it feels like it might be close to the right ratio.

·         3D movie is about a petabyte with all camera angles and footage included

·         The average company is attacked 300 times per week

o   All CIOs say this is way light – my home router gets nailed that many times in a good hour

·         IT staffing will increase less than 50% in next 10 years but the data under management will grow faster.

o   JRH: Again this seems like the desirable outcome where the data under management should be able to grow far faster than the administrative team

·         EMC’s Mission: To lead customers towards a hybrid cloud

o   Leading customers to x86 based private clouds and hybrid clouds

o   Burst, test & development, etc. into the public cloud

o   Hybrid cloud between private and public is the “big winner”

·         VM is basically a cloud operating system

o   EMC still owns 80% of VMware

o   There are now more virtual machines shipped than physical machines

o   62% virtualized out of the gate

·         Applications like SAP, Oracle, and Microsoft are now available in the cloud

·         Killer app for the cloud is big data

o   Real time data analytics

·         New end user computing

o   iOS devices, Android, Windows, …

·         Tenets of cloud computing

o   Efficiency, control, choice => Agility

o   Control through policy, service levels, and cost

·         Big competitors

o   IBM, HP, Cisco, Microsoft, …

o   EMC is big at $20B but not close to as big as these incumbents

o   JRH: I’ve never thought of EMC as the small, nimble competitor but I guess it’s all relative

·         Recent acquisitions in drive to cloud & big data

o   Isilon

o   Greenplum

o   Data Domain

o   RSA

·         Mammoth 5 year M&A plan: roughly ½ of investments in R&D and ½ in M&A

o   14.0B M&A

o   10.5B: R&D

·         EMC has 14,000 sales people so there is huge potential synergy in any acquisition

o   Adding a 14,000 person sales team to any reasonable product is going to produce considerable new revenue quickly

·         EMC is now 152 in the Fortune 500

o   Revenue is $17B

o   Free cash flow: $3.4B


Thanks to Ed Lazowska for hosting this talk and many others in the University of Washington Distinguished Lecturer Series.




Thursday, October 20, 2011 6:24:29 AM (Pacific Standard Time, UTC-08:00)
 Thursday, October 13, 2011

We see press releases go by all the time and most of them deserve the yawn they get. But, one caught my interest yesterday. At the PASS Summit conference Microsoft Vice President Ted Kummert announced that Microsoft will be offering a big data solution based upon Hadoop as part of SQL Azure. From the Microsoft press release, “Kummert also announced new investments to help customers manage big data, including an Apache Hadoop-based distribution for Windows Server and Windows Azure and a strategic partnership with Hortonworks Inc.”


Clearly this is a major win for the early-stage startup Hortonworks. Hortonworks is a spin out of Yahoo! and includes many of the core contributors to the Apache Hadoop distribution: Hortonworks Taking Hadoop to Next Level.


This announcement is also a big win for the MapReduce processing model, which was first invented at Google and published in MapReduce: Simplified Data Processing on Large Clusters. The Apache Hadoop distribution is an open source implementation of MapReduce. Hadoop is incredibly widely used with Yahoo! running more than 40,000 nodes of Hadoop with their biggest single cluster now at 4,500 servers. Facebook runs a 1,100 node cluster and a second 300 node cluster. LinkedIn runs many clusters including deployments of 1,200, 580, and 120 nodes. See the Hadoop Powered By Page for many more examples.
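For readers new to the model, here is a minimal single-process sketch of the map, shuffle, and reduce steps using the classic word count example. It is plain Python, not the Hadoop API, and the function names are chosen only for this illustration. A real system runs the map and reduce functions in parallel across many machines and handles the shuffle, scheduling, and failure recovery for you.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data on big clusters", "data loves clusters"])
```

The appeal of the model is that the programmer writes only `map_phase` and `reduce_phase`; everything between them is the framework's job.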


In the cloud, AWS began offering Elastic MapReduce back in early 2009 and has been steadily expanding the features supported by this offering over the last couple of years, adding support for Reserved Instances, Spot Instances, and Cluster Compute instances (on a 10Gb non-oversubscribed network – MapReduce just loves high bandwidth inter-node connectivity) and support for more regions, with EMR available in Northern Virginia, Northern California, Ireland, Singapore, and Tokyo.


Microsoft expects to have a pre-production (what they refer to as a "Community Technology Preview") version of a Hadoop service available by the “end of 2011”.  This is interesting for a variety of reasons. First, it’s more evidence of the broad acceptance and applicability of the MapReduce model.  What is even more surprising is that Microsoft has decided in this case to base their MapReduce offering upon open source Hadoop rather than the Microsoft internally developed MapReduce service called Cosmos, which is used heavily by the Bing search and advertising teams. The What is Dryad blog entry provides a good description of Cosmos and some of the infrastructure built upon the Cosmos core including Dryad, DryadLINQ, and SCOPE.


As surprising as it is to see Microsoft planning to offer MapReduce based upon open source rather than upon the internally developed and heavily used Cosmos platform, it’s even more surprising that they hope to contribute changes back to the open source community saying “Microsoft will work closely with the Hadoop community and propose contributions back to the Apache Software Foundation and the Hadoop project.”  


·         Microsoft Press Release: Microsoft Expands Data Platform

·         Hortonworks Press Release: Hortonworks to Extend Apache Hadoop to Windows Users

·         Hortonworks Blog Entry: Bringing Apache Hadoop to Windows


Past MapReduce postings on Perspectives:

·         MapReduce in CACM

·         MapReduce: A Minor Step Forward

·         Hadoop Summit 2010

·         Hadoop Summit 2008

·         Hadoop Wins TeraSort

·         Google MapReduce Wins TeraSort

·         HadoopDB: MapReduce over Relational Data

·         Hortonworks Taking Hadoop to Next Level




Thursday, October 13, 2011 7:08:10 AM (Pacific Standard Time, UTC-08:00)
 Wednesday, October 05, 2011

Earlier today we lost one of the giants of technology. Steve Jobs was one of the most creative, demanding, brilliant, hard-driving, and innovative leaders in the entire industry. He created new business areas, introduced new business models, brought companies back from the dead, and fundamentally changed how the world as a whole interacts with computers. He was a visionary of staggering proportions with an unusual gift in his ability to communicate a vision and also the drive to seek perfection in the execution of his ideas. We lost a giant today.


From Apple: http://www.apple.com/pr/library/2011/10/05Statement-by-Apples-Board-of-Directors.html





Wednesday, October 05, 2011 5:39:29 PM (Pacific Standard Time, UTC-08:00)

 Saturday, October 01, 2011

I’ve been posting frequently on networking issues with the key point being the market is on the precipice of a massive change. There is a new model emerging.

·         Datacenter Networks are in my way

·         Networking: The Last Bastion of Mainframe Computing


We now have merchant silicon providers for the Application Specific Integrated Circuits (ASICs) that form the core of network switches and routers, including Broadcom, Fulcrum (recently purchased by Intel), Marvell, and Dune (purchased by Broadcom). We have many competing offerings for the control processor that supports the protocol stack including Freescale, ARM, and Intel. The ASIC providers build reference designs that get improved by many competing switch hardware providers including Dell, NEC, Quanta, Celestica, DNI, and many others. We have competition at all layers below the protocol stack. What’s needed is an open, broadly used, broadly invested networking stack. Credible options are out there with Quagga perhaps being the strongest contender thus far. Xorp is another that has many users. But, there still isn’t a protocol stack with the broad use and critical mass that has emerged in the server world with the wide variety of Linux distributions available.


Two recent additions to the community are 1) the Open Networking Foundation, and 2) the Open Source Routing Forum. More on each:

Open Networking Foundation:

Founded in 2011 by Deutsche Telekom, Facebook, Google, Microsoft, Verizon, and Yahoo!, the Open Networking Foundation (ONF) is a nonprofit organization whose goal is to rethink networking and quickly and collaboratively bring to market standards and solutions. ONF will accelerate the delivery and use of Software-Defined Networking (SDN) standards and foster a vibrant market of products, services, applications, customers, and users.


Open Source Routing Forum

OSR will establish a "platform" supporting committers and communities behind the open source routing protocols to help the release of a mainstream, and stable code base, beginning with Quagga, most popular routing code base. This "platform" will provide capabilities such as regression testing, performance/scale testing, bug analysis, and more. With a stable qualified routing code base and 24x7 support, service providers, academia, startup equipment vendors, and independent developers can accelerate existing projects like ALTO, Openflow, and software defined networks, and germinate new projects in service providers at a lower cost.


Want to be part of re-engineering datacenter networks at Amazon?

I need more help on a project I’m driving at Amazon where we continue to make big changes in our datacenter network to improve customer experience and drive down costs while, at the same time, deploying more gear into production each day than all of Amazon.com used back in 2000. It’s an exciting time and we have big changes happening in networking. If you enjoy and have experience in operating systems, networking protocol stacks, or embedded systems and you would like to work on one of the biggest networks in the world, send me your resume (james@amazon.com).






Saturday, October 01, 2011 8:08:59 AM (Pacific Standard Time, UTC-08:00)

 Tuesday, September 20, 2011

If you’ve read this blog in the past, you’ll know I view cloud computing as a game changer (Private Clouds are not the Future) and spot instances as a particularly powerful innovation within cloud computing. Over the years, I’ve enumerated many of the advantages of cloud computing over private infrastructure deployments. A particularly powerful cloud computing advantage is driven by noting that when combining a large number of non-correlated workloads, the overall infrastructure utilization is far higher for most workload combinations.  This is partly because the reserve capacity to ensure that all workloads are able to support peak workload demands is a tiny fraction of what is required to provide reserve surge capacity for each job individually.


This factor alone is a huge gain but an even bigger gain can be found by noting that all workloads are cyclic and go through sinusoidal capacity peaks and troughs. Some cycles are daily, some weekly, some hourly, and some on different cycles but nearly all workloads exhibit some normal expansion and contraction over time. This capacity pumping is in addition to handling unusual surge requirements or increasing demand discussed above.


To successfully run a workload, sufficient hardware must be provisioned to support the peak capacity requirement for that workload.  Cost is driven by peak requirements but monetization is driven by the average. The peak to average ratio gives a view into how efficiently the workload can be hosted.  Looking at an extreme, a tax preparation service has to provision enough capacity to support their busiest day and yet, in mid-summer, most of this hardware is largely unused. Tax preparation services have a very high peak to average ratio so, necessarily, utilization in a fleet dedicated to this single workload will be very low.


By hosting many diverse workloads in a cloud, the aggregate peak to average ratio trends towards flat. The overall efficiency to host the aggregate workload will be far higher than for any individual workload on private infrastructure.  In effect, the workload capacity peak to trough differences get smaller as the number of combined diverse workloads goes up.  Since cost tracks the provisioned capacity required at peak but monetization tracks the capacity actually being used, flattening this out can dramatically improve costs by increasing infrastructure utilization.
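A small simulation makes the flattening effect concrete. This is a deliberately idealized sketch: each workload is modeled as a pure daily sine wave with the same size and swing, differing only in phase, and all of the parameters (`base`, `swing`, the six phase offsets) are made-up values rather than measurements from any real fleet.

```python
import math


def workload(phase, hours=24, base=1.0, swing=0.8):
    """Hourly demand for one workload: a daily sine wave shifted by `phase` radians."""
    return [
        base + swing * math.sin(2 * math.pi * hour / hours + phase)
        for hour in range(hours)
    ]


def peak_to_average(series):
    """Capacity that must be provisioned (peak) relative to capacity used (average)."""
    return max(series) / (sum(series) / len(series))


def aggregate(workloads):
    """Total demand per hour when all workloads share one fleet."""
    return [sum(hourly) for hourly in zip(*workloads)]


single = workload(phase=0.0)
fleet = [workload(phase=float(offset)) for offset in range(6)]  # phases 0..5 radians
combined = aggregate(fleet)

print(f"one workload, peak/average:  {peak_to_average(single):.2f}")
print(f"six workloads, peak/average: {peak_to_average(combined):.2f}")
```

With these inputs the single workload needs noticeably more provisioned capacity than it uses on average, while the combined fleet comes out close to flat. Real workloads are correlated to some degree (many peak during business hours), so the real-world gain is smaller than this idealized case suggests.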


This is one of the most important advantages of cloud computing. But, it’s still not as much as can be done. Here’s the problem. Even with very large populations of diverse workloads, there is still some capacity that is only rarely used at peak. And, even in the limit with an infinitely large aggregated workload where the peak to average ratio gets very near flat, there still must be some reserved capacity such that surprise, unexpected capacity increases, new customers, or new applications can be satisfied.  We can minimize the pool of rarely used hardware but we can’t eliminate it.


What we have here is yet another cloud computing opportunity. Why not sell the unused reserve capacity on the spot market? This is exactly what AWS is doing with Amazon EC2 Spot Instances. From the Spot Instance detail page:


Spot Instances enable you to bid for unused Amazon EC2 capacity. Instances are charged the Spot Price set by Amazon EC2, which fluctuates periodically depending on the supply of and demand for Spot Instance capacity. To use Spot Instances, you place a Spot Instance request, specifying the instance type, the Availability Zone desired, the number of Spot Instances you want to run, and the maximum price you are willing to pay per instance hour. To determine how that maximum price compares to past Spot Prices, the Spot Price history for the past 90 days is available via the Amazon EC2 API and the AWS Management Console. If your maximum price bid exceeds the current Spot Price, your request is fulfilled and your instances will run until either you choose to terminate them or the Spot Price increases above your maximum price (whichever is sooner).

It’s important to note two points:

1.    You will often pay less per hour than your maximum bid price. The Spot Price is adjusted periodically as requests come in and available supply changes. Everyone pays that same Spot Price for that period regardless of whether their maximum bid price was higher. You will never pay more than your maximum bid price per hour.

2.    If you’re running Spot Instances and your maximum price no longer exceeds the current Spot Price, your instances will be terminated. This means that you will want to make sure that your workloads and applications are flexible enough to take advantage of this opportunistic capacity. It also means that if it’s important for you to run Spot Instances uninterrupted for a period of time, it’s advisable to submit a higher maximum bid price, especially since you often won’t pay that maximum bid price.


Spot Instances perform exactly like other Amazon EC2 instances while running, and like other Amazon EC2 instances, Spot Instances can be terminated when you no longer need them. If you terminate your instance, you will pay for any partial hour (as you do for On-Demand or Reserved Instances). However, if the Spot Price goes above your maximum price and your instance is terminated by Amazon EC2, you will not be charged for any partial hour of usage.
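The billing rules quoted above can be turned into a short simulation. This is a simplified sketch, not the EC2 API: `spot_bill` and `price_at` are hypothetical helpers, the price history is invented, and the sketch assumes each instance-hour is billed at the Spot Price in effect at the start of that hour and that a user-terminated partial hour is billed as a full hour.

```python
import math
from bisect import bisect_right


def price_at(changes, t):
    """Spot Price in effect at time t (hours), given sorted (time, price) change points."""
    times = [time for time, _ in changes]
    return changes[bisect_right(times, t) - 1][1]


def spot_bill(changes, max_bid, stop_at):
    """Return (charge, interrupted) for an instance the user plans to stop at `stop_at` hours."""
    # The instance is interrupted the first time the Spot Price rises above the bid.
    interrupted_at = next((t for t, price in changes if price > max_bid), None)

    if interrupted_at is not None and interrupted_at < stop_at:
        # Terminated by the provider: the partial hour is not charged.
        billable_hours = math.floor(interrupted_at)
        interrupted = True
    else:
        # Terminated by the user: the partial hour is charged (assumed as a full hour).
        billable_hours = math.ceil(stop_at)
        interrupted = False

    # Each billed hour costs the Spot Price at its start, never the (higher) bid.
    charge = sum(price_at(changes, hour) for hour in range(billable_hours))
    return round(charge, 4), interrupted


# Hypothetical price history: (hours since launch, price per instance hour).
history = [(0.0, 0.10), (1.5, 0.12), (3.25, 0.50)]

user_stopped = spot_bill(history, max_bid=0.20, stop_at=2.5)
provider_stopped = spot_bill(history, max_bid=0.20, stop_at=10)
```

In both runs the bid is $0.20 but no billed hour costs more than $0.12, which is the "you will often pay less than your maximum bid" point from the detail page.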

Spot instances effectively harvest unused infrastructure capacity. The servers, data center space, and network capacity are all sunk costs. Any workload worth more than the marginal cost of power is profitable to run. This is a great deal for customers because it allows non-urgent workloads to be run at very low cost.  Spot Instances are also great for the cloud provider because they further drive up utilization with the only additional cost being the cost of power consumed by the spot workloads. From Overall Data Center Costs, you’ll recall that the cost of power is a small portion of overall infrastructure expense.


I’m particularly excited about Spot instances because, while customers get incredible value, the feature is also a profitable one to offer.  It’s perhaps the purest win/win in cloud computing.


Spot Instances only work in a large market with many diverse customers. This is a lesson learned from the public financial markets. Without a broad number of buyers and sellers brought together, the market can’t operate efficiently. Spot requires a large customer base to operate effectively and, as the customer base grows, it continues to gain efficiency with increased scale.


I recently came across a blog posting that ties these ideas together: New CycleCloud HPC Cluster Is a Triple Threat: 30000 cores, $1279/Hour, & Grill monitoring GUI for Chef. What’s described in this blog posting is a mammoth computational cluster assembled in the AWS cloud. The speeds and feeds for the cluster:

·         C1.xlarge instances:           3,809

·         Cores:                                  30,472

·         Memory:                              26.7 TB


The workload was molecular modeling. The cluster was managed using the Condor job scheduler and deployment was automated using the increasingly popular Opscode Chef. Monitoring was done using a package that Cycle Computing wrote that provides a nice graphical interface to this large cluster: Grill for CycleServer (very nice).

The cluster came to life without capital planning, there was no wait for hardware arrival, and no datacenter space needed to be built or bought. The cluster ran 154,116 Condor jobs with 95,078 compute hours of work and, when the project was done, was torn down without a trace.


What is truly eye opening for me in this example is that it’s a 30,000 core cluster for $1,279/hour. The cloud and Spot instances change everything. $1,279/hour for 30k cores. Amazing.
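Breaking the headline number down per instance and per core shows just how low the unit price is. The inputs are the cluster figures quoted above; the derived values are simple division and ignore any storage or data transfer charges.

```python
instances = 3809          # c1.xlarge instances
cores = 30472
memory_tb = 26.7
cost_per_hour = 1279.0    # dollars per hour for the whole cluster

cores_per_instance = cores / instances
memory_gb_per_instance = memory_tb * 1000 / instances
cost_per_instance_hour = cost_per_hour / instances
cost_per_core_hour = cost_per_hour / cores

print(f"cores per instance:     {cores_per_instance:.0f}")
print(f"memory per instance:    {memory_gb_per_instance:.1f} GB")
print(f"cost per instance-hour: ${cost_per_instance_hour:.3f}")
print(f"cost per core-hour:     ${cost_per_core_hour:.4f}")
```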


Thanks to Deepak Singh for sending the CycleServer example my way.






Tuesday, September 20, 2011 5:44:17 AM (Pacific Standard Time, UTC-08:00)

 Tuesday, August 16, 2011

I got a chance to chat with Eric Baldeschwieler while he was visiting Seattle a couple of weeks back and catch up on what’s happening in the Hadoop world at Yahoo! and beyond. Eric recently started Hortonworks whose tag line is “architecting the future of big data.” I’ve known Eric for years, going back to when he led the Hadoop team at Yahoo!, most recently as VP of Hadoop Engineering.  It was Eric’s team at Yahoo! that contributed much of the code in Hadoop, Pig, and ZooKeeper.


Many of that same group form the core of Hortonworks whose mission is revolutionize and commoditize the storage and processing of big data via open source. Hortonworks continues to supply Hadoop engineering to Yahoo! And Yahoo! Is a key investor in Hortonworks along with Benchmark Capital. Hortonworks intends to continue to leverage the large Yahoo! development, test, and operations team.  Yahoo! has over 1,000 Hadoop users and are running Hadoop over many clusters the largest of which was 4,000 nodes back in 2010. Hortonworks will be providing level 3 support for Yahoo! Engineering.


From Eric slides at the 2011 Hadoop summit, Hortonworks objectives:

·         Make Apache Hadoop projects easier to install, manage & use

o   Regular sustaining releases

o   Compiled code for each project (e.g. RPMs)

o   Testing at scale

·         Make Apache Hadoop more robust

o   Performance gains

o   High availability

o   Administration & monitoring

·         Make Apache Hadoop easier to integrate & extend

o   Open APIs for extension & experimentation


Hortonworks Technology Roadmap:

·         Phase 1: Making Hadoop Accessible (2011)

o   Release the most stable Hadoop version ever

o   Release directly usable code via Apache (RPMs, debs,…)

o   Frequent sustaining releases off of the stable branches

·         Phase 2: Next Generation Apache Hadoop (2012)

o   Address key product gaps (HBase support, HA, Management, …)

o   Enable community and partner innovation via modular architecture & open APIs

o   Work with community to define integrated stack


Next generation Apache Hadoop:

·         Core

o   HDFS Federation

o   Next Gen MapReduce

o   New Write Pipeline (HBase support)

o   HA (no SPOF) and Wire compatibility

·         Data - HCatalog 0.3

o   Pig, Hive, MapReduce and Streaming as clients

o   HDFS and HBase as storage systems

o   Performance and storage improvements

·         Management & Ease of use

o   All components fully tested and deployable as a stack

o   Stack installation and centralized config management

o    REST and GUI for user tasks


Eric’s presentation from Hadoop Summit 2011 where he gave the keynote: Hortonworks: Architecting the Future of Big Data


Tuesday, August 16, 2011 4:49:00 PM (Pacific Standard Time, UTC-08:00)
 Monday, August 01, 2011

It’s a clear sign that the Cloud Computing market is growing fast and the number of cloud providers is expanding quickly when startups begin to target cloud providers as their primary market. It’s not unusual for enterprise software companies to target cloud providers as well as their conventional enterprise customers but I’m now starting to see startups building products aimed exclusively at cloud providers. Years ago when there were only a handful of cloud services, targeting this market made no sense. There just weren’t enough buyers to make it an interesting market. And, many of the larger cloud providers are heavily biased to internal development further reducing the addressable market size.


A sign of the maturing of the cloud computing market is that there are now many companies interested in offering a cloud computing platform, not all of which have substantial systems software teams. There is now a much larger number of companies to sell to and many are eager to purchase off the shelf products. Cloud providers have actually become a viable market to target in that there are many providers of all sizes and the overall market continues to expand faster than any I’ve seen over the last 25 years.


An excellent example of this new trend of startups aiming to sell to the Cloud Computing market is SolidFire, which targets the high performance block storage market with what can be loosely described as a distributed Storage Area Network. Enterprise SANs are typically expensive, single-box, proprietary hardware.  Enterprise SANs are mostly uninteresting to cloud providers due to high cost and the hard scaling limits that come from scale-up solutions. SolidFire implements a virtual SAN over a cluster of up to 100 nodes. Each node is a commodity 1RU, 10 drive storage server.  They are focused on the most demanding random IOPS workloads such as databases, and all 10 drives in the SolidFire node are Solid State Storage devices. The nodes are interconnected by up to 2x 1GigE and 2x 10GigE networking ports.


Each node can deliver a booming 50,000 IOPS and the largest supported cluster of 100 nodes can support 5M IOPS in aggregate. The 100 node cluster scaling limit may sound like a hard service scaling limit but multiple storage clusters can be used to scale to any level.  Needing multiple clusters has the disadvantage of possibly fragmenting the storage but the advantage of dividing the fleet up into sub-clusters with rigid fault containment between them, limiting the negative impact of software problems. Reducing the “blast radius” of a failure makes moderate sized sub-clusters a very good design point.
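The scaling arithmetic is simple enough to sketch. The per-node IOPS and the 100 node limit are the figures quoted above; the function names and the sizing rule (fill clusters to the node limit) are my own assumptions for illustration and not anything SolidFire publishes.

```python
import math

# Figures quoted above.
NODE_IOPS = 50_000
MAX_NODES_PER_CLUSTER = 100


def cluster_iops(nodes: int) -> int:
    """Aggregate IOPS of one cluster, assuming linear scaling per node."""
    if not 1 <= nodes <= MAX_NODES_PER_CLUSTER:
        raise ValueError(f"a cluster holds 1 to {MAX_NODES_PER_CLUSTER} nodes")
    return nodes * NODE_IOPS


def clusters_needed(target_iops: int) -> int:
    """Number of clusters needed to reach target_iops.

    Hypothetical sizing rule: compute the node count first, then
    pack nodes into clusters of at most MAX_NODES_PER_CLUSTER.
    """
    nodes = math.ceil(target_iops / NODE_IOPS)
    return math.ceil(nodes / MAX_NODES_PER_CLUSTER)


if __name__ == "__main__":
    print(cluster_iops(MAX_NODES_PER_CLUSTER))  # largest single cluster
    print(clusters_needed(12_000_000))          # a fleet beyond one cluster
```

In practice an operator might deliberately choose more, smaller clusters than this rule yields in order to shrink the blast radius further.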


Offering a distributed storage solution isn’t that rare – there are many out there. What caught my interest at SolidFire was 1) their exclusive use of SSDs, and 2) an unusually nice quality of service (QoS) approach. Going exclusively with SSDs makes sense for block storage systems aimed exclusively at high random IOPS workloads but they are not a great solution for storage bound workloads. The storage for these workloads is normally more cost-effectively hosted on hard disk drives. For more detail on where SSDs are a win and where they are not:


·         When SSDs Make Sense in Server Applications

·         When SSDs Don’t Make Sense in Server Applications

·         When SSDs make sense in Client Applications (just about always)


The usual solution to this problem is to do both but SolidFire wanted a single SSD optimized solution that would be cost effective across all workloads.  For many cloud providers, especially the smaller ones, a single versatile solution has significant appeal.


The SolidFire approach is pretty cool. They exploit the fact that SSDs have abundant IOPS but are capacity constrained and trade off IOPS to get capacity. Dave Wright, the SolidFire CEO, describes the design goal as SSD performance at a spinning media price point. The key tricks employed:

·         Multi-Level Cell Flash: They use MLC Flash memory since it is far cheaper than Single-Level Cell. The slightly lower IOPS rate supported by MLC is still more than all but a handful of workloads require, and they can solve the accelerated wear issues with MLC in the software layers above

·         Compression: Aggregate workload dependent gains estimated to be 30 to 70%

·         Data Deduplication: Aggregate workload dependent gains estimated to be 30 to 70%

·         Thin Provisioning: Only allocate blocks to a logical volume as they are actually written to. Many logical volumes never get close to the actual allocated size.

·         Performance Virtualization: Spread all volumes over many servers. Spreading the workload at a sub-volume level allows more control of meeting the individual volume performance SLA with good utilization and without negatively impacting other users


The combination of the capacity gains of thin provisioning, deduplication, and compression brings the dollars per GB of the SolidFire solution very near to some hard disk based solutions at nearly 10x the IOPS performance.


The QoS solution is elegant in that they have three settings that allow multiple classes of storage to be sold. Each logical volume has 2 QoS settings: 1) Bandwidth, and 2) IOPS. Each setting has a min, max, and burst capacity setting.  The min setting sets a hard floor where capacity is reserved to ensure this resource is always available. The burst is the hard ceiling that prevents a single user from consuming excess resource. The max is essentially the target. If you run below the max you build up credits that allow a certain time over the max. The burst limits the potential negative impact of excursions above max on other users.


This system can support workloads that need dead reliable, never changing I/O requirements. It can also support dead reliable average case with rare excursions above (e.g. during a database checkpoint). It’s also easy to support workloads that soak up resources left over after satisfying the most demanding workloads without impacting other users. Overall, a nice simple and very flexible solution to a very difficult problem.
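The min/max/burst behavior described above can be sketched as a small credit scheme. This is my own illustrative model and not SolidFire's implementation: the class name, the one-second tick, the one-credit-per-unused-IOPS accrual rule, and the `credit_cap` parameter are all assumptions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class VolumeQoS:
    """Hypothetical per-volume IOPS policy with min, max, and burst."""
    min_iops: int        # reserved floor, always available
    max_iops: int        # sustained target
    burst_iops: int      # hard ceiling, even when credits remain
    credit_cap: int      # assumed limit on banked credits
    credits: int = 0

    def __post_init__(self) -> None:
        if not 0 <= self.min_iops <= self.max_iops <= self.burst_iops:
            raise ValueError("require 0 <= min <= max <= burst")

    def grant(self, demand: int) -> int:
        """IOPS granted for a one-second tick given the requested demand."""
        if demand <= self.max_iops:
            # Running below max banks the unused headroom as credits.
            unused = self.max_iops - demand
            self.credits = min(self.credit_cap, self.credits + unused)
            return demand
        # Above max: spend credits, but never exceed the burst ceiling.
        wanted_extra = min(demand, self.burst_iops) - self.max_iops
        spent = min(wanted_extra, self.credits)
        self.credits -= spent
        return self.max_iops + spent


def can_admit(volumes: Iterable[VolumeQoS], cluster_iops: int) -> bool:
    """The min settings are reservations, so their sum must fit the cluster."""
    return sum(v.min_iops for v in volumes) <= cluster_iops


if __name__ == "__main__":
    vol = VolumeQoS(min_iops=100, max_iops=1000, burst_iops=2000, credit_cap=3000)
    for demand in (400, 5000, 5000):
        print(demand, "->", vol.grant(demand))
```

A volume sold as "dead reliable, never changing" would set min, max, and burst to the same value, while a scavenger class would set min to zero.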




Monday, August 01, 2011 12:49:03 AM (Pacific Standard Time, UTC-08:00)
 Sunday, July 24, 2011

Great things are happening in the networking market. We’re transitioning from vertically integrated mainframe-like economics to a model similar to what we have in the server world. In the server ecosystem, we have Intel, AMD and others competing to provide the CPU. At the layer above, we have ZT Systems, HP, Dell DCS, SGI, IBM, and many others building servers based upon whichever CPU the customer chooses to use.  Above that layer, we have a wide variety of open source and proprietary software stacks that run on servers from any of the providers, built on silicon from any of the CPU providers. There is competition at all layers in the stack.


The networking world is finally heading down the right path with competing merchant silicon providers for the core data plane processing engines used in networking gear (Application Specific Integrated Circuits or ASICs). These ASICs are supplied by Broadcom, Fulcrum, Marvell, and others.  A given ASIC will be built into switch gear by many competitors. Above that layer, there are open source networking protocol stacks (Quagga and Xorp) as well as ASIC independent commercial protocol stacks (IP Infusion and Aricent) and also proprietary stacks.


This is all good news.  In fact, things are moving so fast we’re already starting to see some consolidation in the industry with Broadcom acquiring Dune Networks in November 2009 and, just last week, Intel acquiring Fulcrum Microsystems. The Intel acquisition of Fulcrum is potentially a good thing for the industry in that Fulcrum may now have even more investment capital. However, Fulcrum was hardly starving before, having been through five rounds of venture funding netting a booming $102M.  Broadcom remains the biggest player in the networking merchant silicon market with a market capitalization of $19.1B.


Hopefully this major acquisition will even further ramp up the pace of innovation and competition in the networking world. I suspect it will even though it feels early for this market to be consolidating.


Related press articles:

·         Intel press release

·         The Register

·         CNET

Related blogs:

·         Andy Bechtolsheim: The March to Merchant Silicon in 10Gbe Cloud Networking

·         Datacenter Networks are in My Way




Sunday, July 24, 2011 3:13:14 PM (Pacific Standard Time, UTC-08:00)
 Thursday, June 23, 2011

Earlier this week, I was in Athens, Greece attending the annual conference of the ACM Special Interest Group on Management of Data. SIGMOD is one of the top two database events held each year, attracting academic researchers and leading practitioners from industry.


I kicked off the conference with the Plenary keynote. In this talk I started with a short retrospection on the industry over the last 20 years. In my early days as a database developer, things were moving incredibly quickly. Customers were loving our products, the industry was growing fast and yet the products really weren’t all that good. You know you are working on important technology when customers are buying like crazy and the products aren’t anywhere close to where they should be.


In my first release as lead architect on DB2 20 years ago, we completely rewrote the DB2 database engine process model, moving from a process-per-connected-user model to a single process where each connection only consumes a single thread, supporting many more concurrent connections. It was a fairly fundamental architectural change completed in a single release. And in that same release, we improved TPC-A performance by a booming factor of 10 and then did 4x more in the next release. It was a fun time and things were moving quickly.


From the mid-90s through to around 2005, the database world went through what I refer to as the dark ages. DBMS code bases had grown to the point where the smallest was more than 4 million lines of code, the commercial system engineering teams would no longer fit in a single building, and the number of database companies shrank throughout the entire period down to only 3 major players. The pace of innovation was glacial and much of the research during the period was, in the words of Bruce Lindsay, “polishing the round ball”. The problem was that the products were actually passably good, customers didn’t have a lot of alternatives, and nothing slows innovation like large teams with huge code bases.


In the last 5 years, the database world has become exciting again. I’m seeing more opportunity in the database world now than at any other time in the last 20 years. It’s now easy to get venture funding to do database products and the number and diversity of viable products is exploding. My talk focused on what changed, why it happened, and some of the technical backdrop influencing these changes.


A background thesis of the talk is that cloud computing solves two of the primary reasons why customers used to be stuck standardizing on a single database engine even though some of their workloads may have run poorly. The first is cost. Cloud computing reduces costs dramatically (some of the cloud economics argument: http://perspectives.mvdirona.com/2009/04/21/McKinseySpeculatesThatCloudComputingMayBeMoreExpensiveThanInternalIT.aspx) and charges by usage rather than via annual enterprise license. One of the favorite lock-ins of the enterprise software world is the enterprise license. Once you’ve signed one, you are completely owned and it’s hard to afford to run another product.  My fundamental rule of enterprise software is that any company that can afford to give you 50% to 80% reduction from “list price” is pretty clearly not a low margin operator. That is the way much of the enterprise computing world continues to work: start with a crazy price, negotiate down to a ½ crazy price, and then feel like a hero while you contribute to incredibly high profit margins.


Cloud computing charges by the use in small increments and any of the major database or open source offerings can be used at low cost. That is certainly a relevant reason but the really significant factor is the offloading of administrative complexity to the cloud provider.  One of the primary reasons to standardize on a single database is that each is so complex to administer that it’s hard to have sufficient skill on staff to manage more than one. Cloud offerings like AWS Relational Database Service transfer much of the administrative work to the cloud provider, making it easy to choose the database that best fits the application and to have many specialized engines in use across a given company.


As costs fall, more workloads become practical and existing workloads get larger.  For example, if analyzing three months of customer usage data has value to the business and it becomes affordable to analyze two years instead, customers correctly want to do it. The plunging cost of computing is fueling database size growth at a super-Moore pace requiring either partitioned (sharded) or parallel DB engines.


Customers now have larger and more complex data problems, they need the products always online, and they are now willing to use a wide variety of specialized solutions if needed. Data intensive workloads are growing quickly and never have there been so many opportunities and so many unsolved or incompletely solved problems. It’s a great time to be working on database systems.


·         The slides from the talk: http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_Sigmod2011Keynote.pdf

·         Proceedings extended abstract: http://www.sigmod2011.org/keynote_1.shtml

·         Video of talk: https://services.choruscall.eu/links/sigmod1106.html# (select June 14th to get to the video)

The talk video is available but, unfortunately, only to ACM digital library subscribers (thanks to Simon Leinen for pointing out the availability of the video link above).



Thursday, June 23, 2011 3:21:11 PM (Pacific Standard Time, UTC-08:00)
 Thursday, June 09, 2011

The Amazon Technology Open House was held Tuesday night at the Amazon South Lake Union Campus. I did a short presentation on the following:


·         Quickening pace of infrastructure innovation

·         Where does the money go?

·         Power distribution infrastructure

·         Mechanical systems

·         Modular & Advanced Building Designs

·         Sea Change in Networking


Slides and notes:




Thursday, June 09, 2011 5:40:21 AM (Pacific Standard Time, UTC-08:00)
 Friday, June 03, 2011

Earlier today Alex Mallet reminded me of the excellent writing of Atul Gawande by sending me a pointer to the New Yorker coverage of Gawande’s commencement address at the Harvard Medical School: Cowboys and Pit Crews.


Four years ago I wrote a couple of blog entries on Gawande’s work but, at the time, my blog was company internal so I’ve not posted these notes here in the past:


As a follow-on to the posting I made on professional engineering (also posted externally http://perspectives.mvdirona.com/2007/11/07/ProfessionalEngineering.aspx) Edwin Young sent me a link to the following talk by Atul Gawande: Outcomes are very Personal. It’s from another domain, medicine, but is a phenomenally good presentation by a surgeon and his core premise applies equally to software: practitioners’ work and the outcomes of that work are spread on a bell curve.  The truly great are much better than the average and often an order of magnitude better than the lowest performing.  His book and the presentation are about the personal attributes and approaches of those at the very top. It’s well worth a view: http://www.youtube.com/watch?v=MbNu6LY5sMY.


In my view, it’s an insightful presentation by a surgeon who loves data, loves understanding why we do well and how we can do better, and is himself relentless in pursuit of doing everything better.  Subsequent to watching the presentation, I read a book by the same author, “Better: A Surgeon's Notes on Performance” (http://www.amazon.com/Better-Surgeons-Performance-Atul-Gawande/dp/0805082115).


Software, like surgery, is part art and part science and there is tremendous variability between the average and the best.  Gawande studies the best in different specializations to understand why the performance of some practitioners is way out there at the positive end of the bell curve and, through a series of essays, makes observations on how to improve the performance of the population overall.  Understanding that human performance is distributed on the bell curve means that, for whatever it is you are doing, there are average performers, terrible performers, and truly gifted performers.  Gawande looks for what he calls positive deviance – it’s always there in a bell curve distributed phenomenon – and tries to understand what those performers do differently.  Worth reading.


A sampling of Gawande’s work:

·         Book: http://www.amazon.com/Better-Surgeons-Performance-Atul-Gawande/dp/0805082115

·         Video: http://www.youtube.com/watch?v=MbNu6LY5sMY

·         Commencement Speech Notes: http://www.newyorker.com/online/blogs/newsdesk/2011/05/atul-gawande-harvard-medical-school-commencement-address.html




Friday, June 03, 2011 7:24:06 AM (Pacific Standard Time, UTC-08:00)
 Tuesday, May 31, 2011

As a boater, there are times when I know our survival is 100% dependent upon the weather conditions, the boat, and the state of its equipment. As a consequence, I think hard about human or equipment failure modes and how to mitigate them. I love reading the excellent reporting by the UK Marine Accident Investigation Branch. This publication covers human and equipment related failures on commercial shipping, fishing, and recreational boats. I read it carefully and I’ve learned considerably from it.


I treat my work in much the same way. At work, human life is not typically at risk but large service failures can be very damaging and require the same care to avoid. As a consequence, at work I also think hard about possible human or equipment failure modes and how to mitigate them.


Wanting to deeply understand unusual failure modes and especially wanting to understand the errors that humans make when managing systems under stress, I spend time reading about system failures. Considerable learning can be drawn from reading about the failures of engineered systems and people under stress. All disasters or near disasters yield some unique lessons and re-enforce some old ones.


The hard part for me is getting enough detail to really learn from the situation. The press reports are often light on details partly because general audiences are not necessarily that interested but there also may be legal or competitive constraints preventing broad publication. NASA, FAA, Coast Guard and some other government reports do get to excellent detail. One analysis of system failure I learned greatly from was Feynman’s analysis of the space shuttle Challenger disaster as part of the Rogers Commission Report.


I just came across another report that is not quite a Feynman classic but it is an excellent, just-the-facts description of a large scale failure. This report, from IEEE Spectrum, titled What Went Wrong in Japan’s Nuclear Reactors, outlines what happened in the eventually catastrophic disaster at Japan’s Fukushima Dai-1 nuclear facility following the Tohoku earthquake and subsequent tsunami. In this report, the terminal failures of 4 of the 6 reactors at the facility are described in more detail than in other accounts of that event I’ve come across.

All disasters are unique in some dimensions. What makes Fukushima particularly unusual is that these failures occurred over multiple weeks rather than the seconds to hours of many events.  This one was relatively slow to develop and even slower to be brought under control. Looking forward, I suspect Fukushima will share some characteristics with Chernobyl, where mitigating the environmental damage is still nowhere close to complete 25 years later. In 1998 the Ukraine government obtained economic aid from the European Bank for Reconstruction and Development to rebuild the failing Chernobyl sarcophagus. It is expected that yet more work will need to be done to continue to prevent dangerous radioactive substances from escaping.  Similarly, I expect the environmental impact of the Fukushima disaster will be fought for decades at great cost both economic and human.


In many ways Fukushima was a classic disaster where a not particularly surprising event, in this case an earthquake near Japan, started the failure and then cascading natural disaster, equipment failure, and human decisions followed to yield an outcome that every aspect of the system design sought to avoid.


I recommend reading the IEEE report linked below. My rough notes from the write-up follow:

·         On March 11 an earthquake registering 9.0 magnitude was experienced off the coast of Japan

·         The tsunami hit the plant destroying power distribution gear cutting off power to the Fukushima facility

·         Backup generators and switch gear were also disabled by the tsunami

·         Reactor building integrity was maintained through the earthquake and tsunami and the three reactors that were active at that point were all shut down properly

·         Due to the power failure and the damage to distribution gear and generators, plant cooling systems were not operating at any of the reactors or at the spent fuel rod storage pools

·         Even though the nuclear reaction had been stopped in the three reactors that were operational when the tsunami hit (reactors 1, 2, & 3), considerable heat was still being created putting the reactors at risk of meltdown. Meltdown is a condition where reactor core over temperature occurs, the coolant is boiled off, the fuel rods melt and form a pool of very hot, highly radioactive fuel in the bottom of the reactor. This hot, radioactive fluid then rapidly breaks down steel and concrete in the containment vessel and possibly escapes to the environment.

·         Another area of risk from the failed cooling systems is the spent nuclear fuel rod storage pools. These pools are also housed inside the reactor buildings near the primary containment vessel where the active nuclear reaction actually takes place. Although the fuel rods are no longer contributing to a nuclear reaction, they are both highly radioactive and still producing sufficient heat that active cooling is required. Without cooling, these rods can heat the storage pool to the point that it boils off the cooling water and presents a risk similar to the active rods inside the primary containment vessel.

·         I find it surprising that both the spent rod storage and the shut down reactor cores don’t appear to fail safe and self-stabilize when cooling water is removed given the considerably higher than zero probability of power failure and the seriously negative impact of radioactive release to the environment.

·         Events at Reactor #1:

o   March 12, a day after the power failure, heat in the recently shutdown reactor built up until the (not circulating) cooling water began to be boiled off.

o   As the water level fell, the now exposed fuel rods reacted with the steam in the primary containment vessel, and began producing hydrogen gas

o   The pressure rose to dangerous levels in the primary containment vessel and operators decided to vent the primary containment vessel into the reactor building.

o   The vented hydrogen gas, when exposed to the relatively oxygen-rich environment in the reactor building, exploded, blowing the top off the reactor building

o   The explosion may have also damaged the primary containment vessel and definitely released radioactive material

o   The operators chose to pump seawater into the building in an effort to control the escalating temperature inside the reactor and to avoid core meltdown

o   March 29, radioactive water was found outside the reactor building

o   April 5, reactor core temperatures have begun to fall indicating the system is coming back into control

o   Radioactivity levels in the building are very high and operators are injecting nitrogen to reduce the likelihood of subsequent hydrogen explosions.

o   May 12, TEPCO officials confirmed that the reactor had suffered a core meltdown and the bottom of the reactor building may be leaking highly radioactive water into the environment.

·         Events at Reactor #3:

o   March 14, 3 days after the tsunami and 2 days after the roof was blown off the Reactor #1 containment building, the same thing happened on Reactor #3

o   This explosion occurred despite plant operators pumping large quantities of cooling sea water into the reactor building

o   March 17, steam begins billowing from the reactor building confirming that the primary containment vessel was damaged and releasing radioactive compounds.

o   Helicopters dumped water on the building and police water cannons were used to pour water down onto the building.

o   Water was sprayed on the building for days with some interruptions as radiation levels rose sufficiently high that work had to be stopped.

o   March 24, workers laying power cables in an attempt to restore power to Reactor #3 waded into highly radioactive water and required hospitalization.

o   March 28, dangerous plutonium was detected in the environment near Reactor #3.

·         Events at Reactor #2:

o   March 15, 4 days after the tsunami, 3 days after the roof was blown off Reactor #1, and a day after the roof was blown off Reactor #3, a serious explosion occurred at Reactor #2.

o   Reactor #2 was later confirmed to have experienced at least a partial core meltdown

o   March 27, highly radioactive water discovered outside of reactor building #2.

·         Subsequently, large quantities of uncontained radioactive water have been found throughout the multi-reactor plant, and the turbine facilities are flooded as are the cabling tunnels between the buildings. Serious radioactive water leaks into the ocean have been detected and subsequently corrected, in one case by injecting 6,000 liters of liquid glass into the ground near the leak.

·         April 4th, 11,500 tons of radioactive water is pumped into the ocean. This water is 100x above the legal safety limit but was pumped into the environment in the hope that the storage facilities can be used to contain waste water that is 10,000x the radioactivity limit for environmental release.

·         The spent fuel pools at the inactive reactors 4, 5, & 6 were all slowly overheating as a consequence of there being no cooling water. The Reactor #4 cooling pool either boiled off its water or it leaked off as a result of earthquake damage. The spent fuel rods exposed to atmosphere without cooling led to fires inside Reactor building #4

·         Outcome:

o   Fukushima is now rated to be as serious as Chernobyl, having been classified as a level 7 event, the worst on the International Nuclear Event Scale. However, it is still considered to have released only 5 to 10% of the radiation released by Chernobyl.

o   All residents within 20 km evacuated

o   Voluntary evacuation of all residents between 20 and 30 km.

o   Agricultural products including milk and vegetables from the region contaminated

o   Tokyo’s tap water declared unfit for infants for 1 day

o   Decades of cleanup and containment remain


The report: What Went Wrong in Japan's Nuclear Reactors: http://spectrum.ieee.org/tech-talk/energy/nuclear/explainer-what-went-wrong-in-japans-nuclear-reactors.


We all wish the situation had been avoided and those of us involved in engineering projects, whether they be life critical systems or not, need to ensure that the lessons from this one are learned well and applied faithfully to new designs. I won’t speculate on human risk in the efforts spent to mitigate this disaster but, clearly, the workers that brought these systems back under control and continue to manage the environmental impact are heroes and deserve our collective thanks.




Tuesday, May 31, 2011 5:39:33 AM (Pacific Standard Time, UTC-08:00)
 Wednesday, May 25, 2011

The European Data Center Summit 2011 was held yesterday at SihlCity CinCenter in Zurich. Google Senior VP Urs Hoelzle kicked off the event talking about why data center efficiency is important both economically and socially.  He went on to point out that the oft quoted number that US data centers represent 2% of total energy consumption is usually misunderstood. The actual data point is that 2% of the US energy budget is spent on IT, of which the vast majority is client side systems. This is unsurprising but a super important clarification.  The full breakdown of this data:


·         2% of US power

o   Datacenters:              14%

o   Telecom:                     37%

o   Client Device:            50%


The net is that 14% of 2%, or 0.28%, of the US power budget is consumed in datacenters. This is a far smaller but still very relevant number. In fact, that is the primary motivator behind the conference: how to make the best practices from industry leaders in datacenter efficiency available more broadly.


To help understand why this is important, consider how datacenter energy consumption breaks down by facility size:

·         Of the 0.28% energy consumption by datacenters:

o   Small:            41%

o   Medium:     31%

o   Large:            28%


This latter set of statistics predictably shows that the very largest data centers consume 28% of the data center energy budget while small and medium centers consume 72%. High scale datacenter operators have large staffs of experts focused on increasing energy efficiency, but small and medium sized centers can’t afford this overhead at their scale. Urs’s point, and the motivation behind the conference, is that we need to get industry best practices available to all data center operations.
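
The arithmetic behind these shares is easy to check. The short Python sketch below uses only the percentages quoted above; the variable names are mine and purely illustrative.

```python
# Shares quoted in the talk; the names are illustrative only.
IT_SHARE_OF_US_POWER = 0.02      # IT as a fraction of US power
DATACENTER_SHARE_OF_IT = 0.14    # datacenters as a fraction of IT power
SIZE_SHARE = {"small": 0.41, "medium": 0.31, "large": 0.28}

# Datacenters as a fraction of total US power: 14% of 2%.
datacenter_share = IT_SHARE_OF_US_POWER * DATACENTER_SHARE_OF_IT

# Each size class expressed as a fraction of total US power.
share_by_size = {
    size: datacenter_share * fraction for size, fraction in SIZE_SHARE.items()
}

# Portion of datacenter energy used outside the very largest facilities.
small_and_medium = SIZE_SHARE["small"] + SIZE_SHARE["medium"]

print(f"Datacenters: {datacenter_share:.2%} of US power")
for size, share in share_by_size.items():
    print(f"  {size:<6} {share:.3%} of US power")
print(f"Small + medium: {small_and_medium:.0%} of datacenter energy")
```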


The driving goal behind the conference is that extremely efficient datacenter operations are possible using only broadly understood techniques. No magic is required.  It is true that the very large operators will continue to enjoy even better efficiency but existing industry best practices can easily get even small operators with limited budgets to within a few points of the same efficiency levels.


Using Power Usage Effectiveness (PUE) as the measure, the industry leaders are at 1.1 to 1.2, where 1.2 means that every watt delivered to the servers requires 1.2 watts to be delivered from the utility. Effectively, it is a measure of the overhead or efficiency of the datacenter infrastructure. Unfortunately, the average remains in the 1.8 to 2.0 range and the worst facilities can be as poor as 3.0.
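
To make the PUE numbers concrete, here is a minimal Python sketch that converts a PUE into utility watts per server watt and into the fraction of utility power lost to infrastructure overhead. The function names are my own, and the calculation assumes PUE is simply total facility power divided by power delivered to the servers.

```python
def utility_watts(server_watts: float, pue: float) -> float:
    """Utility power needed to deliver `server_watts` to the servers."""
    return server_watts * pue


def overhead_fraction(pue: float) -> float:
    """Fraction of utility power that never reaches the servers."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return 1.0 - 1.0 / pue


# PUE values mentioned above: leaders, industry average, worst facilities.
for label, pue in [
    ("leader", 1.1),
    ("leader", 1.2),
    ("average", 1.8),
    ("average", 2.0),
    ("worst", 3.0),
]:
    print(
        f"{label:<8} PUE {pue:.1f}: "
        f"{utility_watts(1.0, pue):.1f} W from the utility per server watt, "
        f"{overhead_fraction(pue):.0%} of utility power is overhead"
    )
```

Seen this way, moving from a PUE of 2.0 to 1.2 cuts the overhead share of the utility bill from half to roughly a sixth.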


Summarizing: Datacenters consume 0.28% of the annual US energy budget. 72% of that energy is consumed by small and medium sized centers that tend towards the lower efficiency levels.


The Datacenter Efficiency conference focused on making cost effective techniques more broadly understood, showing how a PUE of 1.5 is available to all without large teams of experts or huge expense. This is good for the environment and less expensive to operate.






Wednesday, May 25, 2011 5:52:18 AM (Pacific Standard Time, UTC-08:00)
 Monday, May 23, 2011

Guido van Rossum was at Amazon a week back doing a talk. Guido presented 21 Years of Python: From Pet Project to Programming Language of the Year.


The slides are linked below and my rough notes follow:

·         Significant Python influencers:

o   Algol 60, Pascal, C

o   ABC

o   Modula-2+ and Modula-3

o   Lisp and Icon

·         ABC was the strongest language influencer of this set

·         ABC design goals:

o   Professionals but not professional programmers (lab personnel, scientists, etc.)

o   Easy to teach, easy to learn, easy to use

·         Parts of ABC most liked by Guido:

o   Design iterations based on user testing

§  E.g. colon before indented blocks

o   Simple design: IF, WHILE, FOR, …

o   Indentation for grouping (Knuth, occam)

o   Tuples, lists, dictionaries (though changed)

o   Immutable data types

o   No limits

o   The >>> prompt

·         Parts of ABC that most needed improvement:

o   Monolithic design – not extensible

§  E.g. no graphics, not easily added

o   Invented non-standard terminology

§  E.g. “how-to” instead of “procedure”

o   ALL'CAPS keywords

o   No integration with rest of system

§  No file-based I/O (persistent variables instead)

·         The beginnings of Python:

o   Amoeba project at CWI

§  Writing apps in C and sh and wanting something in between

·         Python design philosophy:

o   Borrow ideas whenever it makes sense

o   As simple as possible, no simpler (Einstein)

o   Do one thing well (UNIX)

o   Don’t fret about performance (fix it later)

o   Go with the flow (don’t fight environment)

o   Perfection is the enemy of the good

o   Cutting corners is okay (get back to it later)

·         User Centric Design Philosophy:

o   Avoid platform ties, but not religiously

o   Don’t bother the user with details

o   Discourage but allow coding to the platform

o   Offer multiple levels of extensibility

o   Errors should not be fatal, if possible

o   Errors should never pass silently

o   Don’t blame the user for bugs in Python

·         Core language stabilized quickly in the 1990 to 1991 timeframe

·         Early days of active Python community:

o   1990 – internal at CWI

§  More internal use than ABC ever had

§  Internal contributors

o   1991 – first release; python-list@cwi.nl

o   1994 – USENET group comp.lang.python

o   1994 – first workshop (NIST)

o   1995-1999 – from workshops to conferences

o   1995 – Python Software Association

o   1997 – www.python.org goes online

o   1999 – Python Consortium

§  Modeled after X Consortium

o   2001 – Python Software Foundation

§  Modeled after Apache Software Foundation

·         Present day Python community:

o   PSF runs largest annual Python conference

§  PyCon Atlanta in 2011: 1500 attendees

§  2012-2013: Toronto; 2014-2015: Bay area

§  Also sponsors regional PyCons world-wide

o   EuroPython since 2002

o   Many local events, user groups

o   python.org

o   docs.python.org, mail.python.org, bugs.python.org, hg.python.org, planet.python.org, wiki.python.org

o   Stack Overflow, etc.

·         Python 2 vs Python 3

o   Fixing deep bugs intrinsic in the design

o   Avoid two extremes:

§  perpetual backwards compatibility (C++)

§  rewrite from scratch (Perl 6)

o   Our approach:

§  evolve the implementation gradually

§  some backwards incompatibilities

§  separate tools to help users cope
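
As a small illustration of the "some backwards incompatibilities" point, the Python 3 sketch below shows three well-known changes from Python 2. The examples are my own choice, not taken from Guido's slides, and tools such as 2to3 are one example of the separate tooling that helps with this kind of change.

```python
# Python 3 code; the comments note how Python 2 behaved.

# 1. print is a function (it was a statement in Python 2).
print("hello from Python 3")

# 2. "/" on integers is true division (Python 2 floored it to 3).
true_division = 7 / 2
floor_division = 7 // 2   # explicit floor division, same in both versions

# 3. Text and bytes are distinct types (Python 2's str was a byte string).
text = "café"
encoded = text.encode("utf-8")

print(true_division, floor_division)
print(type(text).__name__, len(text))
print(type(encoded).__name__, len(encoded))
```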


Thanks to Guido for doing the well received Python presentation.


Guido’s slides and blog URLs:

·         Slides: http://mvdirona.com/jrh/TalksAndPapers/GuidoVanRossum_21_years_of_python.pdf

·         Blog: http://python-history.blogspot.com






Monday, May 23, 2011 9:42:12 AM (Pacific Standard Time, UTC-08:00)
 Friday, May 20, 2011

I invited Nikhil Handigol to present at Amazon earlier this week. Nikhil is a PhD candidate at Stanford University working with networking legend Nick McKeown on the Software Defined Networking team. Software defined networking is a concept coined by Nick where the research team is separating the networking control plane from the data plane. The goal is a fast and dumb routing engine with the control plane factored out and supporting an open programming platform.


From Nikhil’s presentation, we see the control plane hoisted up to a central, replicated network O/S configuring the distributed routing engines in each switch.


One implementation of software defined networking is OpenFlow, where each router supports the OpenFlow protocol and a central OpenFlow Controller computes routing tables that are installed in each router.
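
The idea of a central controller computing tables for every switch can be sketched in a few lines of Python. This is a toy model only: it does not use the OpenFlow protocol or any real controller API, the topology is hypothetical, and hop-count shortest paths found with breadth-first search are my assumption about what "computing routing tables" might look like in the simplest case.

```python
from collections import deque


def compute_forwarding_tables(topology):
    """Return {switch: {destination: next_hop}} using hop-count shortest paths.

    `topology` maps each switch name to a list of its neighbors and is
    assumed to be symmetric (every link appears in both directions).
    """
    tables = {switch: {} for switch in topology}
    for destination in topology:
        # Breadth-first search outward from the destination. The switch we
        # reached a neighbor from is that neighbor's next hop toward it.
        visited = {destination}
        queue = deque([destination])
        while queue:
            current = queue.popleft()
            for neighbor in topology[current]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    tables[neighbor][destination] = current
                    queue.append(neighbor)
    return tables


# Hypothetical four-switch network: a line A-B-C with D hanging off B.
topology = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B"],
    "D": ["B"],
}

# The "controller" sees the whole topology and produces one table per switch.
tables = compute_forwarding_tables(topology)
for switch, table in sorted(tables.items()):
    print(switch, table)
```

The switches themselves stay simple: each one only needs to look up a destination in the table it was handed, which is the "fast and dumb" data plane described above.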


What makes OpenFlow especially interesting is that it’s simple, easy to implement, and getting broad industry support with the Open Networking Foundation as the central organizing body.  The Open Networking Foundation’s primary mission is to advance software defined networking using OpenFlow as the protocol. Founding members of the Open Networking Foundation are Deutsche Telekom, Facebook, Google, Microsoft, Verizon, and Yahoo!.  Also included are networking equipment providers including: Broadcom, Dell, Cisco, Force10, HP, Juniper, Marvell, Mellanox, and many others.


Today, most networking equipment is shipped as a vertically integrated stack including both the control and data planes. There are many reasons why this is not good for the industry. The Stanford team argues it blocks innovation in that researchers can’t try new protocols with a closed stack without a programming model. I agree. This is a problem for both academia and industry but my dislike of the current model is much broader. In Networking: The Last Bastion of Mainframe Computing, I made the case that this vertically integrated approach is artificially holding prices high and slowing the pace of innovation. A quick summary of the argument:


When networking equipment is purchased, it’s packaged as a single sourced, vertically integrated stack. In contrast, in the commodity server world, starting at the most basic component, CPUs are multi-sourced. We can get CPUs from AMD and Intel. Compatible servers built from either Intel or AMD CPUs are available from HP, Dell, IBM, SGI, ZT Systems, Silicon Mechanics, and many others.  Any of these servers can support both proprietary and open source operating systems. The commodity server world is open and multi-sourced at every layer in the stack.


Open, multi-layer hardware and software stacks encourage innovation and rapidly drive down costs. The server world is clear evidence of what is possible when such an ecosystem emerges.


I’m excited about software defined networking because it provides a clean interface allowing switch providers to both innovate and compete. An additional benefit is that SDN allows innovation and experimentation at the network protocol layer.


In Nikhil’s talk at Amazon, he explored integrating load balancing functionality into the network routing fabric. The team started with the hypothesis that load balancing is really just smart routing. They then implemented a distributed load balancing fabric by adding load balancing support to network routers using Software Defined Networking. Essentially, they distribute the load balancing functionality throughout the network. What’s unusual here is that the ideas could be tested and tried over a nine-campus, North America-wide network with only 500 lines of code. With conventional network protocol stacks, this research work would have been impossible in that vendors don’t open up protocol stacks. And, even if they did, it would have been complex and very time consuming.
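
To give a feel for "load balancing is really just smart routing", here is a toy Python sketch of a controller that assigns each new flow to the least loaded server replica and remembers the decision as a flow rule. It is not the Stanford team's implementation and uses no real SDN API; the class, the least-active-flows policy, and the server names are all hypothetical.

```python
class ToyLoadBalancingController:
    """Hypothetical controller that picks a server replica for each new flow."""

    def __init__(self, replicas):
        self.active_flows = {replica: 0 for replica in replicas}
        self.flow_rules = {}  # (client, port) -> chosen replica

    def route_new_flow(self, client, port):
        """Return the replica for this flow, installing a rule if needed."""
        key = (client, port)
        if key in self.flow_rules:
            # A rule already exists, so the flow keeps its replica.
            return self.flow_rules[key]
        # Pick the replica with the fewest active flows; ties go to the
        # lexicographically smallest name so the choice is deterministic.
        replica = min(
            self.active_flows, key=lambda r: (self.active_flows[r], r)
        )
        self.active_flows[replica] += 1
        self.flow_rules[key] = replica
        return replica

    def flow_finished(self, client, port):
        """Remove the rule for a finished flow and release its capacity."""
        replica = self.flow_rules.pop((client, port), None)
        if replica is not None:
            self.active_flows[replica] -= 1


controller = ToyLoadBalancingController(["server-1", "server-2", "server-3"])
choices = [controller.route_new_flow(f"client-{i}", 80) for i in range(6)]
print(choices)
```

In a real network the decision would also need to account for path congestion, and the rule would be pushed to the switches along the chosen path rather than kept in a dictionary.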






Friday, May 20, 2011 10:55:51 AM (Pacific Standard Time, UTC-08:00)
 Wednesday, May 04, 2011

This note looks at the Open Compute Project distributed Uninterruptible Power Supply (UPS) and server Power Supply Unit (PSU). This is the last in a series of notes looking at the Open Compute Project. Previous articles include:

·         Open Compute Project

·         Open Compute Server Design

·         Open Compute Mechanical Design


The Open Compute design uses a semi-distributed uninterruptible power supply (UPS) system. Most data centers use central UPS systems where a large UPS is part of the central power distribution system. In this design, the UPS is in the 480VAC 3 phase part of the central power distribution system prior to the step down to 208VAC. Typical capacities range from 750kVA to 1,000kVA. An alternative approach is a distributed UPS like that used by a previous generation Google server design.

In a distributed UPS, each server has its own 12VDC battery to serve as backup power. This design has the advantage of being very reliable with the battery directly connected to the server 12V rail. Another important advantage is the small fault containment zone (small “blast radius”) where a UPS failure will only impact a single server. With a central UPS, a failure could drop the load on 100 racks of servers or more. But, there are some downsides of distributed UPS. The first is that batteries are stored with the servers. Batteries take up considerable space, can emit corrosive gasses, don’t operate well at high temperature, and require chargers and battery monitoring circuits. As much as I like aspects of the distributed UPS design, it’s hard to do cost effectively and, consequently, is very uncommon.


The Open Compute UPS design is a semi-distributed approach where each UPS is on the floor with the servers but, rather than having 1 UPS per server (distributed UPS) or 1 UPS per on the order of 100 racks (central UPS, roughly 4,000 servers), they have 1 UPS per 6 racks (180 servers).
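
The fault containment trade-off among the three designs can be put in numbers with a short Python sketch. The servers-per-UPS figures are the ones quoted above; the 36,000 server fleet is an assumption I picked only to make the division come out evenly.

```python
import math

# Servers behind one UPS in each design, as quoted in the post.
SERVERS_PER_UPS = {
    "distributed": 1,
    "semi-distributed (Open Compute)": 180,
    "central": 4000,
}

# Assumed fleet size, chosen only so the numbers divide evenly.
FLEET_SERVERS = 36_000


def ups_failure_impact(servers_per_ups, fleet_servers):
    """Return (UPS units needed, fraction of fleet lost when one UPS fails)."""
    ups_count = math.ceil(fleet_servers / servers_per_ups)
    fraction_lost = min(servers_per_ups, fleet_servers) / fleet_servers
    return ups_count, fraction_lost


for design, servers in SERVERS_PER_UPS.items():
    count, lost = ups_failure_impact(servers, FLEET_SERVERS)
    print(f"{design}: {count} UPS units, one failure drops {lost:.3%} of servers")
```

The semi-distributed design sits between the extremes: far fewer units to buy and monitor than one per server, while a single failure still affects only a small slice of the fleet.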




In this design the battery rack is the central rack flanked by two triple racks of servers. Like the server racks, the UPS is delivered 480VAC 3 phase directly. At the top of the battery rack, they have control circuitry, circuit breakers, and rectifiers to charge the battery banks.


What’s somewhat unusual is that the output stage of the UPS doesn’t include inverters to convert the direct current back to the alternating current required by a standard server PSU. Instead, the UPS output is 48V direct current, which is delivered directly to the three racks on either side of the UPS. This has the upside of avoiding the final inverter stage, which increases efficiency. There is a cost to avoiding converting back to AC. The most important downside is that they effectively need two server power supplies, where one accepts 277VAC and the other accepts 48VDC from the UPS. The second disadvantage is that 48V distribution is inefficient over longer distances due to conductor losses at high amperage.
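
The conductor loss point follows from two basic relations: for a fixed load the current is I = P / V, and the power lost in the conductor is I²R. The Python sketch below compares 48V and 277V delivery of the same load. The 1,000W load and 0.01 ohm conductor resistance are assumed values chosen for illustration, and the model is deliberately simplified (purely resistive conductor, AC treated as if at unity power factor).

```python
def conductor_loss_watts(load_watts, volts, conductor_ohms):
    """Resistive loss in the conductor feeding a load at the given voltage."""
    current_amps = load_watts / volts        # I = P / V
    return current_amps ** 2 * conductor_ohms  # loss = I^2 * R


# Assumed values for illustration only.
LOAD_WATTS = 1000.0
CONDUCTOR_OHMS = 0.01

loss_48 = conductor_loss_watts(LOAD_WATTS, 48.0, CONDUCTOR_OHMS)
loss_277 = conductor_loss_watts(LOAD_WATTS, 277.0, CONDUCTOR_OHMS)

print(f"48V:  {LOAD_WATTS / 48.0:.1f} A, {loss_48:.2f} W lost in the conductor")
print(f"277V: {LOAD_WATTS / 277.0:.1f} A, {loss_277:.2f} W lost in the conductor")
print(f"Loss ratio 48V vs 277V: {loss_48 / loss_277:.1f}x")
```

Because the loss scales with the square of the current, the ratio depends only on the two voltages, not on the assumed load or resistance. Resistance grows with conductor length, which is why short runs matter so much at 48V.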


The problem with power distribution efficiency is partially mitigated by keeping the UPS close to the servers: the 6 racks it feeds are on either side of the UPS, so the distances are actually quite short. And, since the UPS is only used during the short periods between the start of a power failure and the generators taking over, the efficiency of distribution is actually not that important a factor. The other issue remains: each server power supply is effectively two independent PSUs.


The server PSU looks fairly conventional in that it’s a single box. But, included in the single box, are two independent PSUs and some control circuitry. This has the downside of forcing the use of a custom, non-commodity power supply. Lower volume components tend to cost more. However, the power supply is a small part of the cost of a server so this additional cost won’t have a substantially negative impact. And, it’s a nice, reliable design with a small fault containment zone which I really like.


The Open Compute UPS and power distribution system avoids one level of power conversion common in most data centers, delivers somewhat higher voltages (277VAC rather than 208VAC) close to the load, and has the advantage of a small fault zone.




Wednesday, May 04, 2011 5:24:34 AM (Pacific Standard Time, UTC-08:00)
 Friday, April 29, 2011

Google cordially invites you to participate in a European Summit on sustainable Data Centres. This event will focus on energy-efficiency best practices that can be applied to multi-MW custom-designed facilities, office closets, and everything in between. Google and other industry leaders will present case studies that highlight easy, cost-effective practices to enhance the energy performance of Data Centres.

The summit will also include a dedicated session on cooling. Presenters will detail climate-specific implementations of free cooling as well as novel ways to utilise locally available opportunities. We will also debate climate-independent PUE targets.

The agenda includes presentations and panel discussions featuring Amazon, DeepGreen, eBay, Google, IBM, Microsoft, Norman Disney & Young, PlusServer, Telecity Group, The Green Grid, UK's Chartered Institute for IT, UBS and others.

Attendance is free. However, space is limited and we therefore encourage you to register online at your earliest convenience. Your participation will be confirmed.

We look forward to seeing you and your colleagues in Zurich!


Friday, April 29, 2011 5:18:40 AM (Pacific Standard Time, UTC-08:00)

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

All Content © 2015, James Hamilton