Friday, March 13, 2009

Google Maps is a wonderfully useful tool for finding locations around town or around the globe.  Microsoft Live Labs Seadragon is a developer tool-kit for navigating wall-sized or larger displays using pan and zoom. Here’s the same basic tiled picture display technique (different implementation) applied to navigating the Linux kernel: Linux Kernel Map.


The kernel map has a component-by-component breakdown of the entire Linux kernel, from hardware interfaces up to user-space system calls and most of what is in between. And it’s all navigable using zoom and pan. I’m not sure what I would actually use the kernel map for, but it’s kind of cool. If you could graphically zoom from the map down to the source, it might actually be a useful day-to-day tool rather than a one-time curiosity.


Originally posted via Slashdot (Navigating the Linux Kernel like Google Maps) and sent my way by John Smiley of Amazon.




James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | |  | blog:


Friday, March 13, 2009 5:36:54 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Thursday, March 12, 2009

February 28th, CloudCamp Seattle was held at an Amazon facility in Seattle. CloudCamp is described by its organizers as an unconference where early adopters of cloud computing technologies exchange ideas: with the rapid change occurring in the industry, we need a place to meet and share our experiences, challenges, and solutions. At CloudCamp, you are encouraged to share your thoughts in several open discussions, as we strive for the advancement of cloud computing. End users, IT professionals, and vendors are all encouraged to participate.


The CloudCamp schedule is at:


Jeanine Johnson attended the event and took excellent notes. Jeanine’s notes follow.


It began with a series of “lightning presentations” – 5-minute presentations on cloud topics that are now online. Afterwards, there was a Q&A session with participants who volunteered to share their expertise. Then, 12 topics were chosen by popular vote to be discussed in an “open space” format, in which the volunteer who suggested the topic facilitated its 1-hour discussion.


Highlights from the lightning presentations:

·         AWS has launched several large data sets (10-220GB) in the cloud and made them publicly available. Example data sets are the human genome and US census data; large data sets that would take hours, days, or even weeks to download locally even with a fast Internet connection.

·         A pyramid was drawn, with SaaS (e.g. Hotmail, Salesforce) on top, followed by PaaS (e.g. Google App Engine, Salesforce API), IaaS (e.g. Amazon, Azure; which leverage virtualization), and “traditional hosting” as the pyramid’s foundation, which was a nice and simple rendition of the cloud stack. In addition, SaaS applications were shown to have the most functionality, and traveling down the pyramid stack trades functionality for flexibility.


Other than that info, the lightning presentations were too brief, with no opportunity for Q&A, to learn much. After the lightning presentations, open space discussions were held. I attended three: 1) scaling web apps, 2) scaling MySQL, and 3) launching MMOGs (massively multiplayer online games) in the cloud. Notes for each session follow.



One company volunteered itself as a case study for the group of 20ish people. They run 30 physical servers, with 8 front-end Apache web servers on top of 1 scaled-up MySQL database, and they use PHP channels to access their Drupal content. Their MySQL machine has 16 processors and 32GB RAM, but it is maxed out, and they’re having trouble scaling because they currently hover around 30k concurrent connections, with up to 8x that during peak usage. They’re also bottlenecked by their NFS server, and they use basic round-robin load balancing.


Using CloudFront was suggested for the many images they currently store in Drupal. Unfortunately, CloudFront takes up to 24 hours to notice content changes, which wouldn’t work for them. So the discussion began around how to scale Drupal, but it quickly morphed into key-value-pair storage systems (e.g. SimpleDB) versus relational databases (e.g. MySQL) for storing backend data.


After some discussion of where business logic should reside, in stored procedures and triggers or in the code via an MVC paradigm, the group agreed that “you have to know your data: Do you need real-time consistency? Or eventual consistency?”


Hadoop was briefly discussed, but once someone noted that the popular web-development frameworks Rails and Django steer folks towards relational databases, the discussion turned to scaling MySQL. Best-practice tips for scaling MySQL were:

·         When scaling up, memory becomes a bottleneck, so use memcached to extend your system’s lifespan.

·         Use MySQL Cluster.

·         Use MySQL Proxy and shard your database, such that users are associated with a specific cluster (devs turn to sharding because horizontal scaling for writes isn’t as effective as it is for reads; i.e., replication processing becomes untenable).
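The user-to-shard association in that last tip is typically implemented as a stable hash of the user ID; a minimal Python sketch of the idea (the `ShardRouter` class and the shard count are hypothetical – a real deployment would route through MySQL Proxy or application code):

```python
import hashlib


class ShardRouter:
    """Route each user to a fixed database shard so all of that
    user's writes land on one cluster."""

    def __init__(self, shard_count):
        self.shard_count = shard_count

    def shard_for(self, user_id):
        # Hash the user id so the mapping is stable and evenly spread.
        digest = hashlib.md5(str(user_id).encode()).hexdigest()
        return int(digest, 16) % self.shard_count


router = ShardRouter(shard_count=4)
shard = router.shard_for("alice")
assert 0 <= shard < 4
# The same user always maps to the same shard.
assert router.shard_for("alice") == shard
```

In practice the hash often maps to a lookup table of shards rather than directly to a modulus, so shards can be rebalanced without rehashing every user.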


Other open source technologies mentioned included:

·         Gallery2, an open source photo album.

·         Jingle, a Jabber-based VoIP technology.



Someone from the group of 10ish people volunteered to whiteboard the “ways to scale MySQL,” which were:

·         Master/slave, which can use Dolphin/Sakila, but becomes inefficient around 8+ machines.

·         MySQL Proxy, and then replicate each machine behind the proxy.

·         Master:master topology using synchronous replication.

·         Master ring topology using MySQL Proxy. It works well, and the replication overhead can be helped by adding more machines, but several thought this setup would be hard to implement in the cloud.

·         Mesh topology (if you have the right hardware). This is how a lot of high-performance systems work, but recovery and management are hard.

·         Scale up and run as few slaves as possible – some felt that this “simple” solution is what generally works best.


Someone then drew an “HA Drupal stack in the cloud,” which consisted of 3 front-end load balancers with hot swap on failure to the 2nd or 3rd machine, followed by 2 web servers and 2 master/slave databases in the backend. If using Drupal, 2 additional NFS servers should be set up for static content storage with hot swap (i.e., fast MAC failover). However, it was recommended that Drupal be replaced with a CDN when the system begins to need scaling. This configuration in the Amazon cloud costs around $700 monthly to run (plus network traffic).


Memcached was mentioned as a possibility as well.



This topic was suggested by a game developer lead. He explained to the crowd of 10ish people that MMOs require persistent connections to servers, and their concurrent connections have a relatively high standard deviation daily, with a weekly trend that peaks around Saturday and Sunday. MMO producers must plan their capacity a couple of months in advance of publishing their game. And since up to 50% of an MMO’s subscriber base is active on the first day, they usually end up with leftover capacity after launch, when active subscribers drop to 20% of the base and continue to dwindle until the end of the game’s lifecycle. As a result, it’d be ideal to get MMOGs into the cloud, but no one in the room knew how to get around the latency induced by virtualization, which is too much for flashy MMOGs (although the 5%-ish perf hit is fine for asynchronous or low-graphics games). On a side note, iGames was mentioned as a good way to market games.
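The stranded-capacity argument can be quantified with the percentages from the discussion; a back-of-envelope sketch (the subscriber count is an invented illustration):

```python
subscribers = 1_000_000        # hypothetical subscriber base
launch_active = 0.50           # up to 50% of the base active on day one
steady_active = 0.20           # settles to ~20% of the base after launch

launch_capacity = subscribers * launch_active   # what must be provisioned
steady_load = subscribers * steady_active
stranded = 1 - steady_load / launch_capacity    # idle fraction after launch

print(f"provisioned for {launch_capacity:,.0f}, steady load {steady_load:,.0f}")
print(f"{stranded:.0%} of launch capacity sits idle")  # → 60% sits idle
```

That 60%-idle figure, before the subscriber base even starts to dwindle, is exactly the capacity a pay-as-you-go cloud would let an MMOG publisher hand back.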


Afterwards, those people that were left went to the Elysian on 1st for drinks, and continued their cloud discussions.




Thursday, March 12, 2009 5:06:17 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
 Tuesday, March 10, 2009

Whenever I see a huge performance number without the denominator, I shake my head. It’s easy to get a big performance number on almost any dimension; what is far more difficult is getting great work done per dollar. Performance alone is not interesting.


I’m super interested in flash SSDs and see great potential for SSDs in both client- and server-side systems. But our industry is somewhat hype driven. When I first started working with SSDs and their application to server workloads, many thought it was a crazy idea, pointing out that the write rates were poor and the devices would wear out in days. The former has been fixed in Gen 2 devices and the latter was never true. Now SSDs are climbing up the hype meter and I find myself arguing on the other side: they don’t solve all problems. I still see the same advantages I saw before, but I keep seeing SSDs proposed for applications where they simply are not the best price/performing solution.


Rather than write the article about where SSDs are a poor choice, I wrote two articles on where they are a good one:

·         When SSDs Make Sense in Server Applications

·         When SSDs Make Sense in Client Applications


SSDs are really poor choices for large sequential workloads. If you want aggregate sequential bandwidth, disks deliver it far cheaper.


In this article and the referenced paper (Microslice Servers), I argue in more detail why performance is a poor measure for servers on any dimension. It’s work done per dollar and work done per watt that we should be measuring.


I recently came across a fun little video, Samsung SSD Awesomeness. It’s actually a Samsung SSD advertisement. Overall, the video is fun. It’s creative and sufficiently effective that I watched the entire thing, and you might as well. Clearly it’s a win for Samsung. However, the core technical premise is broken. What they are showing is that you can get 2 GB/s by RAIDing 24 SSDs together. This is unquestionably true. However, we can get 2 GB/s by RAIDing together 17 Seagate Barracuda 7200.11s (big, cheap, slow hard drives) at considerably lower cost. The 24 SSDs will produce awe-inspiring random I/O performance and not particularly interesting sequential performance. 24 SSDs is not the cheapest way to get 2 GB/s of sequential I/O.
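The price comparison is easy to rough out. The per-device bandwidth and price figures below are my own illustrative 2009-era assumptions, not numbers from the video:

```python
import math

TARGET_BW = 2000.0  # MB/s of aggregate sequential bandwidth

# Hypothetical 2009-era figures: sequential MB/s per device and $ per device.
ssd = {"bw": 85.0, "price": 400.0}    # consumer SSD
hdd = {"bw": 120.0, "price": 130.0}   # Barracuda-class SATA drive


def cost_for_bandwidth(dev):
    """Devices needed to hit TARGET_BW sequentially, and their total cost."""
    drives = math.ceil(TARGET_BW / dev["bw"])
    return drives, drives * dev["price"]


ssd_n, ssd_cost = cost_for_bandwidth(ssd)   # 24 drives
hdd_n, hdd_cost = cost_for_bandwidth(hdd)   # 17 drives
print(f"SSDs: {ssd_n} drives, ${ssd_cost:,.0f}")
print(f"HDDs: {hdd_n} drives, ${hdd_cost:,.0f}")
```

Under these assumed figures the disk array delivers the same sequential bandwidth at roughly a quarter of the cost, which is the point: sequential MB/s per dollar favors disk, while random IOPS per dollar favors the SSD.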


Both Samsung and Intel have excellent price-performing SSDs, and both can produce great random IOPS/$. There are faster SSDs out there (e.g. Fusion-io), but the Samsung and Intel components are better price/performers, and that’s the metric that really matters. However, none of them are good price/performers on pure sequential workloads, and yet that’s how I see them described, and that’s the basis for many purchasing decisions.


See Annual Fully Burdened Cost of Power for a quick analysis of when an SSD can be a win based upon power savings and IOPS/$.


Conclusion: If the workload is large and sequential, use a hard disk drive. If it’s hot and random, consider an SSD-based solution.




Thanks to Sean James of Microsoft for sending the video my way.




Tuesday, March 10, 2009 5:36:30 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
 Saturday, March 07, 2009

In the current ACM SIGCOMM Computer Communications Review, there is an article on data center networking, Cost of a Cloud: Research Problems in Data Center Networks by Albert Greenberg, David Maltz, Parveen Patel, and myself.


Abstract: The data centers used to create cloud services represent a significant investment in capital outlay and ongoing costs. Accordingly, we first examine the costs of cloud service data centers today. The cost breakdown reveals the importance of optimizing work completed per dollar invested. Unfortunately, the resources inside the data centers often operate at low utilization due to resource stranding and fragmentation. To attack this first problem, we propose (1) increasing network agility, and (2) providing appropriate incentives to shape resource consumption. Second, we note that cloud service providers are building out geo-distributed networks of data centers. Geo-diversity lowers latency to users and increases reliability in the presence of an outage taking out an entire site. However, without appropriate design and management, these geo-diverse data center networks can raise the cost of providing service. Moreover, leveraging geo-diversity requires services be designed to benefit from it. To attack this problem, we propose (1) joint optimization of network and data center resources, and (2) new systems and mechanisms for geo-distributing state.


Direct link to the paper:  (6 pages)






Saturday, March 07, 2009 2:14:59 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Wednesday, March 04, 2009

Yesterday Amazon Web Services announced availability of Windows and SQL Server under Elastic Compute Cloud (EC2) in the European region. Running in the EU is important for workloads that need to be near customers in that region or that operate on data that needs to stay in region. The AWS Management Console has been extended to support EC2 in the EU region. The Management Console supports administration of Linux, UNIX, and Windows systems under Elastic Compute Cloud as well as management of Elastic Block Store and Elastic IP. More details up at:


Also yesterday, Microsoft confirmed Windows Azure Cloud Software Set for Release This Year. The InformationWeek article reports that Azure will be released by the end of the year and that SQL Server Data Services will include some relational database capabilities. Details are expected at MIX in Las Vegas this March.


The utility computing world continues to evolve incredibly quickly.






Wednesday, March 04, 2009 6:01:49 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Tuesday, March 03, 2009

Earlier this evening I attended the Washington Technology Industry Association event Scaling into the Cloud with Amazon Web Services. Adam Selipsky, VP of Amazon Web Services, gave an overview of AWS and was followed by two AWS customers, each of whom talked about their services and how they use AWS. My rough notes follow.


Adam Selipsky, VP Amazon Web Services

·         490k registered developers

·         Amazon is primarily a technology company.

o   Started experimenting with web services in 2002

o   Each page in the Amazon retail web site makes calls to 200 to 300 services prior to rendering

·         AWS design principles:

o   Reliability

o   Scalability

o   Low-latency

o   Easy to use

o   Inexpensive

·         Enterprises have to provision to peak – 10 to 15% utilization is a pretty common number

·         Amazon web services:

o   Simple Storage Service, Elastic Compute Cloud, SimpleDB, CloudFront, SQS, Flexible Payment Service, & Mechanical Turk

·         SimpleDB: 80/20 rule – most customers don’t need much of the functionality of relational systems most of the time

·         What were the biggest surprises over the last three years:

o   Growth:

§  AWS Developers: 160k in 2006 to 490k in 2008

§  S3 Objects Stored: 200m in 2006 to 40B in 2008

§  S3 Peak request rate: 70k/s

o   Diverse use cases: web site/app hosting, media distribution, storage, backup, disaster recovery, content delivery, HPC, & S/W Dev & Test

o   Diverse customers: Enterprise to well funded startups to individuals

o   Partners: IBM, Oracle, SalesForce, Capgemini, MySQL, Sun, & RedHat

·         Customer technology investment:

o   30% focused on business

o   70% focused on infrastructure

·         AWS offloads this investment in infrastructure and allows time and capital invested into your business rather than infrastructure.

o   Lowers costs

o   Faster to market

o   More efficient use of capital

·         Trends being seen by AWS:

o   Multiple services

o   Enterprise adoption

o   Massive datasets and large-scale parallel processing

o   Increased need for support and transparency so customers know what’s happening in the infrastructure:

§  Service health dashboard

§  Premium developer support

o   Running more sophisticated software in AWS

·         Animoto case study

o   Steady state of about 50 EC2 instances

o   Within 3 days they spiked to 5000 EC2 instances


Smartsheet: Todd Fasullo

·         Not just an online spreadsheet.  Leverage the spreadsheet paradigm but focused on collaboration

·         Hybrid model AWS and private infrastructure

·         Use CloudFront CDN to get javascript and static content close to users

·         Benefits & savings:

o   S3: 5% of the cost of our initial projects from existing hosting provider

o   CloudFront: <1% cost of traditional CDN

o   No sales negotiations

Picnik and AWS: Mike Harrington

·         Photo-editing awesomeness

·         Built-in editor on Flickr

·         Facebook application

·         About Picnik:

o   Founded in 2005

o   Based in Seattle

o   16 employees

o   No VC

·         Flash based application

·         9m unique visitors per month

·         Hybrid model where base load is internally provided and everything above base load is EC2 hosted.

·         Heavy use of S3




Tuesday, March 03, 2009 8:08:35 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
 Sunday, March 01, 2009

I collect postings on high-scale service architectures, scaling war stories, and interesting implementation techniques. For past postings see Scaling Web Sites and Scaling LinkedIn.


Last week Bret Taylor posted an interesting description of the FriendFeed backend storage architecture: How FriendFeed uses MySQL to store Schema-less Data. FriendFeed faces a subset of what I’ve called the hairball problem. The short description of this issue is that social networking sites need to be able to access per-user information both by user and also by searching the data to find the users. For example, consider group membership: sometimes we will want to find the groups that user X is a member of, and other times we’ll want to find a given group and all the users who are members of it. If we partition by user, one access pattern is supported well. If we partition by group, then the other works well. The hairball problem shows up in many domains – I just focus on social networks as the problem is so common there – see Scaling LinkedIn.


Common design patterns to work around the hairball are: 1) application maintained, asynchronous materialized views, 2) distributed in-memory caching of alternate search paths, and 3) central in-memory caching.  LinkedIn is a prototypical example of central in-memory caching. Facebook is the prototypical example of distributed in-memory caching using memcached.  And, FriendFeed is a good example of the first pattern, application maintained, async materialized views.


In Bret’s How FriendFeed uses MySQL to store Schema-less Data, he describes how FriendFeed manages the hairball problem. Data is stored in a primary table sharded over the farm. The primary table can be efficiently accessed on whatever its key is. To access the same data by a different dimension, they would have to search every shard individually. To avoid this, they create a secondary table with the appropriate search key, where the “data” is just the primary key of the primary table. To find entities with some secondary property, they first search the secondary table to get the qualifying entity IDs and then fetch the entities from the primary table.
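A minimal sketch of the primary/secondary-table pattern, using SQLite in place of FriendFeed’s sharded MySQL (table and column names are mine, not theirs); the re-check at the end reapplies the search predicate in case the index lags the primary:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
# Primary table: opaque JSON bodies keyed by entity id.
db.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body TEXT)")
# Secondary "index" table: just the search key and the primary key.
db.execute("CREATE TABLE index_user_id (user_id TEXT, entity_id TEXT)")


def put(entity):
    db.execute("INSERT INTO entities VALUES (?, ?)",
               (entity["id"], json.dumps(entity)))
    db.execute("INSERT INTO index_user_id VALUES (?, ?)",
               (entity["user_id"], entity["id"]))


def find_by_user(user_id):
    # Step 1: look up candidate entity ids in the secondary table.
    ids = [r[0] for r in db.execute(
        "SELECT entity_id FROM index_user_id WHERE user_id = ?", (user_id,))]
    # Step 2: fetch the full entities from the primary table.
    rows = [json.loads(r[0]) for id_ in ids for r in db.execute(
        "SELECT body FROM entities WHERE id = ?", (id_,))]
    # Step 3: reapply the predicate as a residual, since the index
    # may point at entities that no longer qualify.
    return [e for e in rows if e.get("user_id") == user_id]


put({"id": "e1", "user_id": "alice", "text": "hello"})
assert [e["id"] for e in find_by_user("alice")] == ["e1"]
```

In the real system the two tables live on different shards and are updated non-atomically, which is exactly why step 3 is needed.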


Primary and secondary tables are not updated atomically – that would require two-phase commit, the protocol Pat Helland jokingly refers to as the anti-availability protocol. Since the primary and secondary tables are not updated atomically, a secondary index may point to a primary row that no longer qualifies, and some primaries that do qualify may not be found if the secondary hasn’t yet been updated. The latter is simply a reality of this technique, and the application has to be tolerant of this short-lived data-integrity anomaly. The former problem can be solved by reapplying the search predicate as a residual (a common RDBMS implementation technique).


The FriendFeed system described in Bret Taylor’s post also addresses the schema change problem. Schema changes can be disruptive, and some RDBMSs implement schema change incredibly inefficiently. This, by the way, is completely unnecessary – the solution is well known – but bad implementations persist. The FriendFeed technique for dealing with the schema change issue is arguably a bit heavy handed: they simply don’t show the schema to MySQL and, instead, use it as a key-value store where the values are either JSON objects or Python dictionaries.




Thanks to Dave Quick for pointing me to the FriendFeed posting.




Sunday, March 01, 2009 9:01:10 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
 Friday, February 27, 2009

Yesterday I presented Service Design Best Practices at an internal Amazon talk series called Principals of Amazon. This talk series is very similar to the weekly Microsoft Enterprise Computing Series that I hosted for 8 years at Microsoft (also an internal series). Ironically, both series were started by Pat Helland, who is now back at Microsoft.


None of the talk content is Amazon internal so I posted the slides at: 


It’s an update of an earlier talk first presented at LISA 2007:

·         Talk:

·         Paper:






Friday, February 27, 2009 6:25:33 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
 Thursday, February 26, 2009

Google has announced that the App Engine free quota resources will be reduced, and pricing has been announced for greater-than-free-tier usage. The reduction in the free tier will be effective 90 days after the February 24th announcement and reduces CPU and bandwidth allocations by the following amounts:


·         CPU time free tier reduced to 6.4 hours/day from 46 hours/day

·         Bandwidth free tier reduced to 1 GB/day from 10 GB/day


Also announced February 24th is the charge structure for usage beyond the free-tier:

  • $0.10 per CPU core hour. This covers the actual CPU time an application uses to process a given request, as well as the CPU used for any Datastore usage.
  • $0.10 per GB bandwidth incoming, $0.12 per GB bandwidth outgoing. This covers traffic directly to/from users, traffic between the app and any external servers accessed using the URLFetch API, and data sent via the Email API.
  • $0.15 per GB of data stored by the application per month.
  • $0.0001 per email recipient for emails sent by the application
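Applying the posted rates, an expected monthly bill is simple arithmetic; a hypothetical calculator (the usage figures in the example are invented):

```python
# Posted App Engine rates for beyond-free-tier usage.
RATES = {
    "cpu_core_hour": 0.10,
    "gb_in": 0.10,
    "gb_out": 0.12,
    "gb_stored_month": 0.15,
    "email_recipient": 0.0001,
}


def monthly_cost(cpu_hours, gb_in, gb_out, gb_stored, emails):
    """Sum each metered dimension times its rate."""
    return (cpu_hours * RATES["cpu_core_hour"]
            + gb_in * RATES["gb_in"]
            + gb_out * RATES["gb_out"]
            + gb_stored * RATES["gb_stored_month"]
            + emails * RATES["email_recipient"])


# e.g. 500 CPU hours, 50 GB in, 100 GB out, 20 GB stored, 10k emails:
print(f"${monthly_cost(500, 50, 100, 20, 10_000):.2f}")  # → $71.00
```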




Thursday, February 26, 2009 6:41:29 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
 Wednesday, February 25, 2009

This morning Alyssa Henry did the keynote at the USENIX File and Storage Technologies (FAST) Conference. Alyssa is General Manager of Amazon Simple Storage Service. She kicked off the talk by announcing that S3 now has 40B objects under management, which is nearly 3x what was stored in S3 at this time last year. The remainder of the talk focused first on design goals and then got into techniques used.


Design goals:

·         Durability

·         Availability

·         Scalability

·         Security

·         Performance

·         Simplicity

·         Cost effectiveness


Techniques used:

·         Redundancy

·         Retry

·         Surge protection

·         Eventual consistency

·         Routine testing of failure modes

·         Diversity of s/w, h/w, & workloads

·         Data scrubbing

·         Monitoring

·         Auto-management


The talk: AlyssaHenry_FAST_Keynote.pdf (729.04 KB)


Wednesday, February 25, 2009 11:34:19 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Monday, February 23, 2009

Building Scalable Web Apps with Google App Engine was presented by Brett Slatkin of Google at Google I/O 2008. The link above points to the video, but Todd Hoff of High Scalability summarized the presentation in a great post, Numbers Everyone Should Know.


The talk mostly focused on Google App Engine and how to use it. For example, Brett shows how to implement a scalable counter and (nearly) ordered comments using the App Engine Megastore. For the former, shard the counter to get write scale and sum the shards on read.
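The sharded-counter technique can be illustrated without the App Engine datastore; a plain in-memory Python sketch (the class and shard count are mine, not Megastore API calls):

```python
import random


class ShardedCounter:
    """Spread increments across N shards so concurrent writers don't
    contend on a single row; read by summing all shards."""

    def __init__(self, num_shards=20):
        self.shards = [0] * num_shards

    def increment(self):
        # Pick a random shard; in the datastore each shard is its own
        # entity, so concurrent writers rarely collide on the same row.
        idx = random.randrange(len(self.shards))
        self.shards[idx] += 1

    def value(self):
        # Reads are rarer than writes for a counter, so summing all
        # shards on read is the cheap side of the trade-off.
        return sum(self.shards)


c = ShardedCounter()
for _ in range(1000):
    c.increment()
assert c.value() == 1000
```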


Also included in the presentation were some general rules of thumb from Jeff Dean of Google. Rules of thumb are good because they tell us what to expect and, when we see something different, they tell us to pay attention and look more closely. When we see an exception, either our rule of thumb has just been proven wrong and we’ve learned something, or the data we’re looking at is wrong and we need to dig deeper. Either one is worth noticing. I use rules of thumb all the time, not as a way of understanding the world (they are sometimes wrong) but as a way of knowing where to look more closely.


Check out Todd’s post:






Monday, February 23, 2009 6:13:41 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
 Sunday, February 22, 2009

Richard Jones has compiled an excellent list of key-value stores in Anti-RDBMS: A list of key-value stores.


In this post, Richard looks at Project Voldemort, Ringo, Scalaris, Kai, Dynomite, MemcacheDB, ThruDB, CouchDB, Cassandra, HBase, and Hypertable. His conclusion is that Project Voldemort has the most promise, with Scalaris a close second; Dynomite is also interesting.






Sunday, February 22, 2009 7:43:13 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Saturday, February 21, 2009

Back in the early 90’s I attended the High Performance Transaction Systems (HPTS) workshop for the first time. I loved it. It’s on the ocean just south of Monterey, and some of the best in both industry and academia show up to attend the small, single-track conference. It’s invitational and kept small so it can be interactive. There are lots of discussions during the sessions, everyone eats together, and debates & discussions rage into the night. It’s great.


The conference was originally created by Jim Gray and friends with the goal of breaking the 1,000-transactions-per-second barrier – at the time, a lofty goal. Over the years it has morphed into a general transaction processing and database conference, and then again into a high-scale services get-together. The sessions I most like today are those from leaders at eBay, Amazon, Microsoft, Google, etc. talking about very high scale services and how they work.


The next HPTS is October 26 through 28, 2009, and I’ll be there again this year. Consider attending; it’s a great conference.





Saturday, February 21, 2009 8:01:00 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Thursday, February 19, 2009

Earlier today I presented Where Does the Power Go and What to Do About It at the Western Washington chapter of AFCOM. I basically presented the work I wrote up in the CIDR paper The Case for Low-Cost, Low-Power Servers.


The slides are at: JamesHamilton_AFCOM2009.pdf (1.22 MB).


The general thesis of the talk is that improving data center efficiency by a factor of 4 to 5 is well within reach without substantial innovation or design risk.






Thursday, February 19, 2009 4:56:49 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
 Sunday, February 15, 2009

Service billing is hard. It’s hard to get invoicing and settlement overheads low. And billing is often one of the last and least-thought-of components of a for-fee online service. Billing at low overhead and high scale takes engineering, and this often doesn’t get attention until after the service beta period. During a beta period, you really don’t want to be working out only the service kinks. If you have a for-fee service or up-sell, then you should be beta testing the billing system and the business model at the same time as you beta test the service itself. It’s hard to get all three right, so get all three into beta testing as early as possible.


Billing being hard is not news. The first notable internet service billing issue I recall was back in 1997, when MSN was unable to scale its billing system and collect from users. Services weren’t interrupted, but revenue certainly was. Losses at the time were estimated to be more than $22m.


One way to solve the problem of efficient, reliable, low-overhead billing is to use a service that specializes in billing. It was recently announced that Microsoft Online Services (which includes Exchange Online, SharePoint Online, Office Communicator Online, and Office Live Meeting) has decided to use MetraTech as its billing and partner settlement system. The scope of the partnership and whether it includes all geographies is not clear from the press release: Microsoft Online Services Utilizes MetraTech’s Billing and Partner Settlement Solution.


I suspect we’ll see more and more sub-service categories popping up over time, and the pure own-the-entire-stack, vertically integrated services model will be used only by the very largest services.






Sunday, February 15, 2009 6:45:16 PM (Pacific Standard Time, UTC-08:00)  #    Comments [12] - Trackback
 Friday, February 13, 2009

Patterson, Katz, and the rest of the research team from Berkeley have an uncanny way of spotting a technology trend or opportunity early. Redundant Arrays of Inexpensive Disks (RAID) and Reduced Instruction Set Computing (RISC) are two particularly notable research contributions from this team, amongst numerous others. Yesterday, the Berkeley Reliable Adaptive Distributed Systems Lab published Above the Clouds: A Berkeley View of Cloud Computing.


The paper argues that the time has come for utility computing and the move to the clouds will be driven by large economies of scale, the illusion of near infinite resources available on demand, the conversion of capital expense to operational expense, the ability to use resources for short periods of time, and the savings possible by statistically multiplexing a large and diverse workload population.


Paper: Above the Clouds:




If I were running an IT shop today, whether at a startup or a large enterprise, I would absolutely have some of my workloads running in the cloud. This paper is worth reading and understanding.






Friday, February 13, 2009 7:04:32 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Thursday, February 12, 2009

Yesterday, IBM announced it is offering access to IBM Software in the Amazon Web Services Cloud. IBM products now offered for use in the Amazon EC2 environment include:

  • DB2 Express-C 9.5
  • Informix Dynamic Server Developer Edition 11.5
  • WebSphere Portal Server and Lotus Web Content Management Standard Edition
  • WebSphere sMash

The IBM approach to utility computing offers considerable licensing flexibility with three models: 1) Development AMIs (Amazon Machine Image), 2) Production AMIs, and 3) Bring your own license.

Development AMIs are available from IBM today for testing, education, development, demonstration, and other non-commercial uses, at no cost beyond the standard Amazon EC2 charges.

Production AMIs are available for production commercial application use with pay-as-you-go pricing allowing the purchase of these software offerings by the hour.

Bring your own License: Some existing IBM on-premise licenses can be used in Amazon EC2. See  PVUs required for Amazon Elastic Compute Cloud for more detail.

The IBM offering of buy-the-hour software pricing with the Production AMIs is 100% the right model for customers and it is where I expect the utility computing world as a whole will end up fairly quickly. Pay-as-you-go, hourly pricing is the model that offers customers the most flexibility where software and infrastructure costs scale in near real-time with usage.
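A quick back-of-envelope comparison shows why hourly pricing is attractive for anything short of sustained, full-time use. The prices below are hypothetical, purely for illustration:

```python
# Hypothetical prices: $0.50/hour for a pay-as-you-go software AMI
# versus a $2,000 perpetual license (EC2 charges are the same either way).
hourly_rate = 0.50
perpetual_license = 2000.00

def pay_as_you_go_cost(hours_used):
    return hourly_rate * hours_used

# Break-even point: below this usage, hourly pricing wins.
break_even_hours = perpetual_license / hourly_rate
print(break_even_hours)  # 4000.0 hours

# A test cluster running 8 hours/day, 20 days/month, for 6 months:
usage = 8 * 20 * 6                  # 960 hours
print(pay_as_you_go_cost(usage))    # 480.0 -- well under the license cost
```

For spiky or short-lived workloads, costs scale down with usage the same way the underlying infrastructure charges do; only steady 24x7 workloads favor the perpetual license.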

I like the bring your own license model in that it supports moving workload back and forth between on-premise and the cloud, and supports moving portions of an enterprise IT infrastructure to utility computing with less licensing complexity and less friction.

More data from IBM at the DeveloperWorks Cloud Computing Resource Center and from Amazon at IBM and AWS.


Thursday, February 12, 2009 7:08:03 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Wednesday, February 11, 2009

Over the years, I’ve noticed that most DoS attacks are actually friendly fire. Many times I’ve gotten calls from our Ops Manager saying that the X data center is under heavy attack and we’re rerouting traffic to the Y DC, only to learn later that the “attack” was actually a mistake on our end.  There is no question that there are bad guys out there sourcing attacks, but internal sources of network overrun are far more common.


Yesterday, kdawson posted a wonderful example on Slashdot from SourceForge Chief Network Engineer Uriah Welcome titled “from the disturbances in the fabric department”:


Excerpted from the post: SourceForge.net was unreachable for about 75 minutes this evening. What we had was indeed a DoS, however it was not externally originating. What I saw was a massive amount of traffic going across the core switches; by massive I mean 40 Gbit/sec. Through the process of elimination I was finally able to isolate the problem down to a pair of switches. I fully believe the switches in that cabinet are still sitting there attempting to send 20 Gbit/sec of traffic out trying to do something — I just don't know what yet


As in all things software related, it’s best to start with the assumption that it’s your fault and proceed with diagnosis on that basis until proven otherwise.


Thanks to Patrick Niemeyer for sending this one my way.




Wednesday, February 11, 2009 5:37:32 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
 Tuesday, February 10, 2009

Earlier this week, Microsoft announced the delay of the Chicago and Dublin data centers (Microsoft will open Dublin and Chicago Data Centers as Customer Demand Warrants). A few weeks ago, the Des Moines data center delay was announced. Arne Josefsberg and Mike Manos announced these delays in their Building a Better Mousetrap, a.k.a. Optimizing for Maximum Efficiency in an Economic Downturn blog posting.


This is a good, fiscally responsible decision given the current tough economic conditions.  It’s the right time to be slowing down infrastructure investments. But what surprises me is the breadth of the spread between planned expansion and currently expected Microsoft internal demand.  That’s at least surprising and borders on amazing. Let’s look more closely. Chicago has been estimated to be in the 60MW range (30MW to 88MW for the half of the facility that is containerized): First Containerized Data Center Announcement.  Des Moines was announced as a $500M facility. I’m assuming that number covers both infrastructure and IT equipment, so taking the servers out would make it roughly a $200M infrastructure investment. That would make it roughly a 15MW critical-load facility. Dublin was announced as a $500M facility as well, so, using the same logic, it’ll be at or very near 15MW of critical load.


That means a booming 90MW of facility critical load has been delayed over the last 30 days. That is a prodigious difference between planned supply and realized demand.  I’ve long said that capacity planning is somewhere between a black art and pure voodoo, and this is perhaps the best example I’ve seen so far.
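The arithmetic behind that 90MW figure can be reconstructed explicitly. All the inputs below are the estimates and assumptions from this post (the ~$13.3M-per-MW figure is implied by the $200M/15MW assumption), not published Microsoft numbers:

```python
# Back-of-envelope reconstruction of the delayed-capacity estimate.
chicago_mw = 60                     # estimated, mid-range of 30MW-88MW

# Des Moines and Dublin: $500M announced, assume ~$200M is infrastructure
# (servers excluded), at an assumed ~$13.3M per MW of critical load.
infrastructure_cost = 200e6
cost_per_mw = 13.3e6
des_moines_mw = round(infrastructure_cost / cost_per_mw)   # ~15MW
dublin_mw = round(infrastructure_cost / cost_per_mw)       # ~15MW

total_delayed = chicago_mw + des_moines_mw + dublin_mw
print(total_delayed)  # 90 MW of delayed critical load
```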


We all knew that the tough economy was going to impact all aspects of the services world, and the Microsoft announcement is a wake-up call for all of us to stare hard at our planned infrastructure investments and capacity plans and make sure they are credible. I suspect we’re heading into another period like post-2000 when data center capacity is widely available and prices are excellent. Hats off to Mike and Arne from Microsoft for continuing to be open and sharing their decisions broadly. It’s good for the industry.


Across the board, we all need to be looking hard at our build-out schedules.






Tuesday, February 10, 2009 6:47:31 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
 Saturday, February 07, 2009

Last July, Facebook released Cassandra to open source under the Apache license: Facebook Releases Cassandra as Open Source.  Facebook uses Cassandra as its email search system where, as of last summer, they had 25TB of data and over 100m mailboxes. This video gets into more detail on the architecture and design: My notes are below if you don’t feel like watching the video.

• Authors:
  • Prashant Malik
  • Karthik Ranganathan
  • Avinash Lakshman
• Structured storage system over P2P (keys are consistent-hashed over servers)
• Initially aimed at the email inbox search problem
• Design goals:
  • Cost effective
  • Highly available
  • Incrementally scalable
  • Efficient data layout
  • Minimal administration
• Why Cassandra:
  • MySQL drives too many random I/Os
  • File-based solutions require far too many locks
• What is Cassandra:
  • Structured storage over a distributed cluster
  • Redundancy via replication
  • Supports append/insert without reads
  • Supports a caching layer
  • Supports Hadoop operations
• Cassandra architecture:
  • Core Cassandra services:
    • Messaging (async, non-blocking)
    • Failure detector
    • Cluster membership
    • Partitioning scheme
    • Replication strategy
  • Cassandra middle layer:
    • Commit log
    • Mem-table
    • Compactions
    • Hinted handoff
    • Read repair
    • Bootstrap
  • Cassandra top layer:
    • Key, block, & column indexes
    • Read consistency
    • Touch cache
    • Cassandra API
    • Admin API
  • Above the top layer:
    • Tools
    • Hadoop integration
    • Search API and routing
• Cassandra data model:
  • Key (uniquely specifies a “row”)
    • Any arbitrary string
  • Column families are declared or deleted in advance by administrative action
    • Columns can be added or deleted dynamically
    • Column families have attributes:
      • Name: arbitrary string
      • Type: simple,
  • A key can “contain” multiple column families
    • No requirement that two keys have any overlap in columns
  • Columns can be added or removed arbitrarily from column families
  • Columns:
    • Name: arbitrary string
    • Value: non-indexed blob
    • Timestamp (client provided)
  • Column families have sort orders
    • Time-based sort or name-based sort
  • Super-column families:
    • Bigtable calls them locality groups
    • Super-column families have a sort order
    • Essentially a multi-column index
  • System column families
    • For internal use by Cassandra
  • Example from the email application:
    • Mail-list (sorted by name)
      • All mail that includes a given word
    • Thread-list (sorted by time)
      • All threads that include a given word
    • User-list (sorted by time)
      • All mail that includes a given user
• Cassandra API:
  • Simple get/put model
• Write model:
  • Quorum write or async mode (used by the email application)
  • Async: send the request to any node
    • That node pushes the data to the appropriate nodes but returns to the client immediately
  • Quorum write:
    • Blocks until quorum is reached
  • If a node is down, write to another node with a hint saying where it should have been written
    • A harvester goes through every 15 minutes, finds hints, and moves the data to the appropriate node
  • At write time, you first write to a commit log (sequential)
    • After the write to the log, it is sent to the appropriate nodes
    • Each node receiving the write first records it in a local log
      • Then makes updates to the appropriate memtables (1 for each column family)
    • Memtables are flushed to disk when:
      • Out of space
      • Too many keys (128 is the default)
      • Time duration (client provided – no cluster clock)
    • When memtables are written out, two files go out:
      • Data file
      • Index file
        • Key, offset pairs (point into the data file)
        • Bloom filter (all keys in the data file)
    • When a commit log has had all its column families pushed to disk, it is deleted
• Data files accumulate over time.  Periodically, data files are merge-sorted into a new file (which creates a new index)
• Write properties:
  • No locks in the critical path
  • Sequential disk access only
  • Behaves like a write-through cache
    • If you read from the same node, you see your own writes.  Doesn’t appear to provide any guarantee that a read sees the latest change in the failure case
  • Atomicity guaranteed for a key
  • Always writable
• Read path:
  • Connect to any node
  • That node routes to the closest data copy, which services the read immediately
  • If high consistency is required, don’t return from the local copy immediately:
    • First send a digest request to all replicas
    • If a delta is found, the updates are sent to the nodes that don’t have current data (read repair)
• Replication supported via multiple consistent hash rings:
  • Servers are hashed over the ring
  • Keys are hashed over the ring
  • Redundancy via walking around the ring and placing the copy on the next node (rack-position unaware), on the next node on a different rack (rack aware), or on the next node in a different data center (the implication being that the ring can span data centers)
• Cluster membership:
  • Cluster membership and failure detection via gossip protocol
• Accrual failure detector:
  • Default sets PHI to 5 in Cassandra
  • Detection is 10 to 15 seconds with PHI=5
• UDP for control messages and TCP for data messages
• Complies with Staged Event-Driven Architecture (SEDA)
• Email system:
  • 100m users
  • 4B threads
  • 25TB with 3x replication
  • Uses and joins across 4 tables:
    • Mailbox (user_id to thread_id mapping)
    • Msg_threads (thread to subject mapping)
    • Msg_store (thread to message mapping)
    • Info (user_id to user name mapping)
• Able to load using Hadoop at 1.5TB/hour
  • Can load 25TB at network bandwidth over the Cassandra cluster
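The consistent-hashing placement described in the notes can be sketched in a few lines. This is a simplified illustration of the general technique (hash servers and keys onto a ring, then walk clockwise for replicas), not Cassandra’s actual partitioner; the node names and the use of MD5 are my own choices for the example, and it is rack-position unaware:

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Map a string to a point on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        # Each server occupies a point on the ring, sorted by hash.
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def nodes_for(self, key):
        """Primary = first node clockwise from the key's hash position;
        replicas = the next distinct nodes walking around the ring."""
        idx = bisect.bisect(self.points, (ring_hash(key), ""))
        chosen = []
        while len(chosen) < self.replicas:
            _, node = self.points[idx % len(self.points)]
            if node not in chosen:
                chosen.append(node)
            idx += 1
        return chosen

ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
print(ring.nodes_for("user1234"))  # three distinct nodes, deterministic per key
```

Because both servers and keys hash onto the same ring, adding or removing a node only moves the keys adjacent to it, which is what makes the system incrementally scalable.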




Saturday, February 07, 2009 11:16:20 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

All Content © 2015, James Hamilton