Thursday, February 26, 2009

Google has announced that the App Engine free quotas will be reduced and has published pricing for usage beyond the free tier. The free-tier reduction takes effect 90 days after the February 24th announcement and changes the CPU and bandwidth allocations as follows:

 

·         CPU time free tier reduced to 6.4 hours/day from 46 hours/day

·         Bandwidth free tier reduced to 1 GB/day from 10 GB/day

 

Also announced on February 24th is the charge structure for usage beyond the free tier:

  • $0.10 per CPU core hour. This covers the actual CPU time an application uses to process a given request, as well as the CPU used for any Datastore usage.
  • $0.10 per GB bandwidth incoming, $0.12 per GB bandwidth outgoing. This covers traffic directly to/from users, traffic between the app and any external servers accessed using the URLFetch API, and data sent via the Email API.
  • $0.15 per GB of data stored by the application per month.
  • $0.0001 per email recipient for emails sent by the application
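
As a rough illustration of how these numbers combine, the sketch below estimates a daily charge for CPU and bandwidth using the reduced free quotas and the rates listed above. The function names and the sample usage figures are hypothetical, and I am assuming the 1 GB/day free bandwidth applies to each direction separately; storage and email charges are left out.

```python
# Rates and post-reduction free quotas as quoted in this post.
CPU_RATE = 0.10        # dollars per CPU core hour
BW_IN_RATE = 0.10      # dollars per GB incoming
BW_OUT_RATE = 0.12     # dollars per GB outgoing
FREE_CPU_HOURS = 6.4   # per day
FREE_BW_GB = 1.0       # per day; ASSUMPTION: applied to each direction separately


def billable(used, free):
    """Usage beyond the free allowance, never negative."""
    return max(0.0, used - free)


def daily_charge(cpu_hours, gb_in, gb_out):
    """Estimated daily charge in dollars for CPU and bandwidth only."""
    return (billable(cpu_hours, FREE_CPU_HOURS) * CPU_RATE
            + billable(gb_in, FREE_BW_GB) * BW_IN_RATE
            + billable(gb_out, FREE_BW_GB) * BW_OUT_RATE)


# Hypothetical day: 10 CPU hours, 3 GB in, 5 GB out.
print(f"${daily_charge(10, 3, 5):.2f}")
```

An application that stays inside the free quotas on every resource is charged nothing under this model.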

--jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

Thursday, February 26, 2009 6:41:29 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Services
 Wednesday, February 25, 2009

This morning Alyssa Henry gave the keynote at the USENIX Conference on File and Storage Technologies (FAST). Alyssa is General Manager of Amazon Simple Storage Service. She kicked off the talk by announcing that S3 now has 40B objects under management, nearly 3x what was stored in S3 at this time last year. The remainder of the talk focused first on design goals and then got into the techniques used.

 

Design goals:

·         Durability

·         Availability

·         Scalability

·         Security

·         Performance

·         Simplicity

·         Cost effectiveness

 

Techniques used:

·         Redundancy

·         Retry

·         Surge protection

·         Eventual consistency

·         Routine testing of failure modes

·         Diversity of s/w, h/w, & workloads

·         Data scrubbing

·         Monitoring

·         Auto-management

 

The talk: AlyssaHenry_FAST_Keynote.pdf (729.04 KB)


Wednesday, February 25, 2009 11:34:19 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Monday, February 23, 2009

Building Scalable Web Apps with Google App Engine was presented by Brett Slatkin of Google at Google I/O 2008. The link above points to the video, but Todd Hoff of High Scalability summarized the presentation in a great post, Numbers Everyone Should Know.

 

The talk mostly focused on Google App Engine and how to use it. For example, Brett shows how to implement a scalable counter and (nearly) ordered comments using App Engine Megastore. For the former, shard the counter to get write scale and sum the shards on read.
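
To make the sharded counter idea concrete, here is a minimal in-memory sketch. It does not use the App Engine datastore API; the class name, the default shard count, and the plain Python list standing in for separate datastore entities are all assumptions for illustration.

```python
import random


class ShardedCounter:
    """In-memory stand-in for a counter split across several stored entities."""

    def __init__(self, num_shards=20):
        self.shards = [0] * num_shards

    def increment(self, amount=1):
        # Pick a shard at random so concurrent writers rarely hit the same one.
        index = random.randrange(len(self.shards))
        self.shards[index] += amount

    def value(self):
        # Reads pay the cost: every shard is summed.
        return sum(self.shards)


counter = ShardedCounter(num_shards=8)
for _ in range(1000):
    counter.increment()
print(counter.value())
```

The trade is more work on read in exchange for spreading write contention over many shards.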

 

Also included in the presentation were some general rules of thumb from Jeff Dean of Google. Rules of thumb are good because they tell us what to expect and, when we see something different, they tell us to pay attention and look more closely. When we see an exception, either our rule of thumb has just been proven wrong and we have learned something, or the data we’re looking at is wrong and we need to dig deeper. Either one is worth noticing. I use rules of thumb all the time, not as a way of understanding the world (they are sometimes wrong) but as a way of knowing where to look more closely.

 

Check out Todd’s post: http://highscalability.com/numbers-everyone-should-know.

 

                                                --jrh

 


 

Monday, February 23, 2009 6:13:41 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Sunday, February 22, 2009

Richard Jones of Last.fm has compiled an excellent list of key-value stores in Anti-RDBMS: A list of key-value stores.

 

In this post, Richard looks at Project Voldemort, Ringo, Scalaris, Kai, Dynomite, MemcacheDB, ThruDB, CouchDB, Cassandra, HBase and Hypertable. His conclusion for Last.fm use is that Project Voldemort has the most promise, with Scalaris a close second, and that Dynomite is also interesting.

 

                                                                --jrh

 


 

Sunday, February 22, 2009 7:43:13 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Saturday, February 21, 2009

Back in the early 90’s I attended High Performance Transaction Systems for the first time. I loved it. It’s on the ocean just south of Monterey and some of the best in both industry and academia show up to attend the small, single-track conference. It’s invitational and kept small so it can be interactive. There are lots of discussions during the sessions, everyone eats together, and debates & discussions rage into the night. It’s great.

 

The conference was originally created by Jim Gray and friends with a goal to break the 1,000 transaction/second barrier. At the time, that was a lofty goal. Over the years it has morphed into a general transaction processing and database conference and then again into a high-scale services get-together. The sessions I mostly like today are from leaders at eBay, Amazon, Microsoft, Google, etc. talking about very high scale services and how they work.

 

The next HPTS is October 26 through 28, 2009 and I’ll be there again this year: http://www.eecs.harvard.edu/~margo/HPTS/cfp.html. Consider attending; it’s a great conference.

 

                                                --jrh

 


Saturday, February 21, 2009 8:01:00 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Thursday, February 19, 2009

Earlier today I presented Where Does the Power Go and What to do About it at the Western Washington Chapter of AFCOM. I basically presented the work I wrote up in the CIDR paper: The Case for Low-Cost, Low-Power Servers.

 

The slides are at: JamesHamilton_AFCOM2009.pdf (1.22 MB).

 

The general thesis of the talk is that improving data center efficiency by a factor of 4 to 5 is well within reach without substantial innovation or design risk.

 

                                                                --jrh

 


 

Thursday, February 19, 2009 4:56:49 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Sunday, February 15, 2009

Service billing is hard. It’s hard to get invoicing and settlement overheads low, and billing is often one of the last and least considered components of a for-fee online service. Billing at low overhead and high scale takes engineering, and this often doesn’t get attention until after the service beta period. During a service beta period, you really don’t want to be working out only the service kinks. If you have a for-fee service or up-sell, then you should be beta testing the billing system and the business model at the same time as you beta test the service itself. It’s hard to get all three right, so get all three into beta testing as early as possible.

 

Billing being hard is not news. The first notable internet service billing issue I recall was back in 1997 (http://news.cnet.com/MSN-faces-billing-problem/2100-1023_3-230402.html?tag=mncol), when MSN was unable to scale its billing system and collect from users. Services weren’t interrupted but revenue certainly was. Losses at the time were estimated to be more than $22m.

 

One way to solve the problem of efficient, reliable, and low-overhead billing is to use a service that specializes in billing. It was recently announced that Microsoft Online Services (which includes Exchange Online, SharePoint Online, Office Communicator Online, and Office Live Meeting) has decided to use MetraTech as its billing and partner settlement system. The scope of the partnership and whether it includes all geographies is not clear from the press release: Microsoft Online Services Utilizes MetraTech’s Billing and Partner Settlement Solution.

 

I suspect we’ll see more and more sub-service categories popping up over time, and the pure own-the-entire-stack, vertically integrated services model will only be used by the very largest services.

 

                                                --jrh

 


 

Sunday, February 15, 2009 6:45:16 PM (Pacific Standard Time, UTC-08:00)  #    Comments [12] - Trackback
Services
 Friday, February 13, 2009

Patterson, Katz, and the rest of the research team from Berkeley have an uncanny way of spotting a technology trend or opportunity early. Redundant Array of Inexpensive Disks (RAID) and Reduced Instruction Set Computing (RISC) are two particularly notable research contributions from this team amongst numerous others. Yesterday, the Berkeley Reliable Adaptive Distributed Systems Lab published Above the Clouds: A Berkeley View of Cloud Computing.

 

The paper argues that the time has come for utility computing and the move to the clouds will be driven by large economies of scale, the illusion of near infinite resources available on demand, the conversion of capital expense to operational expense, the ability to use resources for short periods of time, and the savings possible by statistically multiplexing a large and diverse workload population.

 

Paper: Above the Clouds: http://d1smfj0g31qzek.cloudfront.net/abovetheclouds.pdf

Presentation: http://d1smfj0g31qzek.cloudfront.net/above_the_clouds.ppt.pdf

Video: http://www.youtube.com/watch?v=IJCxqoh5ep4

 

If I were running an IT shop today, whether at a startup or a large enterprise, I would absolutely have some of my workloads running in the cloud. This paper is worth reading and understanding.

 

                                                                --jrh

 


 

Friday, February 13, 2009 7:04:32 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Thursday, February 12, 2009

Yesterday, IBM announced it is offering access to IBM Software in the Amazon Web Services Cloud. IBM products now offered for use in the Amazon EC2 environment include:

  • DB2 Express-C 9.5
  • Informix Dynamic Server Developer Edition 11.5
  • WebSphere Portal Server and Lotus Web Content Management Standard Edition
  • WebSphere sMash

The IBM approach to utility computing offers considerable licensing flexibility with three models: 1) Development AMIs (Amazon Machine Image), 2) Production AMIs, and 3) Bring your own license.

Development AMIs are available from IBM today for testing, education, development, demonstration, and other non-commercial uses, at no cost beyond the standard Amazon EC2 charges.

Production AMIs are available for production commercial application use with pay-as-you-go pricing allowing the purchase of these software offerings by the hour.

Bring your own License: Some existing IBM on-premise licenses can be used in Amazon EC2. See  PVUs required for Amazon Elastic Compute Cloud for more detail.

The IBM offering of buy-the-hour software pricing with the Production AMIs is 100% the right model for customers and it is where I expect the utility computing world as a whole will end up fairly quickly. Pay-as-you-go, hourly pricing is the model that offers customers the most flexibility where software and infrastructure costs scale in near real-time with usage.

I like the bring your own license model in that it supports moving workload back and forth between on-premise and the cloud, and supports moving portions of an enterprise IT infrastructure to utility computing with less licensing complexity and less friction.

More data from IBM at the DeveloperWorks Cloud Computing Resource Center and from Amazon at IBM and AWS.


Thursday, February 12, 2009 7:08:03 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Wednesday, February 11, 2009

Over the years, I’ve noticed that most DoS attacks are actually friendly fire. Many times I’ve gotten calls from our Ops Manager saying the X data center is under heavy attack and we’re rerouting traffic to the Y DC  only later to learn that the “attack” was actually a mistake on our end.  There is no question that there are bad guys out there sourcing attacks but internal sources of network overrun are far more common.

 

Yesterday, kdawson posted a wonderful example on Slashdot from SourceForge Chief Network Engineer Uriah Welcome titled “from the disturbances in the fabric department”: http://news.slashdot.org/article.pl?sid=09/02/10/044221.

 

Excerpted from the post: Slashdot.org was unreachable for about 75 minutes this evening. What we had was indeed a DoS, however it was not externally originating. What I saw was a massive amount of traffic going across the core switches; by massive I mean 40 Gbit/sec. Through the process of elimination I was finally able to isolate the problem down to a pair of switches. I fully believe the switches in that cabinet are still sitting there attempting to send 20Gbit/sec of traffic out trying to do something — I just don't know what yet.

 

As in all things software related, it’s best to start with the assumption that it’s your fault and proceed with diagnosis on that basis until proven otherwise.

 

Thanks to Patrick Niemeyer for sending this one my way.

 


 

Wednesday, February 11, 2009 5:37:32 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Tuesday, February 10, 2009

Earlier this week Microsoft announced the delay of its Chicago and Dublin data centers (Microsoft will open Dublin and Chicago Data Centers as Customer Demand Warrants). A few weeks ago the Des Moines data center delay was announced (http://www.canadianbusiness.com/markets/market_news/article.jsp?content=D95T2TRG0). Arne Josefsberg and Mike Manos announced these delays in their Building a Better Mousetrap, a.k.a. Optimizing for Maximum efficiency in an Economic Downturn blog posting.

 

This is a good, fiscally responsible decision given the current tough economic conditions. It’s the right time to be slowing down infrastructure investments. But what surprises me is the breadth of the spread between planned expansion and the currently expected Microsoft internal demand. That’s at least surprising and bordering on amazing. Let’s look more closely. Chicago has been estimated to be in the 60MW range (30MW to 88MW for the half of the facility that is containerized): First Containerized Data Center Announcement. Des Moines was announced to be a $500M facility (http://www.datacenterknowledge.com/archives/2009/01/23/microsoft-postpones-iowa-data-center/). I’m assuming that number covers both infrastructure and IT equipment so, taking the servers out, it would be roughly a $200M investment. That would make it a roughly 15MW critical load facility. Dublin was announced as a $500M facility as well (http://www.datacenterknowledge.com/archives/2007/05/16/microsoft-plans-500m-dublin-data-center/) so, using the same logic, it’ll be at or very near 15MW of critical load.

 

That means that a booming 90MW of facilities critical load have been delayed over the last 30 days. That is a prodigious difference between planned supply and realized demand.  I’ve long said that capacity planning was somewhere between a black art and pure voodoo and this is perhaps the best example I’ve seen so far.
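
The back-of-envelope arithmetic behind the 90MW figure can be written out as a quick check. The variable names are mine, and the inputs are just the rough estimates from the paragraphs above, not announced capacities.

```python
# Critical-load estimates from the discussion above, in megawatts.
delayed_mw = {"Chicago": 60, "Des Moines": 15, "Dublin": 15}

# Infrastructure-only investment assumed above for a $500M facility, in $M.
infrastructure_musd = 200

total_mw = sum(delayed_mw.values())
implied_musd_per_mw = infrastructure_musd / delayed_mw["Des Moines"]

print(f"Total delayed critical load: {total_mw} MW")
print(f"Implied infrastructure cost: ${implied_musd_per_mw:.1f}M per MW of critical load")
```

The second number is only what my $200M and 15MW estimates imply together, so treat it as a sanity check on the estimate rather than a measured cost.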

 

We all knew that the tough economy was going to impact all aspects of the services world, and the Microsoft announcement is a wake-up call for all of us to stare hard at our planned infrastructure investments and capacity plans and make sure they are credible. I suspect we’re heading into another period like post-2000 when data center capacity is widely available and prices are excellent. Hats off to Mike and Arne from Microsoft for continuing to be open and sharing their decisions broadly. It’s good for the industry.

 

Across the board, we all need to be looking hard at our build-out schedules.

 

                                                                -jrh

 


 

Tuesday, February 10, 2009 6:47:31 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Services
 Saturday, February 07, 2009

Last July, Facebook released Cassandra to open source under the Apache license: Facebook Releases Cassandra as Open Source. Facebook uses Cassandra as an email search system where, as of last summer, they had 25TB and over 100m mailboxes. This video gets into more detail on the architecture and design: http://www.new.facebook.com/video/video.php?v=540974400803#/video/video.php?v=540974400803. My notes are below if you don’t feel like watching the video.

·         Authors:

o   Prashant Malik

o   Karthik Ranganathan

o   Avinash Lakshman

·         Structured storage system over P2P (keys are consistent hashed over servers)

·         Initially aimed at email inbox search problem

·         Design goals:

o   Cost Effective

o   Highly Available

o   Incrementally Scalable

o   Efficient data layout

o   Minimal administration

·         Why Cassandra

o   MySQL drives too many random I/Os

o   File-based solutions require far too many locks

·         What is Cassandra

o   Structured storage over a distributed cluster

o   Redundancy via replication

o   Supports append/insert without reads

o   Supports a caching layer

o   Supports Hadoop operations

·         Cassandra Architecture

o   Core Cassandra Services:

§  Messaging (async, non-blocking)

§  Failure detector

§  Cluster membership

§  Partitioning scheme

§  Replication strategy

o   Cassandra Middle Layer

§  Commit log

§  Mem-table

§  Compactions

§  Hinted handoff

§  Read repair

§  Bootstrap

o   Cassandra Top Layer

§  Key, block, & column indexes

§  Read consistency

§  Touch cache

§  Cassandra API

§  Admin API

o   Above the top layer:

§  Tools

§  Hadoop integration

§  Search API and Routing

·         Cassandra Data Model

o   Key (uniquely specifies a “row”)

§  Any arbitrary string

o   Column families are declared or deleted in advance by administrative action

§  Columns can be added or deleted dynamically

§  Column families have attributes:

·         Name: arbitrary string

·         Type: simple,

o   Key can “contain” multiple column families

§  No requirement that two keys have any overlap in columns

o   Columns can be added or removed arbitrarily from column families

o   Columns:

§  Name: arbitrary string

§  Value: non-indexed blob

§  Timestamp (client provided)

o   Column families have sort orders

§  Time-based sort or name-based sort

o   Super-column families:

§  Bigtable calls them locality groups

§  Super-column families have a sort order

§  Essentially a multi-column index

o   System column families

§  For internal use by Cassandra

o   Example from email application

§  Mail-list (sorted by name)

·         All mail that includes a given word

§  Thread-list (sorted by time)

·         All threads that include a given word

§  User-list (sorted by time)

·         All mail that includes a given user

·         Cassandra API

o   Simple get/put model

·         Write model:

o   Quorum write or async mode (used by email application)

o   Async: send request to any node

§  That node will push the data to appropriate nodes but return to client immediately

o   Quorum write:

§  Blocks until quorum is reached

o   If node down, then write to another node with a hint saying where it should be written to

§  Every 15 min a harvester goes through, finds hints, and moves the data to the appropriate node

o   At write time, you first write to a commit log (sequential)

§  After write to log it is sent to the appropriate nodes

§  Each node receiving write first records it in a local log

·         Then makes update to appropriate memtables (1 for each column family)

§  Memtables are flushed to disk when:

·         Out of space

·         Too many keys (128 is default)

·         Time duration (client provided – no cluster clock)

§  When a memtable is written out, two files are produced:

·         Data File

·         Index File

o   Key, offset pairs (points into data file)

o   Bloom filter (all keys in data file)

§  When a commit log has had all its column families pushed to disk, it is deleted

·         Data files accumulate over time.  Periodically data files are merge-sorted into a new file (which also creates a new index)

·         Write properties:

o   No locks in critical path

o   Sequential disk access only

o   Behaves like a write through cache

§  If you read from the same node, you see your own writes.  Doesn’t appear to provide any guarantee on read seeing latest change in failure case

o   Atomicity guaranteed for a key

o   Always writable

·         Read Path:

o   Connect to any node

o   That node will route to the closest data copy, which services the request immediately

o   If high consistency required, don’t return from local immediately

§  First send digest request to all replicas

§  If delta is found, the updates are sent to the nodes that don’t have current data (read repair)

·         Replication supported via multiple consistent hash rings:

o   Servers are hashed over ring

o   Keys are hashed over ring

o   Redundancy via walking around the ring and placing on the next node (rack position unaware) or on the next node on a different rack (rack aware) or on a next system in a different data center (implication being that the ring can span data centers)

·         Cluster membership

o   Cluster membership and failure detection via gossip protocol

·         Accrual failure detector

o   Default sets PHI to 5 in Cassandra

o   Detection is 10 to 15 seconds with PHI=5

·         UDP control messages and TCP for data messages

·         Complies with Staged Event Driven Architecture (SEDA)

·         Email system:

o   100m users

o   4B threads

o   25TB with 3x replication

o   Uses and joins across 4 tables:

§  Mailbox (user_id to thread_id mapping)

§  Msg_threads (thread to subject mapping)

§  Msg_store (thread to message mapping)

§  Info (user_id to user name mapping)

·         Able to load using Hadoop at 1.5TB/hour

o   Can load 25TB at network bandwidth over Cassandra Cluster
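
To illustrate the replication notes above (servers and keys hashed onto a ring, replicas placed by walking around it), here is a minimal sketch of the rack-unaware case. This is not Cassandra code: the class and function names are hypothetical, and MD5 is an arbitrary choice of hash for the example.

```python
import bisect
import hashlib


def ring_position(name):
    """Map a server name or key to a position on the ring (MD5 is an arbitrary choice)."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, servers):
        # Sort servers by ring position so lookups can binary-search.
        self.entries = sorted((ring_position(s), s) for s in servers)
        self.positions = [pos for pos, _ in self.entries]

    def replicas(self, key, count=3):
        """Return the `count` servers found by walking clockwise from the key."""
        count = min(count, len(self.entries))
        start = bisect.bisect_right(self.positions, ring_position(key))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
print(ring.replicas("user:42"))
```

A rack-aware or data-center-aware variant would keep walking the ring and skip any node that shares a rack or data center with a replica already chosen.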

 


 

Saturday, February 07, 2009 11:16:20 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Software
 Thursday, January 29, 2009

I did the final day keynote at the Conference on Innovative Data Systems Research earlier this month.  The slide deck is based upon the CEMS paper: The Case for Low-Cost, Low-Power Servers but it also included a couple of techniques I’ve talked about before that I think are super useful:

·         Power Load Management: The basic idea is to oversell power, the most valuable resource in a data center, just as airlines oversell seats, their revenue-producing asset. The conventional approach takes the data center critical power (total power less power distribution losses and mechanical loads) and then risks it down by 10 to 20% to play it safe, since utility over-draw brings high cost. Servers are then provisioned to this risked-down critical power level. But the key point is that almost no data center is ever anywhere close to 100% utilized (or even close to 50% for that matter, but that’s another discussion) so there is close to zero chance that all servers will draw their full load.  And, with some diversity of workloads, even with some services spiking to 100%, we can often exploit the fact that peak loads across dissimilar services are not fully correlated. On this understanding, we can provision more servers than we have critical power. This idea was originally proposed by Xiaobo Fan, Wolf-Dietrich Weber, and Luiz Barroso (all of Google) in Power Provisioning for a Warehouse-sized Computer. It’s a great paper.

·         Resource Consumption Shaping: This extends the idea above, taking the yield management techniques applied to power and applying them to all resources in the data center. The key observation here is that nearly all resources in a data center are billed at peak: power, networking, server counts, etc. So we can play two fairly powerful tricks: 1) exploit workload heterogeneity and over-subscribe all resources just as we did with power in Power Load Management above, and 2) move peaks to valleys to further reduce costs and exploit the fact that the resource valleys are effectively free. This is an idea that Dave Treadwell and I came up with a couple of years back and it’s written up in more detail in Resource Consumption Shaping.
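
A small worked example may help with the power overselling idea. The sketch below compares provisioning to full server peak against provisioning to the draw we actually expect when the facility peaks. The function name and every number (1MW critical power, 15% risk-down, 250W servers, 80% expected draw) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def servers_supported(critical_power_w, derate_pct, server_peak_w, expected_draw_pct=100):
    """Servers that fit in the risked-down power budget.

    expected_draw_pct is the share of its peak each server is expected to draw
    when the facility as a whole peaks; 100 means provisioning for full peak.
    """
    budget_w = critical_power_w * (100 - derate_pct) // 100
    expected_server_w = server_peak_w * expected_draw_pct // 100
    return budget_w // expected_server_w


conservative = servers_supported(1_000_000, 15, 250)
oversold = servers_supported(1_000_000, 15, 250, expected_draw_pct=80)
print(conservative, oversold)
```

With these made-up inputs, the oversold case installs servers whose combined peak draw (4,250 x 250W) is above the 1MW critical power, which is only safe because the servers are not expected to peak together.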

 

The slide deck I presented at the CIDR conference is at: http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_CIDR2009.pdf.

 

--jrh

 


 

Thursday, January 29, 2009 5:58:33 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Tuesday, January 27, 2009

Wow, 2TB for $250 from Western Digital: http://www.engadget.com/2009/01/26/western-digitals-2tb-caviar-green-hdd-on-sale-in-australia/. Once it’s shipping in North America, I’ll have to update The Cost of Bulk Cold Storage.


Update: Released in the US at $299: Western Digital's 2TB Caviar Green hard drive launches, gets previewed.

 

Sent my way by Savas Parastatidis.

 

                                                --jrh

 


 

Tuesday, January 27, 2009 5:28:33 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Hardware
 Sunday, January 25, 2009

In Microslice Servers and the Case for Low-Cost, Low-Power Servers, I observed that CPU bandwidth is outstripping memory bandwidth. Server designers can address this by: 1) designing better memory subsystems or 2) reducing the CPU per server.  Optimizing for work done per dollar and work done per joule argues strongly for the second approach for many workloads.

 

In Low Power Amdahl Blades for Data Intensive Computing (Amdahl Blades-V3.pdf (84.25 KB)), Alex Szalay makes a related observation and arrives at a similar point.  He argues that server I/O requirements for data intensive computing clusters grow in proportion to CPU performance. As per-server CPU performance continues to increase, we need to add additional I/O capability to each server.  We can add more disks but this drives up both power and cost as more disks require more I/O channels. Another approach is to use generation 2 flash SSDs such as the Intel X25-E or the OCZ Vertex (I’m told the Samsung 2.06Gb/s (SLC) is also excellent but I’ve not yet seen their random write IOPS rates). Both the OCZ and the Intel components are excellent performers, nearing FusionIO but at a far better price point, making them considerably superior in work done per dollar.

 

The Szalay paper looks first at the conventional approach of adding flash SSDs to a high-end server. To get the required I/O rates, three high-performance SSDs would be needed.  But, to get full I/O rates from the three devices, three I/O channels would be needed, which drives up power and cost. What if we head the other way and, rather than scaling up the I/O sub-system, we scale down the CPU per server? Alex shows that a low-power, low-cost commodity board coupled with a single, high-performance flash SSD would form an excellent building block for a data intensive cluster. It’s a very similar direction to CEMS servers but applied to data intensive workloads.

 

One of the challenges of low-power, high-density servers along the lines proposed by Alex and me is network cabling.  With CEMS there are 240 servers/rack and a single top-of-rack switch is inadequate, so we go with a mini-switch per six-server tray and each of 40 trays connected to a top-of-rack switch.  The Low Power Amdahl Blades are denser still. Alex makes a more radical proposal: interconnect the rack using very short-range radio. From the paper:

 

Considering their compact size and low heat dissipation, one can imagine building clusters of thousands of low-power Amdahl blades. In turn, this high density will create challenges related to interconnecting these blades using existing communication technologies (i.e., Ethernet, complex wiring if we have 10,000 nodes). On the other hand, current and upcoming high-speed wireless communications offer an intriguing alternative to wired networks. Specifically, current wireless USB radios (and their WLP IP-based variants) offer point-to-point speeds of up to 480 Mbps over small distances (~3-10 meters). Further into the future, 60 GHz-based radios promise to offer Gbps of wireless bandwidth.

 

I’m still a bit skeptical that we can get rack-level radio networking to be a win in work done per dollar and work done per joule but it is intriguing and I’m looking forward to getting into more detail on this approach with Alex.

 

Conclusion

Remember it’s work done per dollar and work done per joule that we should be chasing.  And, in optimizing for these metrics, we increasingly face challenges of insufficient I/O and memory bandwidth per core. Both CEMS and Low-Power Amdahl Blades address the system balance issue by applying more low-power servers rather than adding more I/O and memory bandwidth to each server. 

 

It’s the performance of the aggregate cluster we care about, and work done per dollar and work done per joule are the final arbiters.

 

                                                                                --jrh

 


 

Sunday, January 25, 2009 7:28:18 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Friday, January 23, 2009

In The Case For Low-Power Servers I reviewed the Cooperative, Expendable, Micro-slice Servers project. CEMS is a project I had been doing in my spare time, investigating the use of low-power, low-cost servers to run internet-scale workloads. The core premises of the CEMS project: 1) servers are out-of-balance, 2) client and embedded volumes drive down cost, and 3) performance is the wrong metric.

Out-of-Balance Servers: The key point is that CPU bandwidth is increasing far faster than memory bandwidth (see page 7 of Internet-Scale Service Efficiency). CPU performance continues to improve at roughly historic rates. Core count increases have replaced the previous reliance on frequency increases, but performance improvements continue unabated. As a consequence, CPU performance is outstripping memory bandwidth, with the result that more and more cycles are spent in pipeline stalls. There are two broad approaches to this problem: 1) improve the memory subsystem, and 2) reduce CPU performance. The former drives up design cost and consumes more power. The latter is a counter-intuitive approach: just run the CPU slower.

 

The CEMS project investigates using low-cost, low-power client and embedded CPUs to produce better price-performing servers. The core observation is that internet-scale workloads are partitioned over 10s to 1000s of servers. Running more, slightly slower servers is an option if it produces better price/performance. Raw, single-server performance is neither needed nor the most cost-effective goal.

 

Client and Embedded Volumes: It’s always been a reality of the server world that volumes are relatively low. Clients and embedded devices, by contrast, sell at a clip of over 10^9 units annually. Volume drives down costs. Servers leveraging client and embedded volumes can be MUCH less expensive and still support the workload.

 

Performance is the wrong metric: Most servers are sold on the basis of performance, but I’ve long argued that single-dimensional metrics like raw performance are the wrong measure. What we need to optimize for is work done per dollar and work done per joule (a watt-second). In a partitioned workload running over many servers, we shouldn’t care about or optimize for single-server performance. What’s relevant is work done/$ and work done/joule. The CEMS project investigates optimizing for these metrics rather than raw performance.
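A small sketch of how the two metrics can be computed may help. It takes requests per second as the unit of work and uses purchase price alone as the dollar figure. Both choices, along with the class name, the function name, and every number below, are illustrative assumptions and not measurements from the CEMS work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSku:
    """One server design; every field here is a hypothetical input."""
    name: str
    requests_per_second: float  # sustained throughput on the workload
    price_dollars: float        # purchase price only (an assumption)
    power_watts: float          # average draw under that load

    def rps_per_dollar(self) -> float:
        return self.requests_per_second / self.price_dollars

    def requests_per_joule(self) -> float:
        # A watt is a joule per second, so
        # (requests/s) / (joules/s) = requests per joule.
        return self.requests_per_second / self.power_watts


def relative_gain(candidate: ServerSku, baseline: ServerSku) -> dict:
    """Ratio of the candidate's metrics to the baseline's."""
    return {
        "rps_per_dollar": candidate.rps_per_dollar() / baseline.rps_per_dollar(),
        "requests_per_joule": candidate.requests_per_joule()
        / baseline.requests_per_joule(),
    }


# Made-up numbers for illustration only.
big = ServerSku("conventional", requests_per_second=1000.0,
                price_dollars=2000.0, power_watts=300.0)
small = ServerSku("low-power slice", requests_per_second=400.0,
                  price_dollars=500.0, power_watts=60.0)
gain = relative_gain(small, big)
```

With these invented inputs the slice server delivers only 40% of the single-server throughput, yet it comes out ahead on both ratios (1.6x on RPS/dollar and 2.0x on requests/joule). That is the sense in which raw performance is the wrong metric for a partitioned workload.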

 

Using work done/$ and work done/joule as the optimization point, we tested a $500/slice server design on a high-scale production workload and found a nearly 4x improvement over the current production hardware.

Earlier this week Rackable Systems announced Microslice Architecture and Products. These servers come in at $500/slice and optimize for work done/$ and work done/joule. I particularly like this design in that it’s using client/embedded CPUs but includes full ECC memory, and the price/performance is excellent. These servers will run partitionable workloads like web-serving extremely cost effectively.

 

                                                --jrh

 


Friday, January 23, 2009 6:23:00 AM (Pacific Standard Time, UTC-08:00)  #    Comments [7] - Trackback
Hardware
 Monday, January 19, 2009

I recently stumbled across: Snippets on Software.  It’s a collection of mini-notes on software with links to more if you are interested in more detail. Some snippets are wonderful, some clearly aren’t exclusive to software and some I would argue are just plain wrong. Nonetheless, it’s a great list.

 

It’s too long to read from end-to-end in one sitting but it’s well worth skimming. Below are a few snippets that I enjoyed to whet your appetite:

 

"there is only one consensus protocol, and that's Paxos" - all other approaches are just broken versions of Paxos. – Mike Burrows

 

Conway’s Law: Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.

 

"Reading, after a certain age, diverts the mind too much from its creative pursuits. Any man who reads too much and uses his own brain too little falls into lazy habits of thinking." -- Albert Einstein

 

A human being should be able to change a diaper, plan an invasion, butcher a hog, conn a ship, design a building, write a sonnet, balance accounts, build a wall, set a bone, comfort the dying, take orders, give orders, cooperate, act alone, solve equations, analyze a new problem, pitch manure, program a computer, cook a tasty meal, fight efficiently, die gallantly. Specialization is for insects. -Robert A. Heinlein

"You can try to control people, or you can try to have a system that represents reality. I find that knowing what's really happening is more important than trying to control people." -- Larry Page

 

Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?  --Brian Kernighan

 

                                                                                --jrh

 


 

Monday, January 19, 2009 9:55:01 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Wednesday, January 14, 2009

The Conference on Innovative Data Systems Research was held last week at Asilomar, California. It’s a biennial systems conference. At the last CIDR, two years ago, I wrote up Architecture for Modular Data Centers where I argued that containerized data centers are an excellent way to increase the pace of innovation in data center power and mechanical systems and are also a good way to grow data centers more cost effectively with a smaller increment of growth.

 

Containers have both supporters and detractors and it’s probably fair to say that the jury is still out. I’m not stuck on containers as the only solution, but any approach that supports smooth, incremental data center expansion is interesting to me. Some high-scale modular deployments are in the works (First Containerized Data Center Announcement) so, as an industry, we’re starting to get some operational experience with the containerized approach.

 

One of the arguments that I made in the Architecture for Modular Data Centers paper was that a fail-in-place model might be the right approach to server deployment. In this approach, a module of servers (multiple server racks) is deployed and, rather than servicing them as they fail, the overall system capacity just slowly goes down as servers fail. As each server fails, it is shut off but not serviced. Most data centers are power-limited rather than floor-space limited. Allowing servers to fail in place trades off space, which we have in abundance, in order to get high-efficiency service. Rather than servicing systems as they fail, just let them fail in place and, when the module’s healthy-server density gets too low, send it back for remanufacturing at the OEM, who can do it faster and cheaper and can recycle all that is possible.
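As a rough illustration of the fail-in-place trade-off, the sketch below estimates how many years a module stays in service before its healthy-server fraction drops below a return threshold. The constant annual failure rate, the 80% threshold, and the function name are all hypothetical choices for the example. None of them come from the modular data center paper.

```python
def years_until_return(annual_failure_rate: float,
                       min_healthy_fraction: float) -> int:
    """Whole years until a fail-in-place module falls below the threshold.

    Simplifying assumption: a fixed fraction of the surviving servers
    fails each year, and failed servers are shut off, never repaired.
    """
    if not 0.0 < annual_failure_rate < 1.0:
        raise ValueError("annual_failure_rate must be between 0 and 1")
    if not 0.0 < min_healthy_fraction < 1.0:
        raise ValueError("min_healthy_fraction must be between 0 and 1")

    healthy = 1.0  # fraction of servers still running
    years = 0
    while healthy >= min_healthy_fraction:
        healthy *= 1.0 - annual_failure_rate
        years += 1
    return years


# Hypothetical inputs: 5% of surviving servers fail per year, and the
# module is sent back once fewer than 80% of its servers are healthy.
service_years = years_until_return(annual_failure_rate=0.05,
                                   min_healthy_fraction=0.80)
```

Under those made-up inputs the module would run for five years with no in-place servicing before going back for remanufacturing. Real failure rates are not constant over a server's life, so this is a starting point for the discussion and not a prediction.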

 

Fail in place (Service Free Systems) was by far the most debated part of the modular datacenter work. But it did get me thinking about how cheaply a server could be delivered. And, over time, I’ve become convinced that optimizing for server performance is silly. What we should be optimizing for is work done/$ and work done/joule (a watt-second). Taking those two optimization points with a goal of a sub-$500 server led to the Cooperative, Expendable, Micro-Slice Server project that I wrote up for this year’s CIDR.

 

In this work, we took an existing very high scale web property (many thousands of servers) and ran their production workload on the servers currently in use. We compared the server SKU currently being purchased with a low-cost, low-power design using work done/$ and work done/joule as the comparison metrics. Using this $500 server design, we were able to achieve:

 

·         RPS/dollar: 3.7x

·         RPS/Joule: 3.9x

·         RPS/Rack: 9.4x

 

Note that I’m not a huge fan of gratuitous density (density without customer value).  See Why Blade Servers aren’t the Answer to all Questions for the longer form of this argument. I show density here only because many find it interesting, it happens to be quite high and, in this case, did not bring a cost penalty.

 

The paper is at: http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_CEMS.pdf.

 

Abstract: The CEMS project evaluates low cost, low power servers for high-scale internet-services using commodity, client-side components. It is a follow-on project to the 2007 CIDR paper Architecture for Modular Data Centers. The goals of the CEMS project are to establish that low-cost, low-power servers produce better price/performance and better power/performance than current purpose-built servers. In addition, we aim to establish the viability and efficiency of a fail-in-place model. We use work done per dollar and work done per joule as measures of server efficiency and show that more, lower-power servers produce the same aggregate throughput much more cost effectively and we use measured performance results from a large, consumer internet service to argue this point.

 

Thanks to Giovanni Coglitore and the rest of the Rackable Systems team for all their engineering help with this work.

 

James Hamilton

Amazon Web Services

james@amazon.com

Wednesday, January 14, 2009 4:39:39 PM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
Hardware
 Sunday, January 11, 2009

Last night, TechCrunch hosted The Crunchies and two of my favorite services got awards. Ray Ozzie and David Treadwell accepted Best Technology Innovation/Achievement for Windows Live Mesh.  Amazon CTO Werner Vogels accepted Best Enterprise Startup for Amazon Web Services.

 

Also awarded (from http://crunchies2008.techcrunch.com/)

Best Application Or Service

Get Satisfaction
Google Reader (winner)
Minted
Meebo
MySpace Music (runner-up)
Yelp

Best Technology Innovation/Achievement

Facebook Connect (runner-up)
Google Friend Connect
Google Chrome
Windows Live Mesh (winner)
Swype
Yahoo BOSS

Best Design

Animoto (runner-up)
Cooliris (winner)
Friendfeed
Infectious
Lala
Sliderocket

Best Bootstrapped Startup

BackType
GitHub (winner)
Socialcast
StatSheet
12seconds.tv (runner-up)

Most Likely To Make The World A Better Place

Akoha
Causes
CO2Stats
GoodGuide (winner)
Kiva (runner-up)
Better Place

Best Enterprise Startup

Amazon Web Services (winner)
Force.com
Google App Engine (runner-up)
Yammer
Zoho

Best International Startup

eBuddy (winner)
Fotonauts
OpenX
Vente-privee
Wuala (runner-up)

Best Clean Tech Startup

Better Place (runner-up)
Boston Power
ElectraDrive
Laurus Energy
Project Frog (winner)

Best New Gadget/Device

Android G1 (runner-up)
Asus Eee 1000 Series
Flip MinoHD
iPhone 3G (winner)
SlingCatcher

Best Time Sink Site/Application

Mob Wars
iBowl
Tap Tap Revenge (winner)
Zivity
Texas Hold Em (runner-up)

Best Mobile Startup

ChaCha (runner-up)
Evernote (winner)
Posterous
Qik
Skyfire
Truphone

Best Mobile Application

Google Mobile Application (runner-up)
imeem mobile (winner)
Pandora Radio
rolando
ShopSavvy
Ocarina

Best Startup Founder

Linda Avey and Anne Wojcicki (23andMe)
Michael Birch and Xochi Birch (Bebo)
Robert Kalin (Etsy)
Evan Williams, Jack Dorsey, Biz Stone (Twitter) (winner)
Paul Buchheit, Jim Norris, Sanjeev Singh, Bret Taylor (FriendFeed) (runner-up)

Best Startup CEO

Tony Hsieh (Zappos)
Jason Kilar (Hulu) (runner-up)
Elon Musk (SpaceX)
Andy Rubin (Android)
Mark Zuckerberg (Facebook) (winner)

Best New Startup Of 2008

Dropbox (runner-up)
FriendFeed (winner)
GoodGuide
Tapulous
Topspin Media
Yammer

Best Overall Startup In 2008

Amazon Web Services
Facebook (winner)
Android
hulu
Twitter (runner-up)

James Hamilton

Amazon Web Services

james@amazon.com

 

Sunday, January 11, 2009 8:52:35 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings
 Saturday, January 10, 2009

Back in 2000, Joel Spolsky published a set of 12 best practices for a software development team. It’s been around for a long while now and there are only 12 points but it’s very good. Simple, elegant, and worth reading: The Joel Test: 12 Steps to Better Code.

 

Thanks to Patrick Niemeyer for sending this one my way.

 

                                                                --jrh

 

James Hamilton

Amazon Web Services

James@amazon.com

Saturday, January 10, 2009 8:12:40 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

All Content © 2012, James Hamilton