Friday, April 18, 2008

In the Rules of Thumb post, I argued that many of the standard engineering rules of thumb are changing. On a closely related point, Nishant Dani and Vlad Sadovsky both pointed me towards The Landscape of Parallel Computing Research: A View from Berkeley by David Patterson et al. Dave Patterson is best known for foundational work on RISC and for co-inventing RAID.  He has an amazing ability to spot a problem that is worth solving and whose solution is near, and then come up with practical solutions.  This paper has many co-authors but shows some of that same style.  It focuses on parallel systems and on conventional wisdom that has driven systems designs for some time but is no longer correct.  The Berkeley web site has more detail.


In the paper they argue that 13 computational kernels can be used to characterize most workloads.  They then observe that over half of these kernels are memory bound today, and more are expected to be in the future.  In effect, the problem is getting data up the storage and memory hierarchy to the processors, not the speed of the processors themselves. This has been true for years, and the problem worsens each year, and yet it still seems to get less focus than scaling processor speeds even though the latter won’t help without the former.


If you are interested in parallel systems, it’s worth reading the paper.  I’ve included the key changes in conventional wisdom below:


1. Old CW: Power is free, but transistors are expensive.
· New CW is the “Power wall”: Power is expensive, but transistors are “free”. That is, we can put more transistors on a chip than we have the power to turn on.

2. Old CW: If you worry about power, the only concern is dynamic power.
· New CW: For desktops and servers, static power due to leakage can be 40% of total power. (See Section 4.1.)

3. Old CW: Monolithic uniprocessors in silicon are reliable internally, with errors occurring only at the pins.
· New CW: As chips drop below 65 nm feature sizes, they will have high soft and hard error rates. [Borkar 2005] [Mukherjee et al 2005]

4. Old CW: By building upon prior successes, we can continue to raise the level of abstraction and hence the size of hardware designs.
· New CW: Wire delay, noise, cross coupling (capacitive and inductive), manufacturing variability, reliability (see above), clock jitter, design validation, and so on conspire to stretch the development time and cost of large designs at 65 nm or smaller feature sizes. (See Section 4.1.)

5. Old CW: Researchers demonstrate new architecture ideas by building chips.
· New CW: The cost of masks at 65 nm feature size, the cost of Electronic Computer Aided Design software to design such chips, and the cost of design for GHz clock rates means researchers can no longer build believable prototypes. Thus, an alternative approach to evaluating architectures must be developed. (See Section 7.3.)

6. Old CW: Performance improvements yield both lower latency and higher bandwidth.
· New CW: Across many technologies, bandwidth improves by at least the square of the improvement in latency. [Patterson 2004]

7. Old CW: Multiply is slow, but load and store is fast.
· New CW is the “Memory wall” [Wulf and McKee 1995]: Load and store is slow, but multiply is fast. Modern microprocessors can take 200 clocks to access Dynamic Random Access Memory (DRAM), but even floating-point multiplies may take only four clock cycles.

8. Old CW: We can reveal more instruction-level parallelism (ILP) via compilers and architecture innovation. Examples from the past include branch prediction, out-of-order execution, speculation, and Very Long Instruction Word systems.
· New CW is the “ILP wall”: There are diminishing returns on finding more ILP. [Hennessy and Patterson 2007]

9. Old CW: Uniprocessor performance doubles every 18 months.
· New CW is Power Wall + Memory Wall + ILP Wall = Brick Wall. Figure 2 plots processor performance for almost 30 years. In 2006, performance is a factor of three below the traditional doubling every 18 months that we enjoyed between 1986 and 2002. The doubling of uniprocessor performance may now take 5 years.

10. Old CW: Don’t bother parallelizing your application, as you can just wait a little while and run it on a much faster sequential computer.
· New CW: It will be a very long wait for a faster sequential computer (see above).

11. Old CW: Increasing clock frequency is the primary method of improving processor performance.
· New CW: Increasing parallelism is the primary method of improving processor performance. (See Section 4.1.)

12. Old CW: Less than linear scaling for a multiprocessor application is failure.
· New CW: Given the switch to parallel computing, any speedup via parallelism is a success.
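The ratio in item 7 quantifies the memory wall nicely. Here is a quick back-of-the-envelope calculation; the 200-clock DRAM access and 4-clock multiply figures are from the paper, while the "balance point" framing is my own illustration:

```python
# Rough machine-balance estimate from the paper's numbers:
# ~200 clocks per DRAM access vs. ~4 clocks per floating-point multiply.
DRAM_ACCESS_CLOCKS = 200
MULTIPLY_CLOCKS = 4

# How many multiplies could have executed during one cache-miss stall.
multiplies_per_miss = DRAM_ACCESS_CLOCKS // MULTIPLY_CLOCKS
print(multiplies_per_miss)  # 50

# A kernel performing fewer than ~50 multiplies per DRAM access is
# memory bound: the processor spends most of its time waiting on data.
```

This is why faster processors alone don't help the memory-bound kernels: the balance point moves further out with every processor generation.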

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859



Friday, April 18, 2008 4:42:25 AM (Pacific Standard Time, UTC-08:00)
 Wednesday, April 16, 2008

How do you ensure that data written to disk is REALLY on disk?  Yeah, I know, this shouldn’t be hard, but the I/O stack is deep, everyone is looking for performance, and everyone is caching along the way, so it’s more interesting than you might like.  If you are writing code that needs reliable write-through semantics, such as write-ahead logging, then you need to ensure you are writing through to the media. If you are writing to a SAN or SCSI, it’s pretty straightforward, but if you are using EIDE or SATA, then things get a bit more interesting. What follows is Windows-specific, but you need to be aware of these issues on non-Windows systems as well.


If it’s a SCSI disk (not SATA or EIDE), then setting FILE_FLAG_WRITE_THROUGH and FILE_FLAG_NO_BUFFERING is sufficient.  FILE_FLAG_WRITE_THROUGH forces all data written to the file to be written through the cache directly to disk. All writes are to the media.  FILE_FLAG_NO_BUFFERING ensures that all reads come directly from the media as well by preventing any read-ahead and disk caching. What’s happening behind the scenes when these parameters are specified on CreateFile() is that the filesystem and memory manager do no caching, and Force Unit Access (FUA) is sent to the device on writes to ensure they go directly to the media rather than being cached in the device cache.
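On POSIX systems the rough analogue of these CreateFile() flags is O_SYNC (write-through) plus O_DIRECT (unbuffered). Here is a minimal Python sketch of the write-through half, with the same caveat as above: a SATA drive's onboard cache may still absorb the write unless the device honors the flush semantics. The file name is invented for illustration:

```python
import os

# Open with O_SYNC: each write() returns only after the data has been
# pushed through the OS cache toward stable storage. This parallels
# FILE_FLAG_WRITE_THROUGH; as with the SATA/EIDE caveat above, the
# drive's own write cache can still hold the data unless the device
# honors flush/FUA requests.
fd = os.open("journal.dat", os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
try:
    os.write(fd, b"log record 1\n")
finally:
    os.close(fd)

# Read it back through a fresh descriptor to confirm what landed.
with open("journal.dat", "rb") as f:
    print(f.read())  # b'log record 1\n'
```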


The reason the above is not typically sufficient with EIDE and SATA drives is that FUA is dropped by the standard SATA and EIDE miniport driver.  The filesystem and memory manager will respect the parameters but the device will likely still cache writes without FUA.


FUA is dropped for performance reasons: SATA and EIDE can only process one command at a time, and the full flush required by FUA is slow. SCSI can process multiple commands in parallel, and the flush is less expensive. Is Native Command Queuing (NCQ) the solution to the performance problem? Unfortunately, no.  NCQ allows multiple commands to be sent to the drive and gives the drive flexibility in what order to execute them, but the restriction of only one command executing at a time remains.


What’s the solution to getting reliable writes when using commodity disks and needing guaranteed writes? The simple answer is to set the registry flag that turns off the discarding of FUA. This solves the correctness problem but at considerable performance expense. Essentially, this will be semantically correct but slow, due to the SATA single-command limitation and the length of time it takes to go directly to the media.  Shutting off Write Cache Enable (WCE) on a per-drive basis is another option.


Another option is FlushFileBuffers(), which is a call fully honored by all device types. FlushFileBuffers() takes a file handle argument, flushes the filesystem/memory manager cache for that handle, and flushes the entire volume that holds that file.  This again works but is broader than required, in that the entire device cache gets flushed.  I’m told that you can also use FLUSH_CACHE on the device as an alternative to FlushFileBuffers() on a handle. A paper that shows the use of FLUSH_CACHE to achieve correct write-ahead logging semantics is Enforcing Database Recoverability on Disks that Lack Write-Through.  In this paper, using SQL Server running a mini-TPC-C as a test case, they measured performance degradation of as little as 2% using FLUSH_CACHE calls to the device as needed. A small price to pay for correctness.
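FlushFileBuffers() has a direct POSIX analogue in fsync(), and the write-ahead-logging discipline discussed above, where no commit is acknowledged before its log record is durable, looks roughly like this sketch. The file name and record format are invented for illustration:

```python
import os

LOG_PATH = "wal.log"  # hypothetical log file name

def append_and_commit(record: bytes) -> None:
    """Write-ahead-logging discipline: do not acknowledge the commit
    until the log record has been flushed through the OS cache.
    (On drives with write caching enabled, the device cache must also
    be flushed -- the FLUSH_CACHE / dropped-FUA issue above.)"""
    fd = os.open(LOG_PATH, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, record)
        os.fsync(fd)  # analogue of FlushFileBuffers() on the handle
    finally:
        os.close(fd)

append_and_commit(b"COMMIT txn=42\n")
with open(LOG_PATH, "rb") as f:
    print(f.read())  # b'COMMIT txn=42\n'
```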






Wednesday, April 16, 2008 5:59:43 PM (Pacific Standard Time, UTC-08:00)
 Monday, April 14, 2008

Wow, the pace is starting to pick up in the service platform world. Google announced their long-awaited entrant, Google App Engine, last Monday, April 7th. Amazon announced SimpleDB to answer the largest requirement they were hearing from AWS customers: persistent, structured storage. Yesterday, another major step was made with Werner Vogels announcing the availability of persistent storage for EC2.

Persistence for EC2 is a big one.  I’ve been amazed at how hard customers were willing to work to get persistent storage in EC2.  The most common trick is to periodically snapshot the up to 160GB of ephemeral state allocated to each Amazon EC2 instance to S3. This does work but is very clunky, and losing everything between the last snapshot and a non-orderly shutdown is a bit nasty.  A solution I like is a replicated block storage layer like DRBD.  One innovative solution to all EC2 state being transient is to use DRBD to maintain a replicated file system between two EC2 instances.  Not bad; in fact I really like it, but it’s hard to set up and, last time I checked, it only supported 2-way redundancy, when 3-way is where you want to be when using commodity hardware.
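The snapshot workaround can be sketched roughly as follows. This is illustrative only: `upload` stands in for whatever S3 client call the application uses (it's a parameter here, not a real AWS API), and the file names are invented:

```python
import io
import tarfile

def snapshot(state_files: dict, upload) -> None:
    """Tar up the instance's ephemeral state and hand the archive to an
    uploader (e.g., an S3 PUT). Anything written after this snapshot is
    lost on a non-orderly shutdown -- the clunkiness described above."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, content in state_files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    upload("snapshot.tar.gz", buf.getvalue())

# Stand-in uploader that just records what would have gone to S3.
uploaded = {}
snapshot({"db/page0": b"hello"}, lambda key, blob: uploaded.update({key: blob}))
print(sorted(uploaded))  # ['snapshot.tar.gz']
```

In a real deployment this would run on a timer; the window between runs is exactly the data-loss exposure the text describes.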


It appears the solution is (nearly) here with EC2 persistence.  The model they have chosen uses the storage volume as the abstraction.  Any number of storage volumes can be created, in sizes of up to 1TB. Each storage volume is created in a developer-specified availability zone, and each volume supports snapshots to S3. A volume can also be created from a snapshot.  The supported redundancy and recovery models were not specified, but I would expect that they are using redundant, commodity storage. Werner did say it has file system semantics, which I interpret as cached, asynchronous writes with optional application-controlled write-through/flush.  It is not clear whether shared volumes are supported (multiple EC2 instances accessing the same volume).


Another blog entry from Amazon demos usage: “I spent some time experimenting with this new feature on Saturday. In a matter of minutes I was able to create a pair of 512 GB volumes, attach them to an EC2 instance, create file systems on them with mkfs, and then mount them. When I was done I simply unmounted, detached, and then finally deleted them.”


Unfortunately, persistent storage for EC2 won’t be available until “later this year” but it looks like a good feature that will be well received by the development community.


Update: This may be closer to beta than I thought.  I just (5:52am 4/14) received a limited beta invitation.




Thanks to David Golds and Dare Obasanjo for sending pointers my way.





Monday, April 14, 2008 4:37:39 AM (Pacific Standard Time, UTC-08:00)
 Saturday, April 12, 2008

The only thing worse than no backups is restoring bad backups. A database guy should get these things right.  But, I didn’t, and earlier today I made some major site-wide changes and, as a side effect, this blog was restored to December 4th, 2007.  I’m working on recovering the content and will come up with something over the next 24 hours. However it’s very likely that comments between Dec 4th and earlier today will be lost.  My apologies.


Update 2008.04.13: I was able to restore all content other than comments between 12/4/2007 and yesterday morning.  All else is fine.  I'm sorry about the RSS noise during the restore and for the lost comments.  The backup/restore procedure problem is resolved.  Please report any broken links or lingering issues. Thanks,








Saturday, April 12, 2008 11:16:29 AM (Pacific Standard Time, UTC-08:00)

I was on a panel at the International Conference on Data Engineering yesterday morning in Cancun, Mexico, but I was only there for Friday. You’re probably asking “why would someone fly all the way to Cancun for one lousy day?”  Not a great excuse, but it goes like this: the session was originally scheduled for Wednesday, and I was planning to attend the entire conference since I haven’t been to a pure database conference in a couple of years.  But it was later moved to Friday mid-morning, and work is piling up at the day job, so I ended up deciding to just fly in for the day.


It was such a short trip that I ended up flying both in and out of Cancun with the same flight crew.  They offered me a job, so I’ve now got a back-up plan at Alaska Air in case the distributed systems market goes soft.    Actually, I had some company.  Hector Garcia-Molina and I arrived at the airport at the same time Thursday and flew out at the same time Friday. Hector was flying in for his Friday morning keynote, PhotoSpread: A Spreadsheet for Managing Photos.


The panel I participated in on Friday was “Cloud Computing: Was Thomas Watson Right After All?”, organized by Raghu Ramakrishnan of Yahoo! Research.  The basic premise of the panel is that much of the current server-side workload is migrating to the cloud, and this trend is predicted by many to accelerate. I partly agreed and partly disagreed.  From my perspective, the broad move to a services-based model is inescapable. The economics are simply too compelling.  But, at the same time that I see a massive migration to a services-based model, the capabilities of the edge are growing faster than ever.  One billion cell phones will sell this year.  Personal computer sales remain robust.  The edge will always have more compute, more storage, and less latency.  I argue that we will continue to see more conventional enterprise workloads move to a services-based model each year. And, at the same time, we’ll see increased reliance on the capabilities of edge devices. More service-based applications will depend upon large local caches supporting low-latency access and disconnected operation, and upon deep, highly engaging user interfaces.  Basically, service-based applications exploiting local device capabilities and interfaces (Browser-Hosted Software with a "Real" UX).


My summary: the edge pulls computation close to the user for the best possible user experience. The core pulls computation close to data.  Basically, I’m arguing both will happen.


Looking more closely at the mass migration of many of the current enterprise workloads to a services-based delivery model, the driving factor is lower cost and freeing up IQ to work on the core business. When there is an order of magnitude in cost savings possible, big changes happen. In many ways the predicted mass move to services reminds me of the move to packaged Enterprise Resource Planning software 10 to 20 years back.  Before then, most enterprises wrote all their own internal systems, which were incredibly expensive but 100% tailored to their unique needs.  It was widely speculated that no large company would ever be willing to change their business sufficiently to use commercial ERP software. And, they probably wouldn’t have if it wasn’t for the several-factor difference in price.  The entire industry moved to packaged ERP software at an incredible pace.  Common applications like HR and accounting are now typically sourced commercially by even the largest enterprises and they invest in internal development where they need to innovate or add significant value (generally, ignoring Enron, you don’t really want to innovate too much in accounting). 


The same thing is happening with services.  Just as before, I frequently hear that no big enterprise will move to a services-based model, due to security and privacy reasons and a need to tailor their internal applications for their own use.  And, again, the cost difference is huge, and I fully expect the results will be the same: common applications where the company is not doing unique innovation will move to a services-based model.  In fact, it’s already happening. Even as early as a couple of years back when I led Exchange Hosted Services, I was amazed to find that many of the largest household-name enterprises were moving some of their applications to a services model.  It’s happening.


The slides I presented at ICDE: JamesRH_ICDE2008x.ppt (749.5 KB).





Friday, April 11, 2008 11:12:32 PM (Pacific Standard Time, UTC-08:00)
 Wednesday, April 09, 2008

What’s commonly referred to as the Great Firewall of China isn’t really a firewall at all.  I recently came across an Atlantic Monthly article investigating how the Great Firewall works and what it does (see The Connection has been Reset).


The official name of what is often called the Great Firewall of China is the Golden Shield project. Rather than acting as a firewall, it’s actually mirroring content and manipulating DNS, connection management, and URL redirection to implement its goal of restricting what internet users inside China can access.


This project has been widely criticized on political and social fronts – I won’t repeat them here.  It’s also been widely criticized on technical grounds as ineffective, weak, and easy to thwart.  Again, not my focus.  This article simply caught my interest technically as content filtering at this scale is an incredibly difficult task. What techniques are employed?


Like many software security problems, this one has no single solution that solves it fully; the main goal of the Golden Shield project is to add friction.  If it’s painful enough to get to the content they are trying to block, few will bother to access it.  It is friction rather than prevention that ensures that few Chinese internet users see restricted content in any quantity.  The four levels of protection/restriction are:


1.       DNS Block: sites that are on the current blacklist get a DNS resolution failure or get redirected to other content.  This was the technique employed against at least one major search engine to force them to add filtering to their web index: for some time, all access was redirected to their larger Chinese competitor Baidu.  The other application of this technique is to return a DNS lookup failure so that, for example, searches for a blocked site simply return “not found”.

2.       Connect: Connection requests leaving China are inspected in parallel.  If the IP address is on the current IP blacklist, a connection reset is sent, which causes the connection to fail.

3.       URL Block: If the URL contains words on the illegal-word blacklist, the connection is redirected endlessly.  I’m not sure if they are only sniffing the URL or also doing reverse DNS to get the site name but, if unacceptable words are found in the URL, they redirect the connection repeatedly. Some browsers hang, while others return an error message.

4.       Content Block: At this level, the DNS lookup has succeeded, the connection has been made, and content is being returned to the user. As the content is returned to the requesting user inside China, it is scanned in parallel for unapproved keywords and phrases. If any are found, the connection is broken immediately. As well as breaking the connection mid-way, subsequent requests from that client IP to that destination IP are blocked. The first block is short, but consecutive attempts drive up the length of the IP-to-IP connection-block period and may eventually draw official scrutiny.
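The decision logic of the first three levels can be caricatured in a few lines of code; the blacklists, names, and action labels here are invented for illustration and not taken from any real deployment:

```python
# Toy model of levels 1-3: DNS block, IP connection reset, URL keyword block.
DNS_BLACKLIST = {"blocked.example.org"}       # illustrative entries only
IP_BLACKLIST = {"203.0.113.7"}
URL_KEYWORD_BLACKLIST = {"forbidden-topic"}

def filter_decision(hostname: str, ip: str, url: str) -> str:
    if hostname in DNS_BLACKLIST:
        return "dns-failure-or-redirect"   # level 1: DNS block
    if ip in IP_BLACKLIST:
        return "connection-reset"          # level 2: RST on connect
    if any(word in url for word in URL_KEYWORD_BLACKLIST):
        return "endless-redirect"          # level 3: URL keyword block
    return "pass"                          # level 4 (content scan) happens later

print(filter_decision("blocked.example.org", "198.51.100.1", "/index.html"))
print(filter_decision("ok.example.com", "203.0.113.7", "/index.html"))
print(filter_decision("ok.example.com", "198.51.100.1", "/forbidden-topic/page"))
print(filter_decision("ok.example.com", "198.51.100.1", "/news"))
```

The interesting engineering is not this decision table but doing the inspection in parallel at national scale without slowing unblocked traffic.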


In addition to these techniques to block access to content outside of China, an estimated 30,000 censors scan for and remove unapproved content posted within China.


The Golden Shield project is reportedly also being used in the opposite direction to prevent access to some content inside of China from outside the country.


There are many means of subverting the Golden Shield, including using a proxy server outside of China or setting up a VPN connection to a server outside the country.  Encrypted connections will also get through, as will encrypted email.  However, all these techniques are non-default and require some work on the part of the user.  Most users don’t bother so, for the most part, the goals of the Golden Shield are attained even though it’s technically not that strong.


The Atlantic Monthly article:

Wired Article:

Wikipedia article:




Thanks to Jennifer Hamilton and Mitch Wyle for pointing out the Atlantic Monthly article.



Tuesday, April 08, 2008 11:13:57 PM (Pacific Standard Time, UTC-08:00)
 Saturday, April 05, 2008

The services world is one built upon economies of scale.  For example, networking costs for small and medium-sized services can run nearly an order of magnitude more than what large bandwidth consumers such as Google, Amazon, Microsoft, and Yahoo pay. These economies of scale make it possible for services such as Amazon S3 to pass on to those writing against their service platform some of the savings they get on networking, for example, while at the same time profiting (S3 is currently pricing storage under their cost, but that’s a business decision rather than a business model problem). These economies of scale enjoyed by large service providers extend beyond networking to server purchases, power costs, networking equipment, etc.


Ironically, even with these large economies of scale, it’s cheaper to compute at home than in the cloud. Let’s look at the details.


Infrastructure costs are incredibly high in the services world, with a new 13.5 megawatt data center costing over $200M before the upwards of 50,000 servers that fill the data center are purchased.  Data centers are about the furthest thing from commodity parts, and I have been arguing that we should be moving to modular data centers for years (there has been progress on that front as well: First Containerized Data Center Announcement).  Modular designs take some of the power and mechanical system design from an upfront investment with a 15-year life to a design that comes with each module and is on a three-year or shorter amortization cycle, and this helps increase the speed of innovation.


Modular data centers help, but they still require central power, mechanical systems, and networking systems, and these remain expensive, non-commodity components. How do we move the entire datacenter to commodity components?  Ken Church makes a radical suggestion: rather than design and develop massive data centers with 15-year lives, let’s incrementally purchase condominiums (just-in-time) and place a small number of systems in each.  Radical to be sure, but condos are a commodity and, if this mechanism really were cheaper, it would be a wake-up call to all of us to start looking much more closely at current industry-wide costs and what’s driving them. That’s our point here.


Ken and I did a quick back-of-the-envelope comparison of this approach below.   Both configurations are designed for 54k servers and roughly 13.5MW.  Condos appear notably cheaper, particularly in terms of capital.




Large Tier II+ Data Center vs. Condo Farm (1125 Condos):

· Servers: 54k in both cases (condo farm: 48 servers/condo * 1125 condos)

· Power (peak): 13.5 MW in both cases (= 250 Watts/server * 54k servers = 12 KW/condo * 1125 condos)

· Capital: over $200M for the data center vs. $112.5M for the condos (= $100k/condo * 1125 condos)

· Annual power expense: $3.5M/year for the data center (= $0.03 per kWh * 24*365 hours/year * 13.5MW) vs. $10.6M/year for the condos (= $0.09 per kWh * 24*365 hours/year * 13.5MW)

· Annual rental income: none for the data center vs. $8.1M/year for the condos (= $1000/condo per month * 12 months/year * 1125 condos, less $200/condo per month condo fees, conservatively assuming 80% occupancy)



In the quick calculation above, we have the condos at $100k each, and all 1,125 of them at $112.5M, whereas the purpose-built data center would price in at over $200M.  We have assumed an unusually low cost for power at the purpose-built center, a 66% reduction over standard power rates. Deals this good are getting harder to negotiate, but they still do exist.  The condos must pay full residential power rates without discount, which is far higher at $10.6M/year.  However, offsetting this increased power cost, we rent the condos out at a low $1,000/month and conservatively account for only 80% occupancy.


Looking at the totals, the condos are at 56% of the capital cost, and annually they run $2.5M in net operational costs ($10.6M in power less $8.1M in rental income), whereas the data center’s power costs are higher at $3.5M.  The condos’ operational costs are 71% of the purpose-built design’s.  Summarizing, the condos run at just about half the cost of the purpose-built data center, both in capital and in annual operating costs.
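The arithmetic behind these totals can be checked in a few lines; all inputs are the assumptions stated in the comparison above:

```python
# Inputs from the back-of-the-envelope comparison above.
condos = 1125
servers = 48 * condos                  # 54,000 servers
peak_kw = 250 * servers / 1000         # 13,500 kW = 13.5 MW
hours_per_year = 24 * 365

dc_power_cost    = 0.03 * peak_kw * hours_per_year   # $/kWh * kW * hours
condo_power_cost = 0.09 * peak_kw * hours_per_year
# $1000/mo rent at 80% occupancy, less $200/mo condo fees (paid regardless).
rental_income    = (1000 * 0.80 - 200) * 12 * condos

condo_net_opex = condo_power_cost - rental_income

print(round(dc_power_cost / 1e6, 1))     # 3.5  ($M/year, purpose-built)
print(round(condo_power_cost / 1e6, 1))  # 10.6 ($M/year, condos)
print(round(rental_income / 1e6, 1))     # 8.1  ($M/year)
print(round(condo_net_opex / 1e6, 1))    # 2.5  ($M/year net, condos)
print(round(112.5 / 200, 2))             # 0.56 capital ratio
```

Note the rental line only pencils out to $8.1M if the 80% occupancy discount applies to rent while the $200/month fee is paid on every unit; that is the reading used here.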


Condos offer the option to buy/sell just-in-time.  The power bill depends more on average usage than worst-case peak forecast.  These options are valuable under a number of not-implausible scenarios:

·         Long-Term demand is far from flat and certain; demand will probably increase, but anything could happen over the next 15 years

·         Short-Term demand is far from flat and certain; power usage depends on many factors including time of day, day of week, seasonality, and economic booms and busts.  In all data centers we’ve looked at, average power consumption is well below the worst-case peak forecast.


How could condos compete with, or even approach, the cost of a purpose-built facility located where land is cheap and power is cheaper?  One factor is that condos are built in large numbers and are effectively “commodity parts”.  Another factor is that most data centers are over-engineered.  They include redundancy, such as uninterruptable power supplies, that the condo solution doesn’t include.  The condo solution gets its redundancy via many micro-data centers and the ability to endure failures across the fabric. When some of the non-redundantly powered micro-centers are down, the others carry the load. (Clearly, achieving this application-level redundancy requires additional application investment.)
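To see why many non-redundant micro-centers can substitute for redundant power, here is a toy availability model; the 99% per-condo availability figure and the independent-failure assumption are mine, purely for illustration:

```python
import math

condos = 1125
p_up = 0.99   # assumed availability of one non-redundant condo (illustrative)

# Expected fraction of the fleet serving at any instant.
expected_up = condos * p_up
print(round(expected_up))  # 1114 condos up on average

# Probability that 5% or more of the fleet (>= 57 condos) is down at once,
# for X ~ Binomial(1125, 0.01). Terms beyond k=300 are numerically
# negligible (and capping k keeps math.comb small enough for float math).
threshold = math.ceil(0.05 * condos)   # 57
p_bad = sum(
    math.comb(condos, k) * (1 - p_up) ** k * p_up ** (condos - k)
    for k in range(threshold, 300)
)
print(p_bad < 1e-12)  # True: a large simultaneous outage is vanishingly rare
```

The catch, of course, is the independence assumption: grid failures are correlated across a city, which is why geographic spread matters as much as unit count.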


One particularly interesting factor is that when you buy large quantities of power for a data center, it is delivered by the utility in high-voltage form. These high-voltage feeds (usually in the 10 to 20 thousand volt range) need to be stepped down to lower working voltages, which brings efficiency losses; distributed throughout the data center, which again brings energy losses; and eventually delivered to the critical load at the working voltage (240VAC is common in North America, with some devices using 120VAC). The power distribution system represents approximately 40% of the total cost of the data center. Included in that number are the backup generators, step-down transformers, power distribution units, and uninterruptable power supplies. Ignoring the UPS and generators, since we’re comparing non-redundant power, two interesting factors jump out: 1) the cost of the power distribution system, ignoring power redundancy, is 10 to 20% of the cost of the data center, and 2) the power losses through distribution run 10 to 12% of the power brought into the center.


This is somewhat ironic in that a single-family dwelling gets split-phase 120VAC (240VAC between the phases, or 120VAC between either phase and ground) delivered directly to the home.  All the power lost through step-down transformers (usually in the 92 to 96% efficiency range) and through distribution (depending upon the size and length of the conductors) is paid for by the power company. But if you buy huge quantities of power, as we do in large data centers, the power company delivers high-voltage lines to the property, and you need to pay the substantial capital cost of step-down transformers and, in addition, pay for the power distribution losses.  Ironically, if you don’t buy much power, the infrastructure is free; if you buy huge amounts, you need to pay for the infrastructure.  In the case of condos, the owners need to pay for the inside-the-building distribution, so they are somewhere between single-family dwellings and data centers, paying for part of the infrastructure but not as much as a data center.


Perhaps the power companies have found a way to segment the market into consumer vs. business.  Businesses pay more because they are willing to pay more.  Just as businesses pay more for telephone service and airplane travel, businesses also pay more for power.  Despite the great deals we’ve been reading about, data centers are actually paying more for power than consumers after factoring in the capital costs.    Thus, it is a mistake to move computation from the home to the cloud, because doing so moves the cost structure from consumer rates to business rates.


The condo solution might be pushing the limit a bit but whenever we see a crazy idea even within a factor of two of what we are doing today, something is wrong.  Let’s go pick some low hanging fruit.


Ken Church & James Hamilton

{Church, JamesRH}


Saturday, April 05, 2008 11:15:44 PM (Pacific Standard Time, UTC-08:00)
 Thursday, April 03, 2008

A couple of interesting directions brought together: 1) Oracle compatible DB startup, and 2) a cloud-based implementation.


The Oracle-compatible offering is EnterpriseDB. They use the PostgreSQL code base and implement Oracle compatibility to make it easy for the huge Oracle install base to move to them.  An interesting approach.  I used to lead the SQL Server Migration Assistant team, so I know that true Oracle compatibility is tough, but even less-than-100% compatibility makes it easier for Oracle apps to port over. The pricing model is free for a developer license and $6k/socket for their Advanced Server edition.


The second interesting offering is from Elastra.  It’s a management and administration system that automates deploying and managing dynamically scalable services. Part of the Elastra offering is support for Amazon AWS EC2 deployments.


Bring together EnterpriseDB and Elastra and you have an Oracle-compatible database, hosted in EC2, with deployment and management support: ELASTRA Propels EnterpriseDB into the Cloud. I couldn’t find any customer usage examples, so this may be more press release than fully exercised, ready-for-prime-time solution, but it’s a cool general direction and I expect to see more offerings along these lines over the next few months.  Good to see.





Thursday, April 03, 2008 11:17:15 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Wednesday, April 02, 2008

I’m a big believer in auto-installable client software but I also want a quality user experience.  For data-intensive applications, I want a caching client. I use and love many browser-hosted clients but, for development work, email, and photo editing, I still use installable software. I want a snappy user experience, I need to be able to run disconnected or weakly connected, and I want to fully use my local resources.  Speed and richness are king for these apps – it’s the casual apps that are being replaced well by browser-based software in my world. 


However, I’ve been blown away by how fast the set of applications I’m willing to run in the browser has been expanding. For example, Yahoo Mail impressed me when it came out. Both Google and Live maps are impressive (how can anyone understand and maintain that much JavaScript?).  In fact, in the ultimate compliment, these mapping services are good enough that, even though I have local mapping software installed, I seldom bother to start it.  


Here’s another one, announced last week, that is truly impressive:  The Adobe online implementation of Photoshop is an eye opener. Predictably, it’s Flash and Flex based and, wow, it’s amazing for a within-the-browser experience.  I’m personally still editing my pictures locally but Photoshop Express shows a bit of what’s possible.




James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | |

Wednesday, April 02, 2008 11:18:16 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services | Software
 Tuesday, April 01, 2008

Microsoft has been investigating and testing containers and modular data centers for some time now.  I wrote about them some time back in Architecture for Modular Data Centers (presentation) at the 2007 Conference on Innovative Data Systems Research. Around that time Rackable Systems and Sun Microsystems announced shipping-container-based solutions and Rackable shipped the first production container.  That first unit had more than 1,000 servers.  Rackable and Sun helped get this started, as early on most of the industry was somewhere between skeptical and actively resistant.


Over the last couple of years, the modular datacenter approach has gained momentum.  Now nearly all data center equipment providers have started offering container based solutions:

·         IBM Scalable modular data center

·         Rackable ICE Cube™ Modular Data Center

·         Sun Modular Datacenter S20 (project Blackbox)

·         Dell Insight

·         Verari Forest Container Solution


It’s great to see all the major systems providers investing in modular data centers. I expect the pace of innovation to pick up, and over the last two weeks I’ve seen three new designs.  Things are moving.


Yesterday Mike Manos, who leads the Microsoft Global Foundations data center team, made the first public announcement of a containerized production data center at Data Center World. The Microsoft Chicago facility is a two-floor design where the first floor is a containerized design housing 150 to 220 40’ containers, each with 1,000 to 2,000 servers.   Chicago is a large facility, with the low end of the ranges Mike quoted yielding 150k servers and the high end running to 440k servers.  If you assume 200W/server, the critical load would run between 30MW and 88MW for the half of the data center that is containerized.  If you conservatively assume a PUE of 1.5, we can estimate the containerized portion of the data center at between 45MW and 132MW total load.  It’s a substantial facility.
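To sanity-check those ranges, here is the arithmetic as a back-of-envelope sketch. The 200W/server and PUE of 1.5 are the assumptions stated above, not Microsoft-published numbers:

```python
# Back-of-envelope estimate of the Chicago containerized floor.
# Assumptions (mine, per the text): 200W/server critical load, PUE of 1.5.
def chicago_load(containers, servers_per_container, watts_per_server=200, pue=1.5):
    servers = containers * servers_per_container
    critical_mw = servers * watts_per_server / 1e6   # IT (critical) load in MW
    total_mw = critical_mw * pue                     # facility load incl. overhead
    return servers, critical_mw, total_mw

low = chicago_load(150, 1000)    # 150k servers, 30MW critical, 45MW total
high = chicago_load(220, 2000)   # 440k servers, 88MW critical, 132MW total
```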


John Rath posted great notes on Mike’s entire talk:  And, I’m excited about this news now being public, so when Mike gets back into the office at Redmond I’ll pester him to see if he can release the slides he used.  If so, I’ll post them here.


Thanks to Rackable Systems and Sun Microsystems for getting the industry started on commodity-based containerized designs.  We now have modular components from most major server vendors, and Mike’s talk yesterday at Data Center World marked the first publicly announced modular facility.




James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | | 

Tuesday, April 01, 2008 11:19:54 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
 Monday, March 31, 2008

Tom Kleinpeter was one of the founders of Foldershare (acquired by Microsoft in 2006) and before that he was a part of the original team at Audiogalaxy. I worked with Tom while he was at Microsoft working on Mesh. Tom recently decided to take some time off, to relax, be a father, and it looks like he’s also finding time to write up some of his experiences. I particularly like the Audiogalaxy Chronicles, where he writes up his experiences with Audiogalaxy, which grew as only a successful startup can, shooting to 80 million page views a day from 35 million unique users.


I found this post particularly interesting where Tom describes scaling the Audiogalaxy design and some of the challenges they had in scaling to 80 million page views a day:


Read them all:




James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | |

Monday, March 31, 2008 11:21:05 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Sunday, March 30, 2008

There is (again) a rumor out there that Google will soon offer a third party service platform:  I mostly ignore the rumors but this is one I find hard to ignore. Why?  Mostly because it makes too much sense.  The Google infrastructure investment combined with phenomenal scale yields some of the lowest cost compute and storage in the industry.  They can sell compute and storage at considerably above their costs and yet still offer substantial cost reductions to smaller services.  That’s if they choose to charge for it.  Google also has the highest scale advertising platform in the world, offering the opportunity to monetize even that for which they don’t directly charge.  When something looks like it makes sense economically and fits in strategically, it just about has to happen.


We all know that these rumors often have nothing at all behind them.  Some are simply excited fabrications. But, even knowing that, on this one it’s a matter of when rather than if.


Thanks to Dare Obasanjo for pointing me to the blog posting above.




James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | |

Sunday, March 30, 2008 9:23:04 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Thursday, March 27, 2008

Yahoo! hosted the Hadoop Summit Tuesday of this week.  I posted my rough notes on the conference over the course of the day – this posting summarizes some of what caught my interest and consolidates my notes.


Yahoo expected 100 attendees and ended up having to change venues to get closer to fitting the more than 400 who wanted to attend.  For me the most striking thing is that Hadoop is now clearly in broad use and at scale. Doug Cutting did a quick survey at the start: roughly half the crowd were running Hadoop in production and around 1/5 have over 100-node clusters. Yahoo remains the biggest, with 2,000 nodes in their cluster.


Christian Kunz of Yahoo! gave a bit of a window into how Yahoo! is using Hadoop to process their Webmap data store. The Webmap is a structured storage representation of all Yahoo! crawled pages and all the metadata they extract or compute on those pages.  There are over 100 Webmap applications used in managing the Yahoo! indexing engine. Christian talked about why they moved to Hadoop from the legacy system and summarized the magnitude of the workload they are running. These are almost certainly the largest Hadoop jobs in the world. The longest map/reduce jobs run for over three days and have 100k maps and 10k reduces. This job reads 300 TB and produces 200 TB.


Another informative talk was given by the Facebook team. They described Hive, the data warehouse at Facebook.  Joydeep Sarma and Ashish Thusoo presented this work. I liked this talk as it was 100% customer driven: they implemented what the analysts and programmers inside Facebook needed, and I found their observations credible and interesting.  They reported that analysts are used to SQL and found a SQL-like language most productive, but that programmers like to have direct access to map/reduce primitives.  As a consequence, they provide both (so do we).  The Facebook team reports that roughly 25% of the development team uses Hive and they process 3,500 map/reduce jobs a week.


Google is heavily invested in Hadoop, using it as a teaching vehicle even though it’s not used internally.  The Google interest in Hadoop is to get graduating students more familiar with the map/reduce programming model. Several schools have agreed to teach the map/reduce programming model using Hadoop; for example, Berkeley, CMU, MIT, Stanford, UW, and UMD all plan courses.


The agenda for the day:







Welcome & Logistics

Ajay Anand, Yahoo!


Hadoop Overview

Doug Cutting / Eric Baldeschwieler, Yahoo!



Chris Olston, Yahoo!



Kevin Beyer, IBM





Michael Isard, Microsoft


Monitoring Hadoop using X-Trace

Andy Konwinski and Matei Zaharia, UC Berkeley



Ben Reed, Yahoo!





Michael Stack, Powerset


Hbase at Rapleaf

Bryan Duxbury, Rapleaf



Joydeep Sen Sarma / Ashish Thusoo, Facebook


GrepTheWeb - Hadoop on AWS

Jinesh Varia,




Building Ground Models of Southern California

Steve Schlosser, David O'Hallaron, Intel / CMU


Online search for engineering design content

Mike Haley, Autodesk


Yahoo - Webmap

Arnab Bhattacharjee, Yahoo!


Natural language Processing

Jimmy Lin, U of Maryland / Christophe Bisciglia, Google


Panel on future directions

Sameer Paranjpye, Sanjay Radia, Owen O’Malley (Yahoo), Chad Walters (Powerset), Jeff Eastman (Mahout)

My more detailed notes are at: HadoopSummit2008_NotesJamesRH.doc (81.5 KB). Peter Lee’s Hadoop Summit summary is at:


James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | |

Thursday, March 27, 2008 11:53:46 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Tuesday, March 25, 2008

HBase: Michael Stack (Powerset)

·         Distributed DB built on Hadoop core

·         Modeled on BigTable

·         Same advantages as BigTable:

o   Column store

§  Efficient compression

§  Support for very wide tables when most columns aren’t looked at together

o   Nulls stored for free

o   Cells are versioned (cells addressed by row, col, and timestamp)

·         No join support

·         Rows are ordered lexicographically

·         Columns grouped into columnfamilies

·         Tables are horizontally partitioned into regions

·         Like Hadoop: master node and regionServers

·         Client initially goes to master to find the RegionServer. Cached thereafter. 

o   On failure (or split) or other change, fail the client and it will go back to master.

·         All java access and implementation. 

o   Thrift server hosting supports C++, Ruby, and Java (via thrift) clients

o   REST server supports a Ruby gem

·         Focusing on developing a user/developer base for HBase

·         Three committers: Jim, Bryan Duxbury, and Michael Stack
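The BigTable-style model in the notes above (cells addressed by row, column, and timestamp; lexicographically ordered rows; absent cells costing nothing) can be sketched in a few lines. This is my own toy illustration of the data model, not HBase's API:

```python
import bisect

# Toy versioned column store: a cell is addressed by (row, column, timestamp),
# absent cells cost nothing ("nulls stored for free"), and row keys are kept
# in lexicographic order for scans. Names are illustrative.
class TinyTable:
    def __init__(self):
        self.cells = {}   # (row, column) -> list of (timestamp, value)
        self.rows = []    # sorted row keys, for ordered scans

    def put(self, row, column, timestamp, value):
        self.cells.setdefault((row, column), []).append((timestamp, value))
        i = bisect.bisect_left(self.rows, row)
        if i == len(self.rows) or self.rows[i] != row:
            self.rows.insert(i, row)          # keep rows lexicographically sorted

    def get(self, row, column, timestamp=None):
        """Latest version, or the newest version at or before `timestamp`."""
        versions = self.cells.get((row, column), [])
        if timestamp is not None:
            versions = [(ts, v) for ts, v in versions if ts <= timestamp]
        return max(versions)[1] if versions else None

    def scan(self):
        return list(self.rows)                # ordered row scan
```

Asking for a cell at an earlier timestamp returns the older version, which is the "cells are versioned" property from the talk.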


Hbase at Rapleaf: Bryan Duxbury

·         Rapleaf is a people search application.  Supports profile aggregation, Data API

·         “It’s a privacy tool for yourself and a stalking tool for others”

·         Custom Ruby web crawler

·         Index structured data from profiles

·         They are using HBase to store pages (HBase via REST servlet)

·         Cluster specs:

o   HDFS/HBase cluster of 16 machines

o   2TB of disk (big plans to grow)

o   64 cores

o   64GB memory

·         Load:

o   3.6TB/month

o   Average row size: 65KB (14KB gzipped)

o   Predominantly new rows (not versioned)


Facebook Hive: Joydeep Sen Sarma & Ashish Thusoo (Facebook Data Team)

·         Data warehousing using Hadoop

·         Hive is the Facebook data warehouse

·         Query language brings together SQL and streaming

o   Developers love direct access to map/reduce and streaming

o   Analysts love SQL

·         Hive QL (parser, planner, and execution engine)

·         Uses the Thrift API

·         Hive CLI implemented in Python

·         Query operators in initial versions

o   Projections, equijoins, cogroups, groupby, & sampling

·         Supports views as well

·         Supports 40 users (about 25% of engineering team)

·         200GB of compressed data per day

·         3,514 jobs run over the last 7 days

·         5 engineers on the project

·         Q: Why not use PIG? A: Wanted to support SQL and Python.
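The dual interface described above (SQL for analysts, map/reduce primitives for developers) can be sketched with a toy in-process map/reduce. This is my own illustration, not Hive's actual API; the SQL-style GROUP BY/COUNT is expressed over the same primitives a developer would call directly:

```python
from itertools import groupby
from operator import itemgetter

# Toy in-process map/reduce: run the mapper over every record, group the
# emitted (key, value) pairs by key, then run the reducer per key.
def map_reduce(records, mapper, reducer):
    pairs = sorted(kv for rec in records for kv in mapper(rec))
    return {k: reducer(k, [v for _, v in group])
            for k, group in groupby(pairs, key=itemgetter(0))}

# The analyst's "SELECT page, COUNT(*) FROM hits GROUP BY page",
# expressed over the developer-facing primitives:
def count_by_page(hits):
    return map_reduce(hits,
                      mapper=lambda hit: [(hit["page"], 1)],
                      reducer=lambda page, ones: sum(ones))

hits = [{"page": "a"}, {"page": "b"}, {"page": "a"}]
# count_by_page(hits) -> {"a": 2, "b": 1}
```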


Processing Engineering Design Content with Hadoop and Amazon

·         Mike Haley (Autodesk)

·         Running classifiers over CAD drawings and classifying them according to what the objects actually are. The problem they are trying to solve is to allow someone to look for drawings of wood doors and to find elm doors, wood doors, pine doors and not find non-doors.

·         They were running on an internal autodesk cluster originally. Now running on an EC2 cluster to get more resources in play when needed.

·         Mike showed some experimental products that showed power and gas consumption over entire cities by showing the lines and using color and brightness to show consumption rate.  Showed the same thing to show traffic hot spots.  Pretty cool visualizations.


Yahoo! Webmap: Christian Kunz

·         Webmap is now built in production using Hadoop

·         Webmap is a gigantic table of information about every web site, page, and link Yahoo! tracks.

·         Why port to Hadoop

o   Old system only scales to 1k nodes (Hadoop cluster at Y! is at 2k servers)

o   One failed or slow server, used to slow all

o   High management costs

o   Hard to evolve infrastructure

·         Challenges: port ~100 webmap applications to map/reduce

·         Webmap builds don’t run on a stock latest Hadoop release; they require patches

·         These are almost certainly the largest Hadoop jobs in the world:

o   100,000 maps

o   10,000 reduces

o   Runs 3 days

o   Moves 300 terabytes

o   Produces 200 terabytes

·         Believe they can gain another 30 to 50% improvement in run time.


Computing in the cloud with Hadoop

·         Christophe Bisciglia: Google open source team

·         Jimmy Lin: Assistant Professor at University of Maryland

·         Set up a 40-node cluster at UW.

·         Using Hadoop to help students and academic community learn the map/reduce programming model.

·         It’s a way for Google to contribute to the community without open sourcing Map/Reduce

·         Interested in making Hadoop available to other fields beyond computer science

·         Five universities in program: Berkeley, CMU, MIT, Stanford, UW, UMD

·         Jimmy Lin showed some student projects, including a statistical machine translation project that was a compelling use of Hadoop.

·         Berkeley will use Hadoop in their introductory computing course (~400 students).


Panel on Future Directions:

·         Five speakers from the Hadoop community:

1.       Sanjay Radia

2.       Owen O’Malley (Yahoo & chair of the Apache PMC for Hadoop)

3.       Chad Walters (Powerset)

4.       Jeff Eastman (Mahout)

5.       Sameer Paranjpye

·         Yahoo planning to scale to 5,000 nodes in near future (at 2k servers now)

·         Namespace is entirely in memory.  Considering implementing volumes: volumes will share data, and just the namespace will be partitioned.  Volume name spaces will be “mounted” into a shared file tree.

·         HoD scheduling implementation has hit the wall.  Need a new scheduler. HoD was a good short term solution but not adequate for current usage levels.  It’s not able to handle the large concurrent job traffic Yahoo! is currently experiencing.

·         Jobs often have a large virtual partition for the maps. Because they are held during reduce phase, considerable resources are left unused.

·         FIFO scheduling doesn’t scale for large, diverse user bases.

·         What is needed to declare Hadoop 1.0: API Stability, future proof API to use single object parameter, add HDFS single writer append, & Authentication (Owen O’Malley)

·         The Mahout project builds classification, clustering, regression, etc. kernels that run on Hadoop, released under the commercially friendly Apache license.

·         Plans for HBase looking forward:

1.       0.1.0: Initial release

2.       0.2.0: Scalability and Robustness

3.       0.3.0: Performance


James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | |  | blog:


Tuesday, March 25, 2008 11:51:51 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Monday, March 24, 2008

X-Tracing Hadoop: Andy Konwinski

·         Berkeley student with the Berkeley RAD Lab

·         Motivation: Make Hadoop map/reduce jobs easier to understand and debug

·         Approach: X-trace Hadoop (500 lines of code)

·         X-trace is a path based tracing framework

·         Generates an event graph to capture causality of events across a network.

·         X-Trace collects: report label, trace id, report id, hostname, timestamp, etc.

·         What we get from Xtrace:

o   Deterministic causality and concurrency

o   Control over which events get traced

o   Cross-layer

o   Low overhead (modest sized traces produced)

o   Modest implementation complexity

·         Want real, high-scale production data sets. Facebook has been very helpful, but Andy is after more data to show the value of the X-Trace approach to Hadoop debugging.  Contact him if you want to contribute data.
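A report stream of the kind listed above (trace id, report id, host, plus causal links) carries enough to rebuild the event graph. A small sketch with illustrative field names, not X-Trace's actual report format:

```python
# Illustrative reports for one trace: two concurrent map events on different
# hosts, both causally preceding a reduce event. Field names are my own.
reports = [
    {"trace": "t1", "id": "map-1",    "parents": [],                 "host": "n1"},
    {"trace": "t1", "id": "map-2",    "parents": [],                 "host": "n2"},
    {"trace": "t1", "id": "reduce-1", "parents": ["map-1", "map-2"], "host": "n3"},
]

def causal_edges(reports):
    """(cause -> effect) edges of the event graph, recovered from parent ids."""
    return [(p, r["id"]) for r in reports for p in r["parents"]]

# causal_edges(reports) -> [("map-1", "reduce-1"), ("map-2", "reduce-1")]
# Events with no path between them (map-1, map-2) ran concurrently.
```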


ZooKeeper: Benjamin Reed (Yahoo Research)

·         Distributed consensus service

·         Observation:

o   Distributed systems need coordination

o   Programmers can’t use locks correctly

o   Message based coordination can be hard to use in some applications

·         Wishes:

o   Simple, robust, good performance

o   Tuned for read dominant workloads

o   Familiar models and interface

o   Wait-free

o   Need to be able to wait efficiently

·         Google uses Locks (Chubby) but we felt this was too complex an approach

·         Design point: start with a file system API model and strip out what is not needed

·         Don’t need:

o   Partial reads & writes

o   Rename

·         What we do need:

o   Ordered updates with strong persistence guarantees

o   Conditional updates

o   Watches for data changes

o   Ephemeral nodes

o   Generated file names (mktemp)

·         Data model:

o   Hierarchical name space

o   Each znode has data and children

o   Data is read and written in its entirety

·         All APIs take a path (no file handles and no open and close)

·         Quorum-based updates with reads from any server (you may get old data; if you call sync first, the next read will be current at least as of the time the sync ran).  All updates flow through an elected leader (re-elected on failure).

·         Written in Java

·         Started Oct 2006.  Prototyped fall 2006.  Initial implementation March 2007.  Open sourced Nov 2007.

·         A Paxos variant (modified multi-paxos)

·         Zookeeper is a software offering in Yahoo whereas Hadoop
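A minimal in-memory sketch of the znode model from these notes: a hierarchical namespace where data is read and written in its entirety, one-shot watches on data changes, and generated (mktemp-style) names. Class and method names are my own; replication, quorums, sessions, and ephemeral-node cleanup are all omitted:

```python
# Toy znode tree illustrating the data model only (no distribution).
class ZNodeTree:
    def __init__(self):
        self.nodes = {"/": b""}
        self.watches = {}            # path -> callbacks, each fired at most once

    def create(self, path, data=b"", sequential=False):
        if sequential:               # generated name, like mktemp
            n = sum(1 for p in self.nodes if p.startswith(path))
            path = f"{path}{n:010d}"
        self.nodes[path] = data
        return path

    def set_data(self, path, data):  # data is written in its entirety
        self.nodes[path] = data
        for cb in self.watches.pop(path, []):   # watches are one-shot
            cb(path)

    def get_data(self, path, watch=None):
        if watch is not None:        # register a watch for the next change
            self.watches.setdefault(path, []).append(watch)
        return self.nodes[path]
```

The sequential-name feature is the building block for coordination recipes: each participant creates a numbered node under a shared parent, and the lowest number wins.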


Note: Yahoo is planning to start a monthly Hadoop user meeting.


James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | |  | blog:


Monday, March 24, 2008 11:50:52 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback

JAQL: A Query Language for JSON

·         Kevin Beyer from IBM (did the DB2 Xquery implementation)

·         Why use JSON?

o   Want complete entities in one place (non-normalized)

o   Want evolvable schema

o   Want standards support

o   Didn’t want a DOC markup language (XML)

·         Designed for JSON data

·         Functional query language (few side effects)

·         Core operators: iteration, grouping, joining, combining, sorting, projection, constructors (arrays, records, values), unnesting, etc.

·         Operates on anything that is JSON format or can be transformed to JSON and produces JSON or any format that can be transformed from JSON.

·         Planning to:

o   Add indexing support

o   Open source next summer

o   Add schema and integrity support
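As a rough illustration of the operator set listed above, here is the same kind of iteration, projection, and unnesting applied to JSON in plain Python. This is a sketch of the style of transformation, not JAQL syntax:

```python
import json

# Non-normalized JSON entities of the kind described: each order carries its
# items inline rather than in a separate normalized table.
orders = json.loads('[{"id": 1, "items": ["a", "b"]}, {"id": 2, "items": ["c"]}]')

# Unnest each order's items and project (id, item) records -- JSON in, JSON out.
flat = [{"id": o["id"], "item": it} for o in orders for it in o["items"]]
# -> [{"id": 1, "item": "a"}, {"id": 1, "item": "b"}, {"id": 2, "item": "c"}]
```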


DryadLINQ: Michael Isard (Msft Research)

·         Implementation performance:

o   Rather than temp between every stage, join them together and stream

o   Makes failure recovery more difficult but it’s a good trade off

·         Join and split can be done with Map/Reduce but ugly to program and hard to avoid performance penalty

·         Dryad is more general than Map/Reduce and addresses the above two issues

o   Implements a uniform state machine for scheduling and fault tolerance

·         LINQ addresses the programming model and makes it more accessible

·         Dryad supports changing the resource allocation (number of servers used) dynamically during job execution

·         Generally, Map/Reduce is complex so front-ends are being built to make it easier: e.g. PIG & Sawzall

·         LINQ: General purpose data-parallel programming constructs

·         LINQ+C# provides parsing, type-checking, & lazy evaluation

o   It builds an expression tree and materializes data only when requested

·         PLINQ: supports parallelizing LINQ queries over many cores

·         Lots of interest in seeing this code released as open source and interest from the community in building upon it.  Some comments were very positive about how far along the work is, matched with more negative comments about it being closed rather than open source available for others to innovate upon.
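The lazy-evaluation point above can be illustrated with Python generators, which share the build-the-query-first, materialize-on-demand behavior described for LINQ expression trees. A sketch of the concept, not LINQ itself:

```python
# Lazy pipeline stages: nothing executes until results are requested.
def where(source, pred):
    return (x for x in source if pred(x))

def select(source, fn):
    return (fn(x) for x in source)

evaluated = []
def numbers():
    for n in range(5):
        evaluated.append(n)   # record when each element is actually pulled
        yield n

# Building the query evaluates nothing: evaluated is still [] here.
query = select(where(numbers(), lambda n: n % 2 == 0), lambda n: n * n)

results = list(query)         # materialization triggers the whole pipeline
# results == [0, 4, 16]
```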


James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | |  | blog:


Monday, March 24, 2008 11:47:35 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback

PIG: Web-Scale Processing

·         Christopher Olston

·         The project originated in Y! Research.

·         Example data analysis task: Find users that visit “good” web pages.

·         Christopher points out that joins are hard to write in Hadoop, there are many ways of writing them, and choosing a join technique is actually a problem that requires some skill; basically the same point made by the DB community years ago.  PIG is a dataflow language that describes what you want to happen logically and then maps it to map/reduce.  The language of PIG is called Pig Latin.

·         Pig Latin allows the declaration of “views” (late bound queries)

·         Pig Latin is essentially a text form of a data flow graph.  It generates Hadoop Map/Reduce jobs.

o   Operators: filter, foreach … generate, & group

o   Binary operators: join, cogroup (“more customizable type of join”), & union

o   Also support split operator

·         How different from SQL?

o   It’s a sequence of simple steps rather than a declarative expression.  SQL is declarative whereas Pig Latin says what steps you want done in what order.  Much closer to imperative programming and, consequently, they argue it is simpler.

o   They argue that it’s easier to build a set of steps, work with each one at a time, and slowly build them up into a complete and correct program.

·         PIG is written as a language processing layer over Map/Reduce

·         He proposed writing SQL as a processing layer over PIG, but this code isn’t yet written

·         Is PIG+Hadoop a DBMS? (there have been lots of blogs on this question :-))

o   P+H only support sequential scans super efficiently (no indexes or other access methods)

o   P+H operate on any data format (PIGS eat anything) whereas DBMS only run on data that they store

o   P+H is a sequence of steps rather than a sequence of constraints as used in DBMS

o   P+H has custom processing as a “first class object” whereas UDFs were added to DBMSs later

·         They want an Eclipse development environment but don’t have it running yet. Planning an Eclipse Plugin.

·         Team of 10 engineers currently working on it.

·         New version of PIG to come out next week will include “explain” (shows mapping to map/reduce jobs to help debug).

·         Today PIG does joins exactly one way. They are adding more join techniques.  There aren’t explicit stats tracked other than file size.  The next version will allow the user to specify the join technique. They will explore optimization.
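The example task from the talk, finding users that visit good pages, can be written as the same kind of step-at-a-time dataflow in plain Python. The data, field names, and 0.5 rank threshold are all illustrative, and this is not Pig Latin syntax, just the stepwise style the talk contrasts with declarative SQL:

```python
# Each step names an intermediate result, mirroring the Pig Latin style of
# building a dataflow one operator at a time (FILTER, JOIN, GROUP, FOREACH).
visits = [("amy", "p1"), ("amy", "p2"), ("bob", "p3")]   # (user, url)
pages  = [("p1", 0.9), ("p2", 0.4), ("p3", 0.8)]         # (url, pagerank)

good_pages = [(url, rank) for url, rank in pages if rank > 0.5]      # FILTER
joined = [(user, url) for user, url in visits
          for gurl, _ in good_pages if url == gurl]                  # JOIN

by_user = {}                                                         # GROUP
for user, url in joined:
    by_user.setdefault(user, []).append(url)

counts = {user: len(urls) for user, urls in by_user.items()}         # FOREACH ... GENERATE
# counts == {"amy": 1, "bob": 1}
```

Each intermediate (good_pages, joined, by_user) can be inspected on its own, which is the debuggability argument the talk makes for the stepwise form.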


James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | |  | blog:


Monday, March 24, 2008 11:46:32 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback

Yahoo is hosting a conference the Hadoop Summit down in Sunnyvale today. There are over 400 attendees of which more than ½ are current Hadoop users and roughly 15 to 20% are running more than 100 node clusters.


I’ll post my rough notes from the talks over the course of the day.  So far, it's excellent. My notes on the first two talks are below.




Hadoop: A Brief History

·         Doug Cutting

·         Started with Nutch in 2002 to 2004

o   Initial goal was web-scale, crawler-based search

o   Distributed by necessity

o   Sort/merge based processing

o   Demonstrated on 4 nodes over 100M web pages. 

o   Was operationally onerous. “Real” web scale was still a ways away

·         2004 through 2006: Gestation period

o   GFS & MapReduce papers published (addressed the scale problems we were having)

o   Add DFS and MapReduce to Nutch

o   Two part-time developers over two years

o   Ran on 20 nodes at Internet Archive (IA) and UW

o   Much easier to program and run

o   Scaled to several hundred million web pages

·         2006 to 2008: Childhood

o   Y! hired Doug Cutting and a dedicated team to work on it reporting to E14 (Eric  Baldeschwieler)

o   Hadoop project split out of Nutch

o   Hit web scale in 2008


Yahoo Grid Team Perspective: Eric Baldeschwieler

·         Grid is Eric’s team internal name

·         Focus:

o   On-demand, shared access to vast pools of resources

o   Support massive parallel execution (2k nodes and roughly 10k processors)

o   Data Intensive Super Computing (DISC)

o   Centrally provisioned and managed

o   Service-oriented, elastic

o   Utility for user and researchers inside Y!

·         Open Source Stack

o   Committed to open source development

o   Y! is Apache Platinum Sponsor

·         Project on Eric’s team:

o   Hadoop:

§  Distributed File System

§  MapReduce Framework

§  Dynamic Cluster Management (HOD)

·         Allows sharing of a Hadoop cluster with 100’s of users at the same time.

·         HOD: Hadoop on Demand. Creates virtual clusters using Torque (an open source resource manager).  Allocates the cluster into many virtual clusters.

o   PIG

§  Parallel Programming Language and Runtime

o   Zookeeper:

§  High-availability directory and configuration service

o   Simon:

§  Cluster and application monitoring

§  Collects stats from 100’s of clusters in parallel (fairly new so far).  Also will be open sourced.

§  All will eventually be part of Apache

§  Similar to Ganglia but more configurable

§  Builds real time reports.

§  Goal is to use Hadoop to monitor Hadoop.

·         Largest production clusters are currently 2k nodes.  Working on more scaling.  Don’t want to have just one cluster but want to run much bigger clusters. We’re investing heavily in scheduling to handle more concurrent jobs.

·         Using 2 data centers and moving to three soon.

·         Working with Carnegie Mellon University (Yahoo provided a container of 500 systems – it appears to be a Rackable Systems container)

·         We’re running Megawatts of Hadoop

·         Over 400 people expressed interest in this conference.

o   About ½ the room running Hadoop

o   Just about the same number running over 20 nodes

o   About 15 to 20% running over 100 nodes


James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | |  | blog:


Monday, March 24, 2008 11:45:30 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback

I’ve long been a big fan of modular data centers using ISO standard shipping containers as the component building block:

Containers have revolutionized shipping and are by far the cheapest way to move goods over sea, land, rail, or truck. I’ve seen them used to house telecommunications equipment and power generators, and even stores and apartments have been made from them:


The datacenter-in-a-box approach to datacenter design is beginning to be deployed more widely with Lawrence Berkeley National Lab having taken delivery of a Sun Black Box and a “customer in eastern Washington” having taken delivery of a Rackable Ice Cube Module earlier this year.


Last summer I came across a book on Shipping Containers by Marc Levinson: The Box: How the Shipping Container Made the World Smaller and the World Economy Bigger. It’s a history of containers from the early experiments in 1956 through to mega-containers terminals distributed throughout the world. The book doesn’t talk about all the innovative applications of containers outside of shipping but does give an interesting background on their invention, evolution, and standardization.



James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | |

Monday, March 24, 2008 10:42:00 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
 Tuesday, March 18, 2008

Earlier today I viewed Steve Jobs’ 2005 commencement speech at Stanford University. In this talk Jobs recounts three stories and ties them together with a common theme.  The first was dropping out of Reed College and showing up for the courses he wanted to take rather than spending time on those he had to take. Dropping out was a tough decision but some of what he learned in those audited courses had a fundamental impact on the Mac and, in retrospect, it appears to have been a good decision, or at least one that led to a good outcome.  Getting fired from Apple was the second.  Clearly not his choice, not what he would have wanted to happen, but it led to Pixar, NeXT, and rejoining Apple stronger and more experienced than before. Again, a tough path but one that may have led to a better overall outcome. Likely he is a better and more capable leader for the experience.  Finally, facing death.  Death awaits us all and, when facing death, it becomes clear what really matters.  It becomes clear that following your heart is what is really important. Don’t be trapped by dogma, don’t live other people’s lives, and have the courage to follow your own intuition. Clearly nobody wants to approach death, but knowing it is coming can free each of us to realize we can’t hide, we don’t have forever, and those things that scare us most are really tiny and insignificant when compared with death. Facing death can free us to take chances and to do what is truly important even if success looks uncertain or the risk is high.


The theme that wove these three stories together, and Jobs’ parting words for his listeners, was to “stay hungry and stay foolish”.

It’s a good read:  Or you can view it at:


Sent my way by Michael Starbird-Valentine.




James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | |

Tuesday, March 18, 2008 11:55:59 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

All Content © 2015, James Hamilton