Wednesday Yahoo announced they have built a petascale, distributed relational database. In Yahoo Claims Record With Petabyte Database, the details are thin but they built on the PostgreSQL relational database system. In Size matters: Yahoo claims 2-petabyte database is world's biggest, busiest, the system is described as an over 2 petabyte repository of user click stream and context data with an update rate of 24 billion events per day. Waqar Hasan, VP of Engineering at Yahoo! Data group, describes the system as updated in real time and live – essentially a real time data warehouse where changes go in as they are made and queries always run against the most current data. I strongly suspect they are bulk parsing logs and the data is being pushed into the system in large bulk units but, even if it is only near real time, this update rate is impressive.
The original work was done at a Seattle startup called Mahat Technologies acquired by Yahoo! in November 2005.
The approach appears to be similar to what we did with IBM DB2 Parallel Edition. 13 years ago we had it running on a cluster of 512 RS/6000s at the Maui Super Computer Center and 256 nodes at the Cornell Theory Center. It’s a shared nothing design, which means that each server in the cluster has its own disks and the servers don’t share memory. The upside of this approach is it scales incredibly well. It looks like Yahoo! has done something similar using PostgreSQL as the base technology. Each node in the cluster runs a full copy of the storage engine. The query execution engine is replaced with one modified to run over a cluster and use a communications fabric to interconnect the nodes in the cluster. The parallel query plans are run over the entire cluster with the plan nodes interconnected by the communication fabric. The PostgreSQL client, communications protocol and server side components, with some big exceptions, run mostly unchanged. The query optimizer is either replaced completely with a cluster parallel aware implementation that models the data layout and cluster topology in making optimization decisions, or the original, non-cluster parallel optimizer is used and the resultant single node plans are then optimized for the cluster in a post optimization phase. The former will yield provably better plans but it’s also more complex. I’m fearful of complexity around optimizers and, as a consequence, I actually prefer the slightly less optimal, post-optimization phase. Many other problems have to be addressed, including having the cluster metadata available on each node to support SQL query compilation, but what I’ve sketched here covers the major points required to get such a design running.
The result is that a modified version of PostgreSQL runs on each node. A client can connect to any of the nodes in the cluster (or a policy restricted subset). A query flows from the client to the server it chose to connect with. The SQL compiler on that node compiles and optimizes the query on that single node (no parallelism). The query optimizer is either cluster-aware or uses a post-optimization cluster-aware component. The resultant query plan, when ready for execution, is divided up into sub-plans (plan fragments) that run on each node connected over the communication fabric. Some execution engines initiate top-down and some bottom-up. I don’t recall what PostgreSQL uses but bottom-up is easier in this case. However, either can be made to work. The plan fragments are distributed to the appropriate nodes in the cluster. Each runs on local data and pipes results to other nodes, which run further plan fragments and forward the results yet again toward the root of the plan. The root of the plan runs on the node that started the compilation and the final results end up there to be returned to the client.
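The fragment flow described above can be sketched in a few lines. This is only a toy model under stated assumptions, not Yahoo's or DB2's implementation: the `Node` and `Cluster` classes, the hash partitioning on `user_id`, and the `count_by_event` query are all hypothetical names invented for illustration, and the "communication fabric" is reduced to an ordinary function call.

```python
from collections import defaultdict


class Node:
    """One shared-nothing node: it owns its rows and shares no memory or disk."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.rows = []  # local partition only: (user_id, event) tuples

    def run_fragment(self, predicate):
        """Leaf plan fragment: scan local rows, filter, and partially aggregate."""
        partial = defaultdict(int)
        for _user_id, event in self.rows:
            if predicate(event):
                partial[event] += 1
        return dict(partial)


class Cluster:
    def __init__(self, node_count):
        self.nodes = [Node(i) for i in range(node_count)]

    def insert(self, user_id, event):
        # Hash partitioning (an assumption) decides which node owns the row.
        owner = self.nodes[hash(user_id) % len(self.nodes)]
        owner.rows.append((user_id, event))

    def count_by_event(self, predicate):
        """Root fragment: merge the partial counts piped up from every node."""
        totals = defaultdict(int)
        for node in self.nodes:  # a real engine would run these in parallel
            for event, count in node.run_fragment(predicate).items():
                totals[event] += count
        return dict(totals)


cluster = Cluster(node_count=4)
for uid in range(100):
    cluster.insert(uid, "click" if uid % 2 == 0 else "view")
print(cluster.count_by_event(lambda event: True))
```

The point of the sketch is that each node only ever touches its own partition, and only small partial aggregates move toward the root, which is why the design scales so well.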
It’s a nice approach and, as evidenced by Yahoo’s experience, it scales, scales, scales. I also like the approach in that most tools and applications can continue to work with little change. Most clusters of this design have some restrictions: for example, unique ID generation and referential integrity are either not supported or slow. Nonetheless, a large class of software can be run without change.
If you are interested in digging deeper into Relational Database technology and how the major commercial systems are written, see Architecture of a Database System.
Yahoo has a long history of contributing to Open Source and they are the largest contributor to the Apache Hadoop project. It’ll be interesting to see if Yahoo! Data ends up open source or held as an internal only asset.
Kevin Merritt pointed me to the Yahoo! Data work.
-jrh
James Hamilton, Windows Live Platform Services Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052 W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | JamesRH@microsoft.com
H:mvdirona.com | W:research.microsoft.com/~jamesrh | blog:http://perspectives.mvdirona.com
Search drives the online commerce world by bringing sellers and buyers together. As a seller, your most important task is getting your site to rank high organically and to have your advertisements placed most prominently and most frequently to users interested in buying, and only to users interested in your product. A buyer chooses a search engine on the basis of how reliably it gets them to what they are looking for and, with commercial queries, to the “best” seller, where best is a fairly complex and hard to define term in this context. Happy buyers keep using the search engine and paying the sellers. Sellers who manage their organic and paid placements correctly sell lots of product. Successful search engines make considerable profit. That’s just the way the ecosystem has evolved – it’s the broadly used search engine that has all the influence and so they end up with considerable profit.
What if the rules changed? What if some of the search engine profit was returned to users? Could this change the ecosystem and could it be a good thing? Let’s watch because Microsoft is about to announce a “cash back service” later today according to Search Engine Land. In this posting, Playing with Live Cashback, the blog author demonstrates using the Live Cashback system and concludes that it won’t have much impact. I’m less certain. I suspect that respecting users and returning some value to them will change this market in a positive way. It’ll be fun to watch over the next 4 to 6 weeks and see how the search ecosystem evolves.
--jrh
There is no question that cloud computing is going to be a big part of the future of server-side systems. What I find interesting is the speed with which this is happening. Look at recent network traffic growth rates from AWS:

From: http://aws.typepad.com/aws/2008/05/lots-of-bits.html
AWS is now consuming considerably more bandwidth than Amazon’s global web sites. Phenomenal growth and impressive absolute size.
Continuing to look at growth, I saw a chart a few weeks back on the Amazon Web Services Blog that illustrates the value of a pay-as-you-go and pay-as-you-grow service. This chart shows the number of EC2 servers in use by Animoto over a period of a couple of weeks. Note the explosion in EC2 server usage in the three day period from 4/15 through 4/18 and imagine trying to do capacity planning for Animoto. They went from roughly 50 servers to needing more than 3,500 in three days. Imagine having to predict growth and get servers racked, stacked and online in time to meet the growth. Nearly impossible.

From: http://aws.typepad.com/aws/2008/04/animoto---scali.html (Emre Kiciman sent it my way).
When you next hear “why web services?”, think of this chart.
Another point I hear frequently around web services is, “sure, they are used by start-ups but REAL enterprises would never use them due to security and data privacy reasons.” Again, utter bunk but it’s a frequently repeated quip. I led the Exchange Hosted Services team and we provided hosted email anti-malware and archiving. The service was originally targeting small and medium sized businesses and many from those categories did use it. But, what was interesting was the number of name-brand, world-wide enterprises that recognized the cost and quality advantages of using hosting services. Valuable internal enterprise resources are best saved for tasks that add value to the business.
Perhaps the large enterprises will use hosted email services but what about low level services such as EC2 and S3? Again, it’s the same story. If the value is there, companies of all sizes will use it. From the Amazon fourth quarter earnings call, TechCrunch reports (Who Are The Biggest Users of Amazon Web Services? It’s Not Startups):
So who are using these services? A high-ranking Amazon executive told me there are 60,000 different customers across the various Amazon Web Services, and most of them are not the startups that are normally associated with on-demand computing. Rather the biggest customers in both number and amount of computing resources consumed are divisions of banks, pharmaceuticals companies and other large corporations who try AWS once for a temporary project, and then get hooked.
Big companies are jumping in as well.
Google recently entered the cloud computing market with Google App Engine. They are only a couple of months into beta and report they have allowed in 60,000 developers in that short period of time. The amazing thing is the apparent size of the backlog. The forums are full of people complaining that they can’t yet get on (Sriram Krishnan sent it my way).
Wired recently published “Cloud Computing. Available at Amazon.com Today”.
It’s unusual for a new model to grow so fast and it’s close to unprecedented to see so much early growth in the enterprise. However, when the potential savings are this large, big things can happen.
--jrh
I’ve been involved with high scale systems software projects, mostly database engines, for the last 20 years and I’ve watched the transition from low level and proprietary languages to C. Then C to C++. Recently I’ve been thinking a bit about what’s next.
Back in the very early 90’s when I was Lead Architect on IBM DB2, I was dead against C++ usage in the Storage Engine and wouldn’t allow exceptions to be used anywhere in the system. At the time, the quality of C++ compilers was variable with some being real compilers that were actually fairly well done (I led the IBM RS/6000 C++ team in the late 80s) while others were Cfront-based and pretty weak. At the time no compiler, including the one I worked on, did a good job implementing exceptions. Times change. SQL Server, for example, is 100% C++ and it makes excellent use of exceptions to clean up resources on failure.
The productivity benefits of new programming languages and tools eventually win out. When they get broad use, implementations improve, reducing the performance tax and, eventually, even very performance sensitive system software makes the transition.
I got interested in Java in the mid-90’s and more recently I’ve been using C# quite a bit partly due to where I work and partly because I actually find the language and surrounding tools impressively good. JITed languages typically don’t perform as well as statically compiled languages but the advantages completely swamp the minor performance costs. And, as managed language (Java, C#, etc.) implementations improve, the performance tax continues to fall. There is no question in my mind that managed languages will end up being broadly used in even the most performance critical software systems such as database engines.
Recently, I’ve gotten interested in Erlang as a systems software implementation language. By most measures, it looks to be an unlikely choice for high scale system software in that it’s interpreted, has a functional subset at its core, and uses message passing rather than shared memory and locks. Basically, it’s just about the opposite of everything you would find in a modern commercial database kernel. So what makes it interesting? The short answer is all the things that make it an unlikely choice also make it interesting. Servers are becoming increasingly unbalanced with CPU speeds continuing to outpace memory and network bandwidth. More and more operations are going to be memory and network bound rather than CPU bound if they aren’t already. Trading some CPU resources to get a more robust implementation that is easier to understand and maintain is a good choice. In addition, CPU speed increases are now coming more from multiple cores than from frequency scaling a single core. Consequently a language that produces an abundance of parallelism is an asset rather than a problem. Finally, large systems software projects like database management systems, operating systems, web servers, IM servers, email systems, etc. are incredibly large and complex. The Erlang model of spawning many lightweight threads that communicate via message passing is going to be less efficient than the more common shared memory and locks solution but it’s much easier to get correct. Erlang also encourages a “fail fast” programming model. I’ve long argued that this is the only way to get high scale systems software correct (Designing and Deploying Internet-Scale Services).
Certainly Erlang brings a tax as have other new languages that we have adopted over the years. But, it also brings some of what we need badly right now. For example, the fail fast programming model is the right one and, when combined with synchronous state redundancy, is how most high-scale systems should be written. Erlang also encourages the production of a very large number of threads which can be a good thing on very high core count servers. Message passing rather than shared memory with locks, and fail fast with operation restart, significantly increase the probability of the software system working correctly through unexpected events.
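To make the message passing and fail fast ideas concrete, here is a minimal sketch in Python rather than Erlang. It is an illustration under assumptions, not how Erlang/OTP is implemented: the `Worker` class, the `supervise` function, and the `STOP` sentinel are hypothetical names, and a single loop stands in for what would be many lightweight processes.

```python
import queue

STOP = object()  # hypothetical sentinel telling the supervisor to shut down


class Worker:
    """Holds private state; the only way to reach it is a mailbox message."""

    def __init__(self):
        self.processed = 0

    def handle(self, message):
        if not isinstance(message, int):
            # Fail fast: no attempt to repair or guess, just crash.
            raise ValueError(f"unexpected message: {message!r}")
        self.processed += message


def supervise(mailbox, make_worker):
    """Feed messages to a worker, replacing it with a fresh one on any failure."""
    worker = make_worker()
    restarts = 0
    while True:
        message = mailbox.get()
        if message is STOP:
            return worker.processed, restarts
        try:
            worker.handle(message)
        except Exception:
            # Restart from a known-good state rather than limping along.
            worker = make_worker()
            restarts += 1


mailbox = queue.Queue()
for message in [1, 2, "bad", 3, STOP]:
    mailbox.put(message)
print(supervise(mailbox, Worker))
```

Note that the restarted worker begins with empty state, so the work done before the failure is lost in this sketch. That is exactly why the fail fast model needs to be paired with state redundancy in a real system.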
From my perspective, the syntax of Erlang is less than beautiful but all the advantages above make up for most of that.
The Concurrency and Coordination Runtime is a .NET runtime that implements some of the features I mention above for languages like C#. George Chrysanthakopoulos, Microsoft CCR Architect, reports that MySpace is using it: MySpace.com using the CCR (Sriram Krishnan pointed me to this one).
It appears that Erlang usage is ramping up fairly quickly right now. Naturally, since it was developed there, Erlang is used by many Ericsson projects including the AXD301 ATM Switch and the AXE line of switches. The AXD series includes over 850k lines of Erlang. However, outside of Ericsson some very interesting examples are emerging. Amazon’s SimpleDB is written in Erlang (Amazon SimpleDB is built on Erlang and What You Need To Know About Amazon SimpleDB). The recently (and quietly) released Facebook Chat application uses Erlang as well (Dare Obasanjo sent that one my way). CouchDB is also written in Erlang (CouchDB: Thinking beyond the RDBMS). Some more Erlang applications from the Erlang FAQ:
Is it time for a new server-side implementation language?
--jrh
I’ve spent a big part of my life working on structured storage engines, first in DB2 and later in SQL Server. And yet, even though I fully understand the value of fully schematized data, I love full text search and view it as a vital access method for all content wherever it’s stored. There are two drivers of this opinion: 1) I believe, as an industry, we’re about ¼ of the way into a transition from primarily navigational access patterns to personal data to ones based upon full text search, and 2) getting agreement on broad, standardizing schema across diverse user and application populations is very difficult.
On the first point, for most content on the web, full text search is the only practical way to find it. Navigational access is available but it’s just not practical for most content. There is simply too much data and there is no agreement on schema so more structured searches are usually not possible. Basically structured search is often not supported and navigational access doesn’t scale to large bodies of information. Full text search is often the only alternative and it’s the norm when looking for something on the web.
Let’s look at email. Small amounts of email can be managed by placing each piece of email you choose to store in a specific folder so it can be found later navigationally. This works fine but only if we keep only a small portion of the email we get. If we never bothered to throw out email or other documents that we come across, the time required to folderize would be enormous and unaffordable. Folderization just doesn’t scale. When you start to store large amounts of email or just stop (wasting time) aggressively deleting email, then the only practical way to find most content is full text search. As soon as 5 to 10GB of un-folderized and un-categorized personal content is accumulated, it’s the web scenario all over again: search is the only practical alternative. I understand that this scenario is not supported or encouraged by IT or legal organizations at most companies but that is the way I choose to work. There is no technical stumbling block to providing unbounded corporate email stores and the financial ones really don’t stand up to scrutiny. Ironically most expensive, corporate email systems offer only tiny storage quotas while most free, consumer-based services are effectively unbounded. Eventually all companies will wake up to the fact that knowledge workers work more efficiently with all available data. And, when that happens, even corporate email stores will grow beyond the point of practical folderization.
The second issue is the difficulty of standardizing schema across many different stores and many different applications. The entire industry has wanted to do this over the past couple of decades and many projects have attempted to make progress. If they were widely successful, it would be wonderful but they haven’t been. If we had standardized schema, we would have quick and accurate access to all data across all participating applications. But it’s very hard to get all content owners to cooperate or even care. Search engines attempt to get to the same goal but they chose a more practical approach: they use full text search and just chip away at the problem. They work hard on ranking. They infer structure in the content where possible and exploit it where it’s found. Where structure can’t be found, at least there is full text search with reasonably good ranking to fall back upon.
Strong or dominant search engine providers have considerable influence over content owners, and weak forms of schema standardization become more practical as a result. For example, a dominant search engine provider can offer content owners opportunities to get better search results for their web site if they supply a web site map (standard schema showing all web pages in the site). This is already happening and web administrators are participating because it brings them value. A web site’s ranking in the important search engines is vital and a chance to lift your ranking even slightly is worth a fortune. Folks will work really hard where they have something to gain. So, if adopting common schema can improve ranking, there is a significant chance something positive actually could happen.
The combination of providing full text search over all content, motivating content providers to participate in full or partial schema standardization, and having the search engine infer schema where it’s not supplied feels like a practical approach to richer search. I love full text search and view it as the under-pinning to finding all information, structured or not. The most common queries will include both structured and non-structured components but the common element will be that full schema standardization isn’t required nor is it required that a user understand schema to be able to find what they need. Over time, I think we will see incremental participation in standardized schemas but this will happen slowly. Full text search with good ranking and relevance assisted by whatever schema can be found or inferred in the data will be the under-pinning to finding most content over the near term.
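A query that mixes structured and non-structured components can be sketched with a tiny inverted index. This is a minimal illustration under assumptions, not how any production search engine works: the `TinyIndex` class and its `add` and `search` methods are hypothetical, there is no ranking, and tokenization is just whitespace splitting.

```python
from collections import defaultdict


class TinyIndex:
    """Full text postings plus whatever structured fields each document has."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids
        self.fields = {}                  # doc id -> dict of known fields

    def add(self, doc_id, text, **fields):
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        # Fields are optional: documents without schema are still searchable.
        self.fields[doc_id] = fields

    def search(self, query, **filters):
        terms = query.lower().split()
        if not terms:
            return []
        # Full text part: documents containing every query term.
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        # Structured part: applied only when the caller supplies filters.
        return sorted(
            doc_id for doc_id in hits
            if all(self.fields[doc_id].get(k) == v for k, v in filters.items())
        )


index = TinyIndex()
index.add(1, "quarterly budget review", sender="alice")
index.add(2, "budget planning notes")  # no structured fields at all
index.add(3, "budget review follow up", sender="bob")
print(index.search("budget review"))
print(index.search("budget review", sender="bob"))
```

The design choice worth noting is that structure is purely additive here: a document with no fields is still found by text, and a user who knows nothing about the schema can still search.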
--jrh
Some time back I got a question from the leader of a 5 to 10 person startup on what I look for when hiring a Program Manager. I make no promise that what I look for is typical of what others look for – it almost certainly is not. However, when I’m leading an engineering team and interviewing for a Program Manager role, these are the attributes I look for. My response to the original query is below:
The good news is that you’re the CEO not me. But, were our roles reversed, I would be asking you why you think you need a PM at this point. A PM is responsible for making things work across groups and teams. Essentially they are the grease that helps a big company ship products that work together and get them delivered through a complicated web of dependencies. Does a single product startup in the pre-beta phase actually need a PM? Given my choice, I would always go with more great developers at this phase of the company’s life and have the developers take more design ownership, spend more time with customers, etc. I love the "many hats" model and it's one of the advantages of a start-up. With a bunch of smart engineers wearing as many hats as needed, you can go with less overhead and fewer fixed roles, and operate more efficiently. The PM role is super important but it’s not the first role I would staff in an early-stage startup.
But, you were asking for what I look for in a PM rather than advice on whether you should look to fill the role at this point in the company’s life. I don't believe in non-technical PMs, so what I look for in PM is similar to what I look for in a developer. I'm slightly more willing to put up with somewhat rusty code in a PM, but that's not a huge difference. With a developer, I'm more willing to put up with certain types of minor skill deficits in certain areas if they are excellent at writing code. For example, a talented developer that isn’t comfortable public speaking, or may only be barely comfortable in group meetings, can be fine. I'll never do anything to screw up team chemistry or bring in a prima donna but, with an excellent developer, I'm more willing to look at greatness around systems building and be OK with some other skills simply not being there as long as their absence doesn't screw-up the team chemistry overall. With a PM, those skills need to be there and it just won't work without them.
It's mandatory that PMs not get "stuck in the weeds". They need to be able to look at the big picture and yet, at the same time, understand the details, even if they aren't necessarily writing the code that implements the details. A PM is one of the folks on the team responsible for the product hanging together and having conceptual integrity. They are one of the folks responsible for staying realistic and not letting the project scope grow and release dates slip. They are one of the team members that need to think customer first, to really know who the product is targeting, to keep the project focused on that target, and to get the product shipped.
So, in summary: what I look for in a PM is similar to what I look for in a developer (http://mvdirona.com/jrh/perspectives/2007/11/26/InterviewingWithInsightAtMicrosoft.aspx) but I'll tolerate their coding possibly being a bit rusty. I expect they will have development experience. I'm pretty strongly against hiring a PM straight out of university -- a PM needs experience in a direct engineering role first to gain the experience to be effective in the PM role. I'll expect PMs to put the customer first and understand how a project comes together, keep it focused on the right customer set, not let feature creep set in, and to have the skill, knowledge, and experience to know when a schedule is based upon reality and when it's more of a dream. Essentially I have all the expectations of a PM that I have of a senior developer, except that I need them to have a broad view of how the project comes together as a whole, in addition to knowing many of the details. They must be more customer focused, have a deeper view of the overall project schedules and how the project will come together, be a good communicator, perhaps a less sharp coder, but have excellent design skills. Finally, they must be good at getting a team making decisions, moving on to the next problem, and feeling good about it.
--jrh
I forget what brought it up but some time back Sriram Krishnan forwarded me this article on Mike Burrows and his work through DEC, Microsoft, and Google (The Genius: Mike Burrows' self-effacing journey through Silicon Valley). I enjoyed the read. Mike has done a lot over the years but perhaps his best known works of recent years are Alta Vista at DEC and Chubby at Google.
I first met Mike when he was at Microsoft Research. He and Ted Wobber (also from Digital) came up to Redmond to visit. Back then I led the SQL Server relational engine development team which included the full text search index support. I was convinced then, and still am today, that relational database engines do a good job of managing structured data but a poor job of the other 90 to 95% of the data in the world that is less structured. It just seems nuts to me that customers industry-wide are spending well over $10B a year on relational database management systems and yet are only able to use these systems effectively to manage a tiny fraction of their data. As an increasing fraction of the structured data in the world is already stored in relational database management systems, industry growth will come from helping customers manage their less structured data.
To be fair, most RDBMSs (including SQL Server) do support full text indexing but what I’m after is deep support for full text where the index is a standard access method rather than a separate indexing engine on the side and, more importantly, full statistics are tracked on the full text corpus, allowing the optimizer to make high quality decisions on join orders and techniques that include full text indices.
If you haven’t read Mike’s original Chubby paper, do that: http://labs.google.com/papers/chubby.html. Another paper is at: http://labs.google.com/papers/paxos_made_live.html. Chubby is an interesting combination of name server, lease manager, and mini-distributed file system. It’s not the combination of functionality that I would have thought to bring together in a single system but it’s heavily used and well regarded at Google. Unquestionably a success.
--jrh
The years of Moore’s law growth without regard to power consumption are now over. On the data center side, power isn’t close to the largest cost of running a large service but it is one of the largest controllable costs and it has been in the press frequently of late. On the client side, battery power is the limiting factor.
It is worth understanding what devices consume the most power since most laptops provide some form of user control. Most systems allow LCD backlight dimming, the CPU power consumption can be lowered (a combination of factors including reducing clock speed and voltage), wireless radios can be switched off, and disk activity can be curtailed or eliminated. Where does the power go?
The data below was measured by Mahesri and Vardhan with a ThinkPad R40 as the system under test:
| Device | Standby | Minimum | Maximum |
|---|---|---|---|
| CPU | | 11.3W | 25.5W |
| CD-R/RW, DVD | 0.0W | 2.8W | 5.3W |
| LCD Backlight | | 0.6W | 3.5W |
| Wireless (802.11) | 0.1W | 1.0W | 3.1W |
| HDD (40GB@4,200RPM) | 0.2W | 0.6W | 2.8W |
| LCD | | 0.9W | 1.0W |
Data from: http://www.crhc.uiuc.edu/~mahesri/classes/project_report_cs497yyz.pdf.
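To put the maximum column in perspective, the short sketch below sums the listed maximums and computes the CPU's share. It assumes only the six devices in the table count toward the total, which understates real system power since other components are not listed; the `share_of_total` helper is a hypothetical name used for illustration.

```python
# Maximum power draw per device in watts, copied from the table above.
max_watts = {
    "CPU": 25.5,
    "CD-R/RW, DVD": 5.3,
    "LCD Backlight": 3.5,
    "Wireless (802.11)": 3.1,
    "HDD": 2.8,
    "LCD": 1.0,
}


def share_of_total(readings, device):
    """Fraction of the summed readings attributable to one device."""
    return readings[device] / sum(readings.values())


total = sum(max_watts.values())
cpu_share = share_of_total(max_watts, "CPU")
print(f"listed devices at maximum: {total:.1f}W, CPU share: {cpu_share:.1%}")
```

Among the listed devices, the CPU at maximum draws more than all the others combined.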
The dominant consumer by a significant factor is the CPU. This power consumption is, of course, very load dependent particularly in multi-core systems where the spread between |