Friday, December 14, 2007

The number 1 Amazon AWS requirement just got met: structured storage.  Amazon announced SimpleDB yesterday although it’s not yet available for developers to play with.  I’m looking forward to being able to write a simple application against it – I’ve had fun with S3. But, for now, the docs will have to do.

 

In the announcement (http://aws.amazon.com/simpledb), it’s explained that SimpleDB is not a relational DB.  AWS notes that some customers run relational DBs in EC2 and that those who need a complex or strictly enforced schema will continue to do this.  Others that only need a simple structured store with much less administrative overhead will use SimpleDB.  AWS explains that they will make it increasingly easy to do both.

 

The hard part of running an RDBMS in EC2 is that there is no data protection.  The EC2 local disk is 100% ephemeral.  Some folks are using block replicators such as DRBD (http://www.drbd.org/) to keep two MySQL systems in sync.  It’s kind of a nice solution but requires some skill to set up.  When AWS says “make it easier” I suspect they are considering something along these lines.  A block replicator would be a wonderful EC2 addition.  However, for those that really only need a simple structured store, SimpleDB is (almost) here today.

 

My first two interests are data model and pricing.  The data model is based upon domains, items, attributes, and values.  A domain roughly corresponds to a database and you are allowed up to 100 domains.  All queries are within a domain.  Within a domain, you can create items.  Each item has an ID (presumably unique).  Every item has attributes and attributes have values.  You don’t need to (and can’t) declare the schema of an item, and any item can have any number of attributes up to 256 per item.  Values can repeat, so a given item may or may not have a particular attribute, and it can have the same attribute more than once.  Values are simple UTF-8 strings limited to 1024 bytes.  All attributes are indexed.
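To make the data model concrete, here is a minimal in-memory sketch in Python. It is not the SimpleDB API (which isn’t available yet); the class and method names (`Domain`, `put`, `get`) are hypothetical. Only the two limits come from the docs, and counting the 256 limit per attribute-value pair and storing distinct values per attribute are both assumptions.

```python
MAX_PAIRS_PER_ITEM = 256   # limit quoted above; counted per attribute-value pair (assumption)
MAX_VALUE_BYTES = 1024     # values are UTF-8 strings of at most 1024 bytes


class Domain:
    """Hypothetical in-memory stand-in for a SimpleDB domain."""

    def __init__(self, name):
        self.name = name
        self.items = {}  # item ID -> {attribute name -> set of string values}

    def put(self, item_id, attribute, value):
        if len(value.encode("utf-8")) > MAX_VALUE_BYTES:
            raise ValueError("value exceeds 1024 bytes")
        item = self.items.setdefault(item_id, {})
        values = item.setdefault(attribute, set())
        pair_count = sum(len(v) for v in item.values())
        if value not in values and pair_count >= MAX_PAIRS_PER_ITEM:
            raise ValueError("item already holds 256 attribute-value pairs")
        values.add(value)

    def get(self, item_id, attribute):
        # A missing item or attribute is simply an empty result: no schema.
        return sorted(self.items.get(item_id, {}).get(attribute, set()))


books = Domain("books")
books.put("item-1", "author", "Gray")
books.put("item-1", "author", "Reuter")  # same attribute, second value
books.put("item-2", "title", "Readings in Database Systems")  # no author at all
```

Here `item-1` carries the `author` attribute twice while `item-2` does not carry it at all, which is exactly the schema-less behavior described above.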

 

The space overhead gives a clue to storage format:

·         Raw byte size (GB) of all item IDs + 45 bytes per item +

·         Raw byte size (GB) of all attribute names + 45 bytes per attribute name +

·         Raw byte size (GB) of all attribute-value pairs + 45 bytes per attribute-value pair

 

The storage format appears to be a single ISAM index of (item, attribute, value).  It wouldn’t surprise me if the index used in SimpleDB is the same code that S3 uses for metadata lookup.
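The billable-size formula above can be sketched as a small Python function. The function name and arguments are hypothetical, and two readings are assumptions: attribute names are counted once per distinct name, and the raw size of an attribute-value pair is taken to be the size of the value alone.

```python
OVERHEAD_BYTES = 45  # charged per item, per attribute name, and per attribute-value pair


def billable_bytes(item_ids, attribute_names, values):
    """Billable storage in bytes under the formula quoted above.

    item_ids:        item ID strings
    attribute_names: distinct attribute name strings (assumption)
    values:          one value string per attribute-value pair (assumption)
    """
    def size(text):
        return len(text.encode("utf-8"))

    total = sum(size(item_id) + OVERHEAD_BYTES for item_id in item_ids)
    total += sum(size(name) + OVERHEAD_BYTES for name in attribute_names)
    total += sum(size(value) + OVERHEAD_BYTES for value in values)
    return total


# One item, one attribute name, two values for that attribute.
example = billable_bytes(["item-1"], ["color"], ["red", "blue"])
```

In this example 18 bytes of raw data are billed as 198 bytes, so for small values the fixed 45-byte overhead dominates.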

 

The query language is respectable and includes: =, !=, <, >, <=, >=, STARTS-WITH, AND, OR, NOT, INTERSECTION, and UNION.  Queries are resource limited to no more than 5 seconds of execution time.
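Because every value is a string, the comparison operators compare text rather than numbers. The Python sketch below is not SimpleDB query syntax; `matching_items` and the sample data are hypothetical. It only illustrates how comparisons, STARTS-WITH, INTERSECTION, and UNION behave over multi-valued string attributes.

```python
def matching_items(items, attribute, predicate):
    """IDs of items having at least one value of `attribute` that satisfies `predicate`."""
    return {
        item_id
        for item_id, attributes in items.items()
        if any(predicate(value) for value in attributes.get(attribute, ()))
    }


# Prices are zero-padded so that string order matches numeric order.
items = {
    "item-1": {"price": ["0009"], "tag": ["db", "cloud"]},
    "item-2": {"price": ["0010"], "tag": ["cloud"]},
    "item-3": {"price": ["0150"]},
}

cheap = matching_items(items, "price", lambda v: v < "0100")       # price < '0100'
db_tagged = matching_items(items, "tag", lambda v: v.startswith("d"))  # STARTS-WITH

both = cheap & db_tagged     # INTERSECTION
either = cheap | db_tagged   # UNION

unpadded_surprise = "10" < "9"  # string comparison, not numeric
```

The last line is the reason for the zero padding: compared as strings, "10" sorts before "9".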

 

The storage model, like S3, is replicated asynchronously across data centers with the good and the bad that comes with this approach: the data is stored geo-redundantly which is wonderful but it is possible to update a table and, on a subsequent request, not even see your own changes.  The consistency model is very weak and the storage reliability is very strong. I actually like the model although most folks I talk to complain that it’s confusing.  Technically this is true but S3 uses the same consistency model.  I’ve spoken to many S3 developers and never heard a complaint (admittedly some just don’t understand it).
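The "not even see your own changes" behavior is easy to reproduce in a toy model. The sketch below is a hypothetical Python class, not how SimpleDB or S3 is implemented: a write is acknowledged once one replica has it, and the other replicas catch up later.

```python
class EventuallyConsistentStore:
    """Toy model: writes land on one replica and propagate asynchronously."""

    def __init__(self, replica_count=2):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # (key, value) writes not yet copied to the other replicas

    def put(self, key, value):
        self.replicas[0][key] = value      # acknowledged after a single replica
        self.pending.append((key, value))

    def get(self, key, replica_index):
        return self.replicas[replica_index].get(key)

    def propagate(self):
        """Stand-in for background replication: apply queued writes in order."""
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.put("item-1/color", "red")
stale = store.get("item-1/color", replica_index=1)  # served by a replica that lags
store.propagate()
fresh = store.get("item-1/color", replica_index=1)  # same read after replication
```

The first read returns nothing even though the write succeeded; the second returns the written value. Nothing was lost, the reader just arrived early.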

 

Pricing was my second interest. There are three charges for SimpleDB:

·         Machine Utilization: $0.14/machine hour

·         Data Transfer:

o    $0.10 per GB - all data transfer in

o    $0.18 per GB - first 10 TB / month data transfer out

o    $0.16 per GB - next 40 TB / month data transfer out

o    $0.13 per GB - data transfer out / month over 50 TB

·         Structured Storage: $1.50 GB/month

 

$1.50/GB/month, or $18/GB/year, is 10x the $1.80/GB/year charged by S3.  That’s fairly expensive in comparison to S3 but a bargain compared to what it costs to manage an RDBMS and the hardware that supports it.
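The arithmetic can be checked with a short Python sketch built from the price list above. The function names and the sample workload are made up for illustration, and treating 1 TB as 1024 GB is an assumption.

```python
STORAGE_PER_GB_MONTH = 1.50
MACHINE_HOUR = 0.14
TRANSFER_IN_PER_GB = 0.10
# (tier size in GB, price per GB); 1 TB = 1024 GB is an assumption
TRANSFER_OUT_TIERS = [(10 * 1024, 0.18), (40 * 1024, 0.16), (float("inf"), 0.13)]
S3_STORAGE_PER_GB_YEAR = 1.80  # S3 figure quoted above


def transfer_out_cost(gb_out):
    """Walk the tiers, charging each slice of traffic at its own rate."""
    cost = 0.0
    remaining = gb_out
    for tier_size_gb, rate in TRANSFER_OUT_TIERS:
        in_tier = min(remaining, tier_size_gb)
        cost += in_tier * rate
        remaining -= in_tier
        if remaining <= 0:
            break
    return cost


def monthly_cost(storage_gb, machine_hours, gb_in, gb_out):
    return (storage_gb * STORAGE_PER_GB_MONTH
            + machine_hours * MACHINE_HOUR
            + gb_in * TRANSFER_IN_PER_GB
            + transfer_out_cost(gb_out))


storage_ratio = (STORAGE_PER_GB_MONTH * 12) / S3_STORAGE_PER_GB_YEAR
# Hypothetical workload: 10 GB stored, 20 machine hours, 5 GB in, 50 GB out.
example_bill = monthly_cost(storage_gb=10, machine_hours=20, gb_in=5, gb_out=50)
```

For that made-up workload the bill comes to about $27.30 for the month, of which $15 is storage.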

 

More data on SimpleDB from AWS: http://aws.amazon.com/simpledb.

 

Sriram Krishnan has an excellent review up at: http://www.sriramkrishnan.com/blog/2007/12/amazon-simpledb-technical-overview.html

 

Thanks to Sriram Krishnan (Developer Division) and Dare Obasanjo (WinLive Platform Services) for sending this one my way.  I’m really looking forward to playing with SimpleDB and seeing how customers end up using it.  Overall, I find it tastefully simple and yet perfectly adequate for many structured storage tasks.

 

                                    --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Friday, December 14, 2007 10:57:22 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Wednesday, December 12, 2007

SpecPOWER is the first industry standard benchmark that evaluates power and performance characteristics of high volume servers.  Yesterday the final spec was released (see below thanks to Kushagra Vaid of GFS H/W Architecture & Standards). 

 

I have quibbles with this benchmark but they really are mostly quibbles.  Generally, I’m thrilled to see a benchmark out there that shines a light on server power consumption.  The entire industry needs it. Benchmarks drive action and we have lots of low hanging fruit to grab when it comes to server power/performance.

 

The press release is up at: http://www.spec.org/power_ssj2008/ and early results are at http://www.spec.org/power_ssj2008/results/power_ssj2008.html.

 

I remember working on IBM DB2 years back when we first ran TPC-A. Prior to the benchmark we had a perfectly good database management system that customers were using successfully to run some very large businesses. When we first ran TPC-A, let’s just say we didn’t like what we saw.  We got to work and ended up improving DB2 by a full factor of 10 in that release and then went on and improved by a further factor of 4 in the next release.  Yes, I know the only way to improve that much is to start off REALLY needing it.  That’s my point. As a direct result of the benchmark giving us and customers increased visibility into OLTP performance, the product improved 40x in less than 5 years. Customers gained, the product got better, and it gave the engineering team good goals to rally around. Benchmarks help customers.

 

As benchmarks age, it’s harder to find the big differentiating improvements. Once the easy changes are found and even the more difficult improvements have been worked through, benchmark specials typically begin to emerge.  Companies start releasing changes that help the benchmarks but do nothing for, and in rare cases can even hurt, real customer workloads. Eventually, this almost always happens and causes industry veterans, myself included, to distrust benchmarks. We forget that in the early years of most benchmarks, they really did help improve the product and delivered real value to customers.

 

I’m very happy to see SPEC release a benchmark that measures power and I’m confident that it will help drive big power efficiency gains. Sure, we’ll eventually see the game playing benchmark specials but, for now, I predict many of the improvements will be real. This will help the industry evolve more quickly.  Now can someone please start working on a data center efficiency benchmark?  Huge innovations are waiting in data center design and even more in how existing designs are deployed.

 

                                                --jrh

 

 

From: osgmembers-request@spec.org [mailto:osgmembers-request@spec.org] On Behalf Of Alan Adamson
Sent: Tuesday, December 11, 2007 8:44 AM
To: osgmembers@spec.org
Subject: (osgmembers-363) SPECpower benchmark availability

 


I am delighted to announce the public availability of the initial SPECpower benchmark, SPECpower_ssj2008 - you can find the initial press release here http://www.spec.org/power_ssj2008/ , and see initial results here :
http://www.spec.org/power_ssj2008/results/power_ssj2008.html

This benchmark release is the result of a long and strenuous process (I cannot recall a committee meeting four days a week for several hours each day), needed partly because of the new ground being plowed.  

I expect that all forthcoming benchmark development groups will be thinking about how to incorporate power measurements, and that the power committee will be looking at extending the scope of its workloads, and also helping other committees learn from their experience.

And now, on to several other benchmarks in or heading for general membership review.  This is a pretty exciting time for the OSG!

My thanks and congratulations to those who worked so hard on this initial SPECpower benchmark.

Alan Adamson, OSG Chair, SPEC
905-413-5933 | Tieline 969-5933 | FAX: 905-413-4854
Internet: adamson@ca.ibm.com
Java Technology Centre
IBM Toronto Lab

 

 


Wednesday, December 12, 2007 10:56:15 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Tuesday, December 11, 2007

Google announced a project in October 2006 to install 9,212 solar panels on the roof of its headquarters complex.  Currently over 90% of these panels are installed and active. They expect the installation will produce 30% of the peak power requirements of the headquarters buildings. For example, as I post this, they report 0.6 MW have been produced over the last 24 hour period.  The roughly ½ megawatt is arguably in the noise when compared with Google’s total data center power consumption, which remains a carefully guarded secret.  With high 10’s of data centers world-wide each consuming in the 10MW range, the 0.6MW produced on the roof of the HQ building is a very small number.  Nonetheless, 0.6MW is MUCH better than nothing and it’s great to see these resources committed to making a difference.

 

The real time power production report for this installation can be found at: http://www.google.com/corporate/solarpanels/home.  Thanks to Lewis Curtis for sending this my way.

 

                                    --jrh

 


Tuesday, December 11, 2007 10:53:11 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Sunday, December 09, 2007

I’ve long argued that the firm and clear division between development and operations common in many companies is a mistake.  Development doesn’t feel the pain or understand what it takes to make their services more efficient to operate.  Operations tends to hire more people to deal with the mess.  It doesn’t work, it isn’t efficient, and it’s slow.  The right model has the development/ops line very blurred. The Amazon model takes this to an extreme with essentially a “you wrote it, you run it” approach. This is nimble and delivers the pain right back where it can be most efficiently solved: development.  But it can be a bit tough on the engineering team.  The approach I like most is a hybrid, where we have a very small tier 1 operations support team with everything else going back to development.

 

However, if I had to err to one extreme or the other, I would head down the Amazon path.  A clear division between ops and dev leads to an over-the-wall approach that is too slow and too inefficient. What’s below is more detail on the Amazon approach, not so much because they do it perfectly but because they represent an extreme and therefore are a good data point to think through.  Also note the reference to test-in-production found in the last paragraph.  This is 100% the right approach in my opinion.

 

                                                --jrh

 

When I was at Amazon (and I don’t think anything has changed) services were owned soup to nuts by the product team. There were no testers and no operations people. Datacenter operations, security, and operations were separate, centralized groups.

 

So ‘development’ was responsible for much of product definition and specification, the development, deployment and operation of services. This model came down directly from Jeff Bezos; his intention was to centralize responsibility and take away excuses. If something went wrong it was very clear who was responsible for fixing it. Having said that, Amazon encouraged a no-blame culture. If things went wrong the owner was expected to figure out the root cause and come up with a remediation plan.

 

This had many plusses and a few negatives. On the positive side:

·         Clear ownership of problems

·         Much shorter time to get features into production. There was no hand-off from dev to test to operations.

·         Much less requirement for documentation, (both a plus and a minus)

·         Very fast response to operational issues, since the people who really knew the code were engaged up-front.

·         Significant focus by developers on reliability and operability, since they were the people responsible for running the service

·         Model works really well for new products

Negatives:

·         Developers have to carry pagers. On-call rotations, with people on-call required to respond to a sev-1 within 15 minutes of being paged 24 hours a day. Could lead to burn-out.

·         Ramp up curve for new developers is extremely steep because of the lack of documentation and process

·         For some teams the operations load completely dominated, making it very difficult to tackle new features or devote time to infrastructure work.

·         Coordinating large feature development across multiple teams is tough.

 

The one surprise in this is that in my opinion code quality was as good as or better than here, despite having no testers. People developed an approach of deploying to a few machines and observing their behavior for a few days before starting a large roll out. Code could be reverted extremely quickly.
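The deploy-to-a-few-machines-first approach described above can be sketched in a few lines of Python. This is an illustration only, not Amazon’s tooling: `staged_rollout` and its `deploy`, `healthy`, and `revert` callbacks are hypothetical names, and the days of observation are collapsed into a single `healthy` check.

```python
def staged_rollout(machines, deploy, healthy, revert, canary_count=3):
    """Deploy to a few machines first; continue only if they all stay healthy."""
    canaries, rest = machines[:canary_count], machines[canary_count:]
    for machine in canaries:
        deploy(machine)
    if not all(healthy(machine) for machine in canaries):
        # Fast revert: only the canaries ever ran the new code.
        for machine in canaries:
            revert(machine)
        return False
    for machine in rest:
        deploy(machine)
    return True


def run_example(canaries_healthy):
    versions = {f"host-{i}": "v1" for i in range(10)}
    completed = staged_rollout(
        list(versions),
        deploy=lambda m: versions.__setitem__(m, "v2"),
        healthy=lambda m: canaries_healthy,
        revert=lambda m: versions.__setitem__(m, "v1"),
    )
    return completed, versions


good_completed, good_versions = run_example(canaries_healthy=True)
bad_completed, bad_versions = run_example(canaries_healthy=False)
```

When the canaries misbehave, the other seven hosts never see the new version, which is what keeps the revert cheap.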

 


Sunday, December 09, 2007 10:51:48 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Thursday, December 06, 2007

Michael Hunter, who authors the Testing and Debugging blog at Dr. Dobb’s Journal, asked me for an interview on testing related topics some time back. I’ve long lamented that, industry-wide, there isn’t nearly enough emphasis on test and software quality assurance innovation. For large projects, test is often the least scalable part of the development process.  So, when Michael offered me a platform to discuss test more broadly, I jumped on it. 

 

Michael structures these interviews, and his subsequent blog entry, around five questions.  These ranged from where I first got involved in software testing, through the most interesting bug I’ve run into, what has most surprised me about testing, what’s the most important thing for a tester to know, and what’s the biggest challenge facing the test discipline over the next five years.

 

Michael’s interview is posted at: http://www.ddj.com/blog/debugblog/archives/2007/12/five_questions_39.html.

 

                                                --jrh

 


Thursday, December 06, 2007 10:50:14 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Wednesday, December 05, 2007

Mike Zintel (Windows Live Core) sent this one my way.  It’s a short 2:45 video that is not particularly informative but it is creative: http://www.youtube.com/watch?v=fi4fzvQ6I-o.

 

                                --jrh

 


Wednesday, December 05, 2007 10:46:39 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings
 Tuesday, December 04, 2007

Amazon doesn’t release much about its inner workings, which is unfortunate in that they do some things notably well and often don’t get credit.  My view is that making some of these techniques more public would be a great recruiting tool for Amazon but I understand the argument for secrecy as well.

 

Ronny Kohavi recently pointed me to this presentation on A/B testing at Amazon.  It’s three years old but still well worth reading.  Key points from my perspective:

·         Amazon is completely committed to A/B testing.  In past Bezos presentations he’s described Amazon as a “data driven company”. One of the key advantages of a service is you get to see in real time how well it’s working.  Any service that doesn’t take advantage of this, and is just making standard best-guess or informed-expert decisions, is missing a huge opportunity and hurting its business.  The combination of A/B testing and cycling through ideas quickly does two wonderful things: 1) it makes your service better FAST, and 2) it takes the politics and influence out of new ideas.  Ideas that work win, and those that don’t show results don’t get used, whether proposed by a VP or the most junior web designer.  It’s better for the service and for everyone on the team.

·         The infrastructure focus at Amazon. Bezos gets criticized by Wall Street analysts for over-investing in infrastructure, but the infrastructure investment gives them efficiency and gives them pricing power, which is one of their biggest assets.  The infrastructure investment also allows them to host third parties, which gives Amazon more scale in a business where scale REALLY matters, and it gives customers broader selection, which tends to attract more customers.  Most important, it gives Amazon, the data driven company, more data and this data allows them to improve their service rapidly and give customers a better experience: “customers who bought X…”

·         Negative cost of capital: Slide 8 documents how they get a product on day 0, sell it on day 20, get paid on day 23, and pay the supplier on day 44.

·         Slide 7 shows what can be done with a great infrastructure investment: respectable margins and very high inventory turn rates.
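For readers who haven’t run an A/B test, the core decision ("did B really beat A?") is often made with a two-proportion z-test. The Python sketch below shows that standard technique; it is not taken from the presentation, the function name is hypothetical, and the visitor and conversion counts are made up.

```python
from math import erf, sqrt


def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)


# Made-up numbers: variant A converts 200 of 10,000 visitors, variant B 260 of 10,000.
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
```

With these numbers the p-value falls below 0.01, so the result stands regardless of who proposed variant B.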

 

The presentation is posted: http://ai.stanford.edu/~ronnyk/emetricsAmazon.pdf

 

                                                                --jrh

 


 

Tuesday, December 04, 2007 7:23:10 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Friday, November 30, 2007

Some months back I finished a paper with Joe Hellerstein and Michael Stonebraker scheduled to be published in the next issue of Foundations and Trends in Databases.  This paper is aimed at describing how current generation database management systems are implemented.  I’ll post a reference to it here once it is published.

 

As a very small part of this paper, we cover the process model used by Oracle, DB2, MySQL, SQL Server, and PostgreSQL.  A process model is how a database maps the work it’s doing on behalf of multiple concurrent users onto operating system processes and/or threads.  This is an important design choice in that it has fundamental impact on the number of concurrent requests that can be supported, development costs, maintainability, and code base portability amongst other issues.

 

These same design choices are faced by most high scale server designers and are equally applicable to mail servers, web servers, app servers, and any other application needing to service large numbers of requests in parallel. Given the importance of the topic and its applicability to all multi-user server systems, it’s worth covering separately here.  I find it interesting to note that three of the leading DBMSs support more than one process model and one supports four variants. There clearly is no single right answer.

 

Summarizing the process models supported by IBM DB2, MySQL, Oracle, PostgreSQL, and Microsoft SQL Server:

 

1.      Process per DBMS Worker: This is the most straightforward process model and is still heavily used today.  DB2 defaults to process per DBMS worker on operating systems that don’t support high quality, scalable OS threads and thread per DBMS worker on those that do.  This is also the default Oracle process model, but Oracle also supports a process pool, described below, as an optional model.  PostgreSQL runs the Process per DBMS Worker model exclusively on all operating system ports.

2.      Thread per DBMS Worker: This is an efficient model with two major variants in use today:

a.       OS thread per DBMS Worker: IBM DB2 defaults to this model when running on systems with good OS thread support. This is the model used by MySQL as well.

b.      DBMS Thread per DBMS Worker: In this model DBMS Workers are scheduled by a lightweight thread scheduler on either OS processes or OS threads both of which are explained below. This model avoids any potential OS scheduler scaling or performance problems at the expense of high implementation costs, poor development tools and debugger support, and substantial long-standing maintenance costs.  There are two sub-categories of this model:

                                                              i.      DBMS threads scheduled on OS Process: a lightweight thread scheduler is hosted by one or more OS Processes.  Sybase uses this model and began with the thread scheduler hosted by a single OS process.  One of the challenges with this approach is that, to fully exploit shared memory multi-processors, it is necessary to have at least one process per processor.  Sybase has since moved to hosting DBMS threads over potentially multiple OS processes to avoid this limitation.  When DBMS threads run within multiple processes, there will be times when one process has the bulk of the work and other processes (and therefore processors) are idle.  To make this model work well under these circumstances, DBMSs must implement thread migration between processes. Informix did an excellent job of this starting with the Version 6.0 release.  All current generation systems supporting this model implement a DBMS thread scheduler that schedules DBMS Workers over multiple OS processes to exploit multiple processors.

                                                            ii.      DBMS threads scheduled on OS Threads:  Microsoft SQL Server supports this model as a non-default option.  By default, SQL Server runs in the DBMS Workers multiplexed over a thread pool model (described below).  This SQL Server option, called Fibers, is used in some high scale transaction processing benchmarks but, otherwise, is in very light use.

3.      Process/Thread Pool: In this model DBMS workers are multiplexed over a pool of processes.  As OS thread support has improved, a second variant of this model has emerged based upon a thread pool rather than a process pool.  In this latter model, DBMS workers are multiplexed over a pool of OS threads:

a.       DBMS workers multiplexed over a process pool: This model is much more efficient than process per DBMS worker, is easy to port to operating systems without good OS thread support, and scales very well to large numbers of users.  This is the optional model supported by Oracle and the one they recommend for systems with large numbers of concurrently-connected users.  The Oracle default model is process per DBMS worker.  Both of the options supported by Oracle are easy to support on the vast number of different operating systems they target (at one point Oracle supported over 80 target operating systems).

b.      DBMS workers multiplexed over a thread pool: Microsoft SQL Server defaults to this model and well over 99% of the SQL Server installations run this way. To efficiently support 10’s of thousands of concurrently connected users, SQL Server optionally supports DBMS threads scheduled on OS threads.
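The difference between models 2a and 3b above is easier to see in code. The Python sketch below is only an illustration of the two mappings, not how any of these DBMSs is implemented; `handle_request`, `thread_per_worker`, and `thread_pool` are hypothetical names and the "work" is a placeholder.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def handle_request(client_id):
    """Stand-in for the work a DBMS worker does on behalf of one client."""
    return f"result for client {client_id}"


def thread_per_worker(client_ids):
    """Model 2a: one OS thread per DBMS worker, so threads grow with clients."""
    results = {}

    def worker(client_id):
        results[client_id] = handle_request(client_id)

    threads = [threading.Thread(target=worker, args=(cid,)) for cid in client_ids]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, len(threads)


def thread_pool(client_ids, pool_size=4):
    """Model 3b: many DBMS workers multiplexed over a fixed pool of OS threads."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        results = dict(zip(client_ids, pool.map(handle_request, client_ids)))
    return results, pool_size


clients = list(range(100))
per_worker_results, per_worker_threads = thread_per_worker(clients)
pooled_results, pooled_threads = thread_pool(clients)
```

Both produce the same answers for 100 clients, but the first uses 100 OS threads and the second uses 4. That gap is what matters when the number of concurrently connected users gets large.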

 

Most current generation commercial DBMSs support intra-query parallelism, the ability to execute all or parts of a query in parallel. Essentially, intra-query parallelism is the temporary assignment of multiple DBMS workers to execute a SQL query.  The underlying process model is not impacted by this feature other than that a single client connection may, at times, have more than a single DBMS worker.

 

Process model selection has a substantial influence on DBMS scaling and portability. As a consequence, three of the most successful commercial systems each support more than one process model across their product line.   From an engineering perspective, it would clearly be much simpler to employ a single process model across all operating systems and at all scaling levels.  But, due to the vast diversity of usage patterns and the non-uniformity of the target operating systems, each of these DBMSs has elected to support multiple models.

 

                                                --jrh

 


 

Friday, November 30, 2007 5:48:30 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Monday, November 26, 2007

There are few things we do that are more important than interviewing and yet it’s done very unevenly across the company.  Some are amazing interviewers and others, well, I guess they write good code or something else :-).

 

Fortunately, interviewing can be learned and, whatever you do and wherever you do it, interviewing with insight pays off.  Some time back, a substantial team was merged with the development group I lead and, as part of the merger, we gained a bunch of senior folks, many of whom do As Appropriate (AA) interviews and all of whom were frequent contributors on our interview loops.  The best way to get in sync on interviewing techniques and leveling is to talk about it, so I brought us together several times to talk about interviewing, to learn from each other, and to set some standards on how we’re going to run our loops.  In preparation for that meeting, I wrote up some notes on what I view as best practices for AAs, but these apply to all interviewers and I typically send these out whenever I join a team.

 

Some of these are specific to Microsoft but many apply much more broadly.  There is some internal Microsoft jargon used in the doc.  For example, at Microsoft the AA is short for “As Appropriate” and is the final decision maker on whether an offer will be made. However, most of what is here is company invariant.

 

The doc: JamesRH_AA_Interview_NotesX.doc (37 KB).

 

I also pulled some of the key points into a ppt: AA Interview NotesX.ppt (156.5 KB).

 

                                    --jrh

 


 

Monday, November 26, 2007 5:44:24 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Process
 Friday, November 23, 2007

Ted Wobber (msft Research) brought together the following short list of SSD performance data.  Note the FusionIO part claiming 87,500 IOPS in a 640 GB package.  I need to run a perf test against that part and see if it's real.  It looks perfect for very hot OLTP workloads.

 

A directory of “fastest SSDs”:

http://www.storagesearch.com/ssd-fastest.html

                Note that this contains RAM SSDs as well as flash SSDs.  This list, however, seems to be ranked by bandwidth, not IOPS.

 

This manufacturer makes a very high-end database accelerator:

http://www.stec-inc.com/technology/

Among the things that they do are:   most likely logical address re-mapping, way over-provisioning of free space, highly parallel ops

 

Then there are these guys who do the hard work in the host OS:

http://managedflash.com/home/index.htm

They clearly do logical address re-mapping, but their material is strangely devoid of mention of cleaning costs.  Perhaps they get “free” hints from the OS free block table.

 

Nevertheless, the following article is worth reading:

http://mtron.easyco.com/news/papers/easyco-flashperformance-art.pdf

 

FusionIO: http://fusionio.com/.

 


 

Friday, November 23, 2007 6:21:26 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Tuesday, November 20, 2007

Google has been hiring networking hardware folks so it’s been long speculated that they are building their own network switches.  This remains speculation only but the evidence is mounting:

 

From http://www.nyquistcapital.com/2007/11/16/googles-secret-10gbe-switch/ (Sent my way by James Depoy of the OEM team and Michael Nelson of SQL Server):

Through conversations with multiple carrier, equipment, and component industry sources we have confirmed that Google has designed, built, and deployed homebrewed 10GbE switches for providing server interconnect within their data centers.

We believe Google based their current switch design on Broadcom’s (BRCM) 20-port 10GE switch silicon (BCM56800) and SFP+ based interconnect. It is likely that Broadcom’s 10GbE PHY is also being employed. This would be a repeat of the same winner-take-all scenario that played out in 1GbE interconnect.

 

This article attempts to track Google consumption by tracking shipments of 20 port 10GigE silicon. Not a bad approach.  In their determination, Google is installing 5,000 ports a month. If you assume that servers dominate over inter-switch connections, the implication is that Google is installing nearly 5k servers per month.

 

                                                --jrh

 


Tuesday, November 20, 2007 8:38:39 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Monday, November 19, 2007

Last week I attended and presented at the USENIX LISA conference (http://www.usenix.org/event/lisa07/). I presented Designing and Deploying Internet-Scale Applications and the slides are at: PowerPoint slides.

 

I particularly enjoyed Andrew Hume’s (AT&T) talk on the storage sub-systems used at AT&T Research, the data error rates he has seen over the last several decades, and what he does about them.  His experience exactly parallels mine, with more solid evidence, and can be summarized as: all layers in the storage hierarchy produce errors. The only way to store data for the long haul is with redundancy coupled with end-to-end error detection.  I enjoyed the presentations of Shane Knapp and Avleen Vig of Google in that they provided a small window into how Google takes care of their ~10^6 servers with a team of 30 or 40 hardware engineers world-wide, the software tools they use to manage the world’s biggest fleet, and the release processes used to manage these tools. Guido Trotter, also of Google, talked about how Google IT (not the production systems) is using Xen and DRBD to build highly reliable IT systems.  He used DRBD (http://www.drbd.org/download.html) to do asynchronous, block-level replication between a primary and a secondary.  The workload runs in a Xen virtual machine and, on failure, is restarted on the secondary. Ken Brill, Executive Director of the Uptime Institute, made a presentation focused on power being the problem.  Ignore floor space cost; system density is not the issue, it’s a power problem. He’s right and it’s becoming increasingly clear each year.
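To make "redundancy coupled with end-to-end error detection" concrete, here is a minimal Python sketch. It is not how AT&T or anyone else mentioned above implements it: the `ReplicatedStore` class and its in-memory dictionaries are hypothetical stand-ins for real storage devices. The idea it shows is that the checksum is computed when the data is written and verified when it is read, so corruption introduced by any layer in between is caught, and a good replica is used to repair the bad ones.

```python
import hashlib
from typing import Dict, List


def checksum(data: bytes) -> str:
    """End-to-end checksum computed by the writer and verified by the reader."""
    return hashlib.sha256(data).hexdigest()


class ReplicatedStore:
    """Toy store: several copies of each blob plus a checksum taken at write time.

    Hypothetical illustration only. The dicts stand in for independent
    storage devices; a real system would keep the checksum with the
    application or in separate metadata, not next to the data it protects.
    """

    def __init__(self, replica_count: int = 3) -> None:
        if replica_count < 1:
            raise ValueError("need at least one replica")
        self.replicas: List[Dict[str, bytes]] = [{} for _ in range(replica_count)]
        self.checksums: Dict[str, str] = {}

    def put(self, key: str, data: bytes) -> None:
        self.checksums[key] = checksum(data)
        for replica in self.replicas:
            replica[key] = data

    def get(self, key: str) -> bytes:
        expected = self.checksums[key]
        good = None
        damaged = []
        for replica in self.replicas:
            data = replica.get(key)
            if data is not None and checksum(data) == expected:
                if good is None:
                    good = data
            else:
                damaged.append(replica)
        if good is None:
            raise IOError(f"all replicas of {key!r} failed verification")
        # Repair any copy that was missing or failed its checksum.
        for replica in damaged:
            replica[key] = good
        return good


if __name__ == "__main__":
    store = ReplicatedStore()
    store.put("log-0001", b"call detail records")
    store.replicas[0]["log-0001"] = b"call detail recordz"  # simulated silent corruption
    print(store.get("log-0001"))
```

Redundancy alone would not help here: without the checksum, the reader has no way to tell which of the differing copies is the correct one.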

 

 My rough notes from the sessions I attended are at:  JamesRH_Notes_USENIXLISA2007x.docx (21.03 KB).

 

                                                --jrh

 


 

Monday, November 19, 2007 7:06:36 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Friday, November 16, 2007

Three weeks ago I presented at HPTS (http://www.hpts.ws/index.html). HPTS is an invitational conference held every two years since 1985 in Asilomar, California that brings together researchers, implementers, and users of high-scale transaction processing systems.  It’s one of my favorite conferences in that it attracts a very interesting group of people, is small enough that everyone can contribute, and there is lots of informal discussion in a great environment on the ocean near Monterey.

 

I presented Modular Data Center Design and Designing and Deploying Internet-Scale Services.  A highlight of this year’s session was a joint keynote address from David Patterson of Berkeley and Burton Smith of Microsoft.  Dave's slides are posted at DavidPattersonTechTrends2007.ppt (442.5 KB).  Burton's not in the office right now so I don't have access to his but will post them when I do.

 

I’m the General Chair for the 2009 HPTS, which is scheduled for October 25 through 28, 2009.  Keep the date clear and plan on submitting an interesting position paper to get invited.  If you are doing high-scale, data-centric applications, HPTS is always fun.

 

                                                --jrh

 


 

Friday, November 16, 2007 5:19:45 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Monday, November 12, 2007

For the last year or so I’ve been collecting Scaling Web Site war stories and I’ve been posting them to my Microsoft internal blog.  I collect them for two reasons: 1) scaling web site problems all center around persistent state management and I’m a database guy so the interest is natural, and 2) it’s amazing how frequently the same trend appears: design a central DB, move to a functional partition, then move to a horizontal partition. Somewhere through that cycle, add caching at various levels.  Most skip the hardware evolution step of starting with scale-up servers and then moving to scale-out clusters but even that pattern shows up remarkably frequently (e.g. eBay and Amazon).
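The progression above can be sketched in a few lines of Python. This is a hedged illustration rather than any particular site's design: the table names, database names, `horizontal_route`, and `ReadThroughCache` are all hypothetical. Functional partitioning sends each kind of data to its own database, horizontal partitioning then spreads one kind of data across shards by hashing a key, and a cache sits in front of the lookups.

```python
import hashlib
from typing import Callable, Dict

# Functional partition: each kind of data gets its own database.
# These table and database names are made up for illustration.
FUNCTIONAL_MAP: Dict[str, str] = {
    "users": "users-db",
    "orders": "orders-db",
    "search": "search-db",
}


def functional_route(table: str) -> str:
    """Pick the database that owns a given table."""
    return FUNCTIONAL_MAP[table]


def horizontal_route(table: str, key: str, shard_count: int) -> str:
    """Pick a shard of the table's database by hashing the row key."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % shard_count
    return f"{functional_route(table)}-shard{shard}"


class ReadThroughCache:
    """Tiny read-through cache: call the loader only on a miss."""

    def __init__(self, loader: Callable[[str], str]) -> None:
        self._loader = loader
        self._entries: Dict[str, str] = {}
        self.misses = 0

    def get(self, key: str) -> str:
        if key not in self._entries:
            self.misses += 1
            self._entries[key] = self._loader(key)
        return self._entries[key]


if __name__ == "__main__":
    cache = ReadThroughCache(lambda user: horizontal_route("users", user, 4))
    for user in ("alice", "bob", "alice"):
        print(user, "->", cache.get(user))
    print("cache misses:", cache.misses)
```

Simple modulo hashing like this forces most keys to move when the shard count changes, which is one reason real deployments often use a directory or consistent hashing instead.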

 

Scaling web site war stories:

·         Scaling Amazon: http://glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

·         Scaling Second Life: http://radar.oreilly.com/archives/2006/04/web_20_and_databases_part_1_se.html

·         Scaling Technorati: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/

·         Scaling Flickr: http://radar.oreilly.com/archives/2006/04/database_war_stories_3_flickr.html

·         Scaling Craigslist: http://radar.oreilly.com/archives/2006/04/database_war_stories_5_craigsl.html

·         Scaling Findory: http://radar.oreilly.com/archives/2006/05/database_war_stories_8_findory_1.html

·         MySpace 2006: http://sessions.visitmix.com/upperlayer.asp?event=&session=&id=1423&year=All&search=megasite&sortChoice=&stype=

·         MySpace 2007: http://sessions.visitmix.com/upperlayer.asp?event=&session=&id=1521&year=All&search=scale&sortChoice=&stype=

·         Twitter, Flickr, Live Journal, Six Apart, Bloglines, Last.fm, SlideShare, and eBay: http://poorbuthappy.com/ease/archives/2007/04/29/3616/the-top-10-presentation-on-scaling-websites-twitter-flickr-bloglines-vox-and-more

 

Thanks to Soumitra Sengupta for sending the Flickr and PoorButHappy pointer my way and to Jeremy Mazner for sending the MySpace references.

 

                                                                --jrh

 


 

Monday, November 12, 2007 5:22:28 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Friday, November 09, 2007

Earlier in the week Dr. Tachi Yamada of the Bill and Melinda Gates Foundation presented the work they are doing on health care in developing countries.  Some years back Bill Gates gave a similar talk at Microsoft and it was an amazing presentation, partly due to the depth and breadth of Bill’s understanding of the world health care problem, but what impressed me most was the effectiveness of applying business principles to a social problem.  Apply capital to the highest leverage opportunities.  Don’t just invest in the breakthrough but also in overcoming the social and political barriers to uptake.  Tailor the solution to the local environment.  Work on the supply chain.  Influence the economic factors that cause pharma companies to invest in a given solution.

 

The same techniques that allow a company to find success in business can be applied to world healthcare.  I love the approach and Dr. Yamada’s talk this week followed a similar theme.  My rough notes follow.

 

                                                                                --jrh

 

·         Speaker: Dr. Tadataka (Tachi) Yamada

o   Excellent presentation. He quietly relays the facts without slides and just lays out a very compelling and very clear picture of their approach to health care.

·         About ½ the foundation focuses on health, ¼ on learning in the US, and ¼ on improving economic situation

·         1,000 babies will die during this talk.

·         Life expectancy: 50 in sub-Saharan Africa and close to 80 here in North America

·         Bill “finally graduated” from Harvard last June and in his commencement address he said:

o   humanity’s great advancements are not the discovery of technology but the application of it to fight inequity.

·         $2T spent on healthcare in the US.  A few billion from Gates foundation won’t correct the lack of political will in how this is applied.  $2B will have a fundamental impact spent in the developing world. This is where we can have the greatest positive impact and that’s why the foundation focuses its healthcare resources in the developing world.

·         HIV battle is using prevention.  Lifetime cost of treatment makes it very expensive to battle via treatment.

o   Circumcision has been shown very effective in reducing the transmission of HIV.

o   Long term approach is vaccine (note that 25 years of research haven’t yet found this)

§  We’re investing $500m over 5 years in HIV vaccine research

·         We focus on all phases of taking science to improved health outcomes:

o   To science, then to local opinion, then to policy, and then to application.  Without covering all four, the full impact will not be realized.

·         In developing world 70% of all care is private, often for profit, health care.

o   Individuals purchasing directly from pharmacies (e.g. Malaria treatment)

o   Basic point is that you need to understand the entire system (economics, policy, social factors, etc.)

·         Mass customization is required for global success in business AND also in not-for-profit. The same ideas apply.

·         Yamada points out that bed nets are effective in the fight against Malaria but aren’t in heavy use. He shows how companies market products and argues that we need to do the same thing in public health care.  People have to want a treatment; people have to believe in it or it won’t work.

·         Peer reviews kill innovation.  Need innovators reviewing innovation. Standard peer review tends to seek out incremental improvements to existing systems. 

·         10m children lose their lives each year.  Must stay focused on the prize: reduced mortality.

·         Quote from one of his ex-managers: “If you aren’t keeping score, you are just practicing”

o   Metrics driven approaches are needed

·         Birth rates: 30% lack of control and 70% demand side problem.

·         We believe in a healthy pharmaceutical industry and believe in IP but need affordable prices in the under-developed world.

·         Pharma makes less than 1% of its profits in the developing world.  Selling at cost would drive volume and not impact the profit picture.

 


 

Friday, November 09, 2007 5:55:30 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings
 Tuesday, November 06, 2007

I wrote this back in March of 2003 when I led the SQL Server WebData team but its applicability is beyond that team. What’s below is a set of Professional Engineering principles that I’ve built up over the years. Many of the concepts below are incredibly simple and most are easy to implement but it’s a rare team that does them all.

 

More important than the specific set of rules I outline below is to periodically stop, think in detail about what’s going well and what isn’t; think about what you want to personally do differently and what you would like to help your team do differently. I don’t do this as often as I should – we’re all busy with deadlines looming – but, each time I do, I get something significant out of it.

 

The latest Word document is stored at: http://mvdirona.com/jrh/perspectives/content/binary/ProfessionalEngineering.docx and the current version is inline below.  Send your debates and suggestions my way.

 

                                    --jrh


 

James Hamilton, Windows Live Platform Services
Bldg RedW-C/1279, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | Msft internal blog: msblogs/JamesRH

 

Professional Engineering

James Hamilton, 2003.03.13

Update: 2007.03.09

 

·          Security and data Integrity: The data business is about storing, managing, querying, transforming, and analyzing customer data and, without data integrity and security, we would have nothing of value to customers.  No feature matters more than data integrity and security and failures along either of these two dimensions are considered the most serious. Our reputation with our customers and within our company is dependent upon us being successful by this measure, and our ability to help grow this business is completely dependent upon it.

·          Code ownership: Every line of code in the system should have a non-ambiguous owner and there should be a back-up within that department. Generally, each team should be organized roughly along architectural boundaries avoiding components spread over many or multiple teams. There should be no files "jointly" owned by multiple teams. If necessary, split files to get non-ambiguous ownership.

·          Design excellence: Utilize the collective expertise of the entire team and, where appropriate, experience from other teams. Design for the long haul. Think through cross-component and inter-feature interactions before proceeding. Never accept quick hacks that we can't maintain over the long term and don't rush something out if it can't be made to work. A good general rule: "never promise what you don't know how to do." Of course, all designs must be peer reviewed.

·          Peer review: Peer review is one of the most important aspects of the engineering process and it's through code and design review that we get more IQ involved on all important and long lasting decisions. All designs and all code changes must be reviewed before being checked in. Make sure your reviewer has the understanding of the code needed to do a good job and don't accept rushed or sloppy reviews. Reviewing is a responsibility and those that do an excellent job deserve and will get special credit. All teams should know who their best reviewers are and should go to them when risk or complexity is higher than normal. When there are problems, I will always ask who reviewed the work. Reviewing is a responsibility we all need to take seriously.

·          Personal integrity: It's impossible for a team to function and be healthy if we're not honest with each other and especially with ourselves. We should be learning from our mistakes and improving constantly and this simply isn't possible unless we admit our failures. If a commitment is made, it should be taken seriously and delivered upon.  And, when things go wrong, we need to be open about it.

·          Engineering process clarity: The engineering process including Development, PM, and Test should be simple and documented in sufficient detail that a new team member can join the team and be rapidly effective. It should be maintained and up-to-date and, when we decide to do something differently, it'll be documented here.

·          Follow-through and commitment to complete: In engineering, the first 10% of the job is often the most interesting while the last 10% is the most important. Professional engineering is about getting the job done. Completely.

·          Schedule integrity: Schedules on big teams are often looked at as "guidance" rather than commitments. Schedules, especially those released externally, are taken seriously and, as a consequence, external commitments need to be more carefully buffered than commitments made within your team. One of the best measures of engineering talent is how early scheduling problems are detected, admitted, and corrected. Ensure that there is sufficient time for "fit and finish" work. Ensure that the spec is solid early. Complete tests in parallel. Don't declare a feature to be done until at least 70% of the planned functional tests are passing (a SQL Server specific metric that I believe was originally suggested by Peter Spiro), and the code is checked in. Partner with dependent components for early private testing. When a feature is declared done, there should be very few bugs found subsequently, and none of these should be obvious.

·          Code base quality: Code owners are expected to have a multiple release plan for where the component is going. Component owners need to understand competitors, current customer requirements, and are expected to know where the current implementation is weak and have a plan to improve it over the next release or so.  Code naturally degrades over time and, without a focus on improvement, it becomes difficult to maintain over time. We need to invest 15 to 20% of our overall team resources in code hygiene. It's just part of us being invested in winning over multiple releases. We can’t afford to get slowed or wiped out by compounding code entropy as Sybase was.

·          Contributing to and mentoring others: All members of the team bring different skills to the team and all of us have an obligation to help others grow. Leads and more experienced members of the team should be helping other team members grow and gain good engineering habits. All team members have a responsibility to help others get better at their craft and part of doing well in this organization is in helping the team as a whole become stronger. Each of us has unique skills and experiences -- look for ways to contribute and mentor other members of the team.

·          QFEs: must be on time and of top quality. QFEs are one of the few direct contact points we have with paying customers and we take them very seriously, prioritizing QFEs above all other commitments. Generally, we put paying customers first. When a pri-1 QFE comes in, drop everything and, if necessary, get help. When a pri-2 or pri-3 comes in, start within the next one or two days at worst. Think hard about QFEs -- don't just assume that what is requested represents what the customer needs nor that the solution proposed is the right one. We intend to find a solution for the customer but we must choose a fix that we can continue to support over multiple releases. Private QFEs are very dangerous and I'm generally not in support of them. Almost invariably they lead to errors or regressions in a future SP or release. The quality of QFEs can make or break a customer relationship and regressions in a "fix" absolutely destroy customer confidence.

·          Shipped quality: This one is particularly tough to measure but it revolves around a class of decision that we have to make every day when we get close to a shipment: did we allow ourselves enough time to be able to fix bugs that will have customer impact or were we failing and madly triaging serious bugs into the next release trying to convince ourselves that this bug "wasn't very likely" (when I spend time with customers I'm constantly amazed at what they actually do in their shops – just about everything is likely across a sufficiently broad customer base). And, there's the flip side, did we fix bugs close to a release that destabilized the product or otherwise hurt customer satisfaction. On one side, triaging too much and on the other not enough and the only good way out of the squeeze is to always think of the customer when making the decision and to make sure that you always have enough time to be able to do the right thing.

·          Check-in quality: The overall quality of the source tree impacts the effectiveness and efficiency of all team members. Check-in test suites must be maintained, new features should get check-in test suite coverage, and they must run prior to checking in. To be effective, check-in test suites can't run much longer than 20 to 40 minutes so, typically, additional tests are required. Two approaches I've seen work in the past: 1) gauntlet/snap pre-checkin automation, or 2) autobuilder post-checkin testing.

·          Bug limits: Large bug counts hide schedule slippage and the bugs count represents a liability that must be paid before shipping and large bug counts introduce a prodigious administrative cost. Each milestone, leads need to triage bugs and this consumes resources of productive members of the team that could be moving the product forward rather than taking care of the bug base. We will set limits for max number of bugs carried by each team and limits that I've used and found useful in the past are: each team limits active defects to less than 3 times the number of engineers on the team and no engineer should carry more than 5 active defects.

·          Responsibility: Never blame other teams or others on your team for failures. If your feature isn't coming together correctly, it's up to you to fix it. I never want to hear that test didn't test a feature sufficiently, the spec was sloppy, or the developer wasn't any good. If you own a feature, whether you work in Test, Dev, or PM, then you are responsible for the feature being done well and delivered on time. You own the problem. If something is going wrong in some other part of the team and that problem may prevent success for the feature, find a solution or involve your lead and/or their manager. “Not my department.” is not an option.

·          Learn from the past: When work is complete or results come in, consider as a team what can be learned from these results. Post mortems are a key component of healthy engineering. Learn to broadly apply techniques that work well and take quick action when we get results back that don't meet our expectations.

·          Challenge without failure: A healthy team should be giving all team members new challenges and pushing the limits for everyone. However, to make this work, you have to know when you are beyond your limits and, before a problem is no longer solvable, get help. Basically, everyone should step to the plate but, before taking the last strike, get your lead involved. If that doesn't work, get their manager involved. Keep applying this rule until success is found or the goal doesn't appear to be worth achieving.

·          Wear as many hats as needed: On startups, everyone on the team does whatever is necessary for the team to be successful and, unfortunately, this attribute is sometimes lost on larger, more mature teams. If testing is behind, become a tester. If the specs aren’t getting written, start writing. Generally, development can always out-pace test and sometimes can run faster than specs can be written. So self regulate by not allowing development to run more than a couple of weeks ahead of test (don’t check in until 70% of the planned tests are passing) and, if work needs to be done, don’t wait – just jump in and help regardless of what discipline is in short supply.

·          Treat other team members with respect: No team member is so smart as to be above treating others on the team with respect. But do your homework before asking for help – show respect for the time of the person whose help you are seeking.

·          Represent your team professionally: When other teams ask questions, send notes, or leave phone messages ensure that they get quality answers. It’s very inefficient to have to call a team three times to get an answer and it doesn’t inspire confidence nor help teams work better together. Take representing your team seriously and don’t allow your email quotas to be hit or phone messages to go unanswered.

·          Customer Focus: Understand how customers are going to use your feature. Ensure that it works in all scenarios, with all data types, and supports all operating modes. Avoid half done features. For example, don’t add features to Windows that won’t run over Terminal Server and don’t add features to SQL Server that don’t support all data types. Think about how a customer is going to use the feature and don’t take the easy way out and add a special UI for this feature only. If it’s administrative functionality, ensure that it is fully integrated into the admin UI and has API access consistent with the rest of the product. Avoid investing in a feature but not in how a customer uses the feature. For example, in SQL Server there is a temptation to expose new features as yet another stored procedure rather than adding full DDL and integrating into the management interface.

·         Code Serviceability & Self Test: All code should extensively self check.  Rather than simple asserts, a central product or service wide component should handle error recording and reporting.  On failure, this component is called.  Key internal structures are saved to disk along with a mini-dump and stack trace.  This state forms the core of the Watson return data and the central component is responsible for sending data back (if enabled).  Whether or not Watson reporting is enabled, the last N failures should be maintained on disk for analysis. There are two primary goals: 1) errors are detected early and before persistent state is damaged and 2) sufficient state is written to disk that problem determination is possible on the saved state alone and no repro is required.  SQL Server helped force this during the development of SQL Server 2005 by insisting that all failures during system test yield either 1) a fix based upon the stored failure data, or 2) a bug opened against the central bug tracking agent to record more state data to allow this class of issues to be more fully understood if it happens subsequently.  If a customer calls service, the state of the last failure is recorded and can be easily sent in without asking the customer to step through error prone data acquisition steps and without asking for a repro. 

·         Direct Customer feedback: Feedback directed systems like Watson and SQM are amazingly powerful and are strongly recommended for all products.

·         Ship often and incrementally: Products that ship frequently stay in touch with their customers and respond more quickly to changes in the market and to changes in competitive offerings.  Shipping infrequently tends to encourage bad behavior in the engineering community where partly done features are jammed in and V1 ends up being a good quality beta test rather than a product ready for production.  Infrastructure and systems should ship every 18 months, applications at least every 12 months, and services every three months.

·         Keep asking why and polish everything: It’s easy to get cynical when you see things going wrong around you and, although I’ve worked on some very fine teams, I’ve never seen a product or organization that didn’t need to improve.  Push for change across the board.  Find a way to improve all aspects of your product and don’t accept mediocrity anywhere. Fit and finish comes only when craftsmen across the team care about the entire product as a whole. Look at everything and help improve everywhere.  Don’t spend weeks polishing your feature and then not read the customer documentation carefully and critically.  Use the UI and API even if you didn’t write it and spend time thinking of how it or your feature could be presented better or more clearly to customers.  Never say “not my department” or “not my component” … always polish everything you come near.
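The Code Serviceability & Self Test principle above lends itself to a small sketch. What follows is a hypothetical Python illustration, not the SQL Server or Watson implementation: `FailureRecorder` and `self_check` are invented names, the JSON file layout is an assumption, and it does not take a mini-dump or send anything off the machine. It only shows the core shape: one central component that every self check calls on failure, which saves the stack trace plus key state to disk and keeps the last N failures.

```python
import json
import time
import traceback
from pathlib import Path


class FailureRecorder:
    """Central failure recorder that keeps the last N failure reports on disk.

    Hypothetical sketch: the file naming and JSON layout are assumptions.
    """

    def __init__(self, directory: str, keep_last: int = 5) -> None:
        if keep_last < 1:
            raise ValueError("keep_last must be at least 1")
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.keep_last = keep_last
        self._sequence = 0  # breaks ties when the clock resolution is coarse

    def record(self, exc: BaseException, state: dict) -> Path:
        """Write one report holding the error, its stack trace, and caller state."""
        self._sequence += 1
        report = {
            "time": time.time(),
            "error": repr(exc),
            "stack": traceback.format_exception(type(exc), exc, exc.__traceback__),
            "state": state,
        }
        name = f"failure-{time.time_ns():020d}-{self._sequence:06d}.json"
        path = self.directory / name
        path.write_text(json.dumps(report, indent=2, default=repr))
        self._prune()
        return path

    def _prune(self) -> None:
        """Delete everything except the newest keep_last reports."""
        reports = sorted(self.directory.glob("failure-*.json"))
        for old in reports[:-self.keep_last]:
            old.unlink()


def self_check(recorder: FailureRecorder, condition: bool, message: str, **state) -> None:
    """Like assert, but records state through the central recorder before failing."""
    if condition:
        return
    try:
        raise AssertionError(message)
    except AssertionError as exc:
        recorder.record(exc, state)
        raise
```

A caller would replace a bare `assert page.lsn >= 0` with something like `self_check(recorder, page.lsn >= 0, "negative LSN", page_id=page.id)`, where `page` is whatever structure is being validated. The saved report is then available for problem determination without asking anyone for a repro.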

 

 

 

 

ProfessionalEngineering.docx (19.79 KB)
Tuesday, November 06, 2007 4:55:16 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Process
 Saturday, November 03, 2007

Last week Hillary Clinton presented at Microsoft to a sold out crowd of roughly 2,000 people.  Jennifer Hamilton attended and sent her notes my way.

 

                                                                --jrh

 

o    About 2000 people

o    Speech similar to one given on Monday night with a bit more of a technology focus

·         US has always been the "Innovation Nation"--a hallmark of how country was founded and has grown

·         Can't assume it will stay that way--have to ask the hard questions and build

·         Don't think we're doing a good job--want to seize the mantle of innovation

·         Important not just for our industry but for the country

·         Innovation has fueled the opportunities of those born here and those who came here

o    4 big goals:

1.       Restore American leadership in the world

2.       Rebuild a strong and prosperous middle class

3.       Reform government to competence and to be results-oriented

4.       Reclaim the future for our children and our dreams

o    For each of the four goals, she has set specific goals for what she would do as president

o    Spoke of Sputnik being a defining moment in her childhood

·         At that time America was the leader in everything

·         Then Sputnik came and called that into question

·         Had a republican pres that didn't blame the dems but went after the problem

·         Wants to do that same sort of thing

1.       Restore American leadership in the world

·         Partly it's Iraq but this is not the only international problem the next president will inherit

·         Our strategic/economic/innovation position eroding -- Clinton will restore the bi-partisan balance and end an era of "cowboy diplomacy"

·         Can't be a leader if no-one is following

o   All the problems we have, global-warming, g-terrorism, g-economics, we can't solve on our own

2.       Rebuild a strong and prosperous middle class

·         Economy has worked well for some of us, but hasn’t for many.

·         People struggling to maintain middle-class lifestyle.

·         Feel invisible to their government.

·         Feel they're standing on a trap-door--one misstep from disaster

·         Environmental a big part: we import more foreign oil post-9/11 than before

·         Take away tax-subsidy from oil companies to put towards alternative energies

·         Health-care (joked it’s an issue she has a "little experience in")--need a system of shared responsibility and choices

o    Insurance companies will have to change--she's offering them a new business model--they've made a lot of money not insuring people

·         50B spent in underwriting to avoid coverage plus more unproductive costs arguing on coverage

·         Big push towards electronic records for medical records

·         One of big problems in Katrina is how many records were lost

·         Wants to create a framework to give us private, confidential, secure electronic records

o    Also need to pay for prevention--insurance companies won’t

o    And manage chronic conditions

o    All added up will reduce costs and cover everyone

·         Improve education--it hasn't advanced either

o    Need to make college affordable and offer cheaper loans

o    Harder to go to university than 30 yrs ago

o    75% of students are from top 25% of income

o    Only 3% from bottom 25%

3.       Reform government to competence and to be results-oriented

·         We have been building a two-tier system

·         Tax system tilted towards top income

·         US was #1 for internet access 6 yrs ago--now 14th-25th depending on survey

·         Got to end Bush's muzzling of science

o    As president first thing will do is issue executive order to not interfere with science and lift the ban on ethical stem-cell research

·         End cronyism and appoint qualified people--re Katrina

4.       Reclaim the future for our children and our dreams

·         Don't want to be part of 1st gen of Americans who leave their country worse than when they found it.

·         Thrilled at idea of being first woman president, but not running because she is female. She is running because she feels she is the best-qualified

·         Not interested in all the personal attacks--am an expert on it  --have been recipient for over 15 yrs--that won't educate a child

·         Wants people to think that our best years are still ahead of us

 


 

Saturday, November 03, 2007 12:37:26 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings
 Thursday, November 01, 2007

Shankar Pal of SQL Server went to VLDB this year and passed his notes my way.  Find them here: http://www.mvdirona.com/jrh/perspectives/content/binary/ShakarPal_VLDB2007.docx.

 

Key points from my perspective:

·         Werner Vogels

o   Amazon able to lose a data center without missing SLA (note that this would also allow them to bring down a data center for service and implies they don’t need backup power or other datacenter-level redundancy – this can potentially save 20% of the total cost of a data center. I don’t know if they are exploiting this capability)

o   SLAs are two-way: a commitment to deliver a certain quality of service one way and a commitment the other way to deliver no more than a specified load

o   Amazon has implemented their services as a cluster of services. Services can scale up a single node at a time (elastic computing).  All data access is through the services. 

o   Repeats Stonebraker’s “One size doesn’t fit all” in databases.

·         Eric Brewer:

o   Founder of Inktomi and Berkeley DB researcher

o   Discussed work he is doing in the third world.

·         Surajit Chaudhuri and Vivek Narasayya presented a retrospective on self-tuning database management systems

·         Michael Stonebraker

o   Presented “the end of an era: It’s time for a rewrite” and essentially argued that the current set of “elephants” in DB2, SQL Server, and Oracle are optimized for OLTP in a small memory world.  Outside of OLTP, these products are a poor fit and, even in OLTP, large memory systems make their disk I/O optimizations much less relevant.

o   He argues to get rid of redo log by keeping many different copies (I’m not ready to get rid of the redo log but I totally agree on the base point)

o   Mike still doesn’t buy that eventual consistency is the right model for high scale distributed systems

 

The conference proceedings are at:

http://sqlserver/projects/clouddb/Conferences/Forms/AllItems.aspx?RootFolder=%2fprojects%2fclouddb%2fConferences%2fVLDB%20%28Int%27l%20Conf%20on%20Very%20Large%20Data%20Bases%29%202007%2fVLDB%202007%20Proceedings%2fVLDB%202007%20Proceedings&FolderCTID=&View=%7b3CB64B8A%2d1B85%2d45AE%2d91B6%2d4063E886D023%7d

 

                                                --jrh

 


 

ShakarPal_VLDB2007.docx (29.92 KB)
Thursday, November 01, 2007 4:52:39 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Tuesday, October 30, 2007

Jacek Becla of the Stanford Linear Accelerator (SLAC http://www.slac.stanford.edu/) team held a one-day workshop on October 25th focused on Extremely Large Databases (http://www-conf.slac.stanford.edu/xldb07/). The goal was to look at “practical issues related to extremely large databases that push beyond the current commercial state of the art”.  SLAC has built some enormous DBs in the past including BaBar (currently over 2 petabytes), and they are working on a Large Synoptic Survey Telescope (http://www.lsst.org/lsst_home.shtml) that is expected to produce an O(100) petabyte DB.

 

The 57 attendees were from industry (Google, Yahoo!, Microsoft, IBM, MySQL, Oracle, Teradata, Vertica, Objectivity, Greenplum), academia (Stonebraker and DeWitt), and national labs (PNNL, LLNL, and Oak Ridge), and also included many astronomers and high-energy physicists. The attendee list is here: http://www-conf.slac.stanford.edu/xldb07/listAllParticipants.asp.

 

What I found most notable is that, 5 years ago, high-energy physicists and astronomers had the largest databases in the world by several orders of magnitude, whereas today industry appears to be catching up. For example, the AT&T call detail and email IP storage is over 1.2 petabytes.

 

The second thing I found interesting was the Google presentation where they talked about the MySQL cluster behind the advertising system.  I found this interesting for two reasons: 1) they are using MySQL rather than Bigtable, and 2) they have two engineers full time on InnoDB maintenance.  The Google speaker didn’t give firm numbers but said there were “100s to 1000s of servers” in the MySQL cluster and the cluster is both geo-distributed and geo-redundant.  5 full-time DBAs manage the cluster.

 

My rough notes from the workshop follow.

 

Examples of Future of Large Scale Scientific Databases:

·         Large Hadron Collider (LHC) -- Dirk Duellmann, CERN IT

·         15 PB each year (after discarding through filtering)

o   Metadata is roughly 1/3 of a petabyte

·         100s of thousands of CPUs for analysis

·         200 computer centers with 12 large centers

·         Most of the data filesystem based

o   1995 Object DBs

o   2001 forward: OODBs dying so move to RDBMS + files

o   Using Oracle and MySQL for metadata with all data in filesystem

o   <5% metadata

o   Note that the mixed model is much more administratively expensive.

o   Using compression

·         Focus today is on data management rather than analysis but the collider is not yet live so this may change.

·         Dirk really wants more data under database management

·         Analysis phase is never overwrite (read-only and produce new data)

·         OS RHEL 5

·         CERN and Tier 1 storage in Oracle RAC 10g (4-way), Tier 2 is MySQL & SQLite

o   100 MB/S IOs per cluster at present (expect to be able to grow 5x as it goes live)

o   0.5 TB RAM

o   Moving to quad core and 64bit

o   5 DBA for all the storage

o   3 9s achieved

·         Deployment issues

o   Power and UPS

o   Increasing CPU power per box and more disk per server

o   JBOD and Oracle ASM today

o   Many hardware problems in commodity systems

o   Oracle patching issues (some security patches don’t support rolling upgrade – getting better)

o   Global system monitoring is difficult

o   Software licensing and not all sites upgrading at the same time.

o   Note: DB is exposed directly to internet

·         During analysis:

o   DBs get in the way and B-trees don’t help much

o   Typical queries “select … where v1>4 and v2>5 and …. And V99>3”

§  Bit map indexes would help but very space intensive

o   Large data sets >(10^9) input data sets
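
The bitmap-index idea mentioned for these many-predicate queries can be illustrated with a small sketch. This is only a toy, not anything CERN described: the column names, thresholds, and table contents are made up, and each bitmap is stored as a plain Python integer with one bit per row. It shows why bitmaps suit conjunctions (the AND of the predicates becomes a bitwise AND) and hints at the space cost, since every indexed (column, threshold) pair needs its own bit per row.

```python
import random

def build_bitmap(values, threshold):
    """Return an int whose bit i is set when values[i] > threshold."""
    bits = 0
    for i, v in enumerate(values):
        if v > threshold:
            bits |= 1 << i
    return bits

def matching_rows(bitmap):
    """Yield the row numbers whose bit is set in the bitmap."""
    row = 0
    while bitmap:
        if bitmap & 1:
            yield row
        bitmap >>= 1
        row += 1

# Hypothetical table: 3 columns, 1,000 rows of small integers.
random.seed(0)
n_rows = 1000
columns = {name: [random.randint(0, 9) for _ in range(n_rows)]
           for name in ("v1", "v2", "v3")}

# One precomputed bitmap per (column, threshold) predicate.
predicates = {"v1": 4, "v2": 5, "v3": 3}
bitmaps = {name: build_bitmap(columns[name], t)
           for name, t in predicates.items()}

# "v1>4 and v2>5 and v3>3" is a bitwise AND across the bitmaps.
result = (1 << n_rows) - 1
for b in bitmaps.values():
    result &= b
hits = list(matching_rows(result))

# Cross-check against a plain row-by-row scan of the same data.
expected = [i for i in range(n_rows)
            if all(columns[name][i] > t for name, t in predicates.items())]
```

With 99 predicates over billions of rows, the same approach would need 99 bitmaps of billions of bits each for a single choice of thresholds, which is one way to read the "very space intensive" remark.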

 

Extremely Large Database in Astronomy – LSST Kian-Tat Lim

·         Large Synoptic Survey Telescope (will be placed in Chile)

o   Will cover the entire night sky twice a week

·         Assets:

o   8.4m mirror

o   3.2 gigapixel camera (wow!)

·         Looking for dark matter and energy

·         Store images in FS and metadata in DBs

·         Most of the data is append only

·         How big when completed:

o   49B objects

o   2.8 trillion sources

o   Expected to hit 14PB by 2024 (5.5PB data/rest indices)

o   2669 columns/object (growing) [object is astronomical object]

o   56 columns/source  [source is a particular observation of an astronomical object]

o   Believe that the system is comparable to commercial systems in complexity and size (assuming commercial systems continue to grow)

·         Never modifies raw data (databases are updated)

·         RAW data is never modified and constantly reprocessed with detailed provenance tracking

·         Plan to release a new data version (raw data and current processing) once a year.

·         Note ½ the science will come out from real time alerts of changes (10 to 60 second latency)

·         Expected to upgrade systems constantly so want portable code and preferably open source

·         Three replicated data access centers and one archive center (geo-distributed copies)

·         Lots of select * access so not clearly a win to go column oriented

·         Execution plan:

o   Map/reduce over DBs and FSs

·         Want:

o   procedural primitives (stored procs) in the DB

o   Relax consistency requirements

o   Wants fault tolerant software rather than expensive big iron

 

Academic Panel Notes:

·         55PB image data in FS & 20 PB metadata in DB

·         Computation does not fit into DB support today

o   Model: select data, do pixel-based calcs (as much as 10^11 calcs per data point)

·         Most data is write-once, read-many (metadata like averages does get updated)

·         Need support for:

o   Spatial types (native rather than extender support)

o   Vector types

o   Array types

o   Often approx queries would be helpful to test hypothesis

·         Data access distribution:

o   Statistical astronomy: want to dig into large portion of all data

o   10^10 objects scanned to find data region of interest

 

Industry panel notes:

Google

·         MySQL DB used in advertising (traditional database application)

·         Shard and replicate

·         100s to 1000s of systems in clusters

·         QPS is incredibly large

·         Commodity hardware

·         Constrain the query model to allow scale-out

·         They have a couple of engineers on InnoDB engineering at Google

·         95% of load from querying (not transactional load) – very replicatable load

·         Geo-replicated and within the data center replicated for scaling

·         5 DBAs on this project (RDBMSs are not loved at Google)

·         Said that need SQL DB for OLTP apps … Bigtable not appropriate for this. Need real time replication into Bigtable

·         Bigtable is used for analysis.
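
The "shard and replicate" pattern in the Google notes can be sketched roughly as follows. Nothing here reflects Google's actual implementation: the `ShardedStore` class, the hash choice, and the in-memory dictionaries standing in for MySQL instances are all assumptions made for illustration. The point is only that a key hashes to exactly one shard, writes go to every copy of that shard, and the read-heavy load (95% queries in the notes) is spread across the copies.

```python
import hashlib
import itertools

class ShardedStore:
    """Toy router: hash a key to a shard, write to all copies, rotate reads."""

    def __init__(self, n_shards, replicas_per_shard):
        copies = 1 + replicas_per_shard
        # Each shard is a list of dicts: index 0 is the primary,
        # the remaining entries are its replicas.
        self.shards = [[{} for _ in range(copies)] for _ in range(n_shards)]
        # Round-robin cursor per shard, used to spread reads over copies.
        self._next_copy = [itertools.cycle(range(copies))
                           for _ in range(n_shards)]

    def _shard_for(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def write(self, key, value):
        # Copying synchronously to every replica keeps the sketch simple;
        # production systems often replicate asynchronously instead.
        for copy in self.shards[self._shard_for(key)]:
            copy[key] = value

    def read(self, key):
        idx = self._shard_for(key)
        copy = self.shards[idx][next(self._next_copy[idx])]
        return copy.get(key)

store = ShardedStore(n_shards=4, replicas_per_shard=2)
store.write("ad:1", "campaign-a")
value = store.read("ad:1")
```

Because every query is routed by key to a single shard, there are no cross-shard joins, which is one reading of "constrain the query model to allow scale-out".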

 

AOL:

·         DB behind message board

·         TB scale

·         Need to be always up

·         Using Oracle, Sybase, PostgreSQL (200TB project), and MySQL (small install)

·         Geo-replicated and geo-hosted close to user

 

AT&T Research:

·         Call detail and email IP storage

·         Data stored raw and in DB form

·         About 1.2 PB data

·         Used for billing support, law enforcement, marketing, analysis

·         Used a proprietary DB called Daytona

·         Need to load and query simultaneously

·         Write once, read-many DB (never delete)

 

Ebay

·         2 large scale instances of over a PB in analytical DBs

·         24 hour a day query workload with concurrent load

·         8M queries/day

·         Index ratio: only 2% overhead. Mostly full scans

·         Expect system data storage size to triple over next 6 months

·         Storage is not the problem – it’s IO throughput

·         20TB between two instances 1000s of kilometers apart

·         6 to 8B records/day loaded (100B/day available to load but can’t do it)

·         Mixed workloads: loads, transforms, and queries in parallel

·         Weekly full system image backups with concurrent updates and queries

·         9M SQL requests a day (Teradata)

 

Yahoo

·         Operational data stores:

o   No adhoc query

o   Partitioned

o   Very low latency

·         Warehouse:

o   Production load

o   Multi-year analysis

o   Proprietary, custom, column-based data store

·         Business unit data

o   Often Oracle hosted

·         Map/reduce workload

o   Hadoop based with 2,000 nodes

o   Fairly new system

o   They are collaborating on HBase (DB over Hadoop)

·         Stream Processing:

o   “Don’t let the data touch the disk”

o   25B events per day / 25TB per day

·         Note: commodity systems but using NetApp file stores
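
To put the Yahoo stream-processing figures in per-second terms, here is a quick back-of-the-envelope calculation. It assumes decimal units (1 TB = 10^12 bytes) and an even load across the day, neither of which was stated in the talk.

```python
EVENTS_PER_DAY = 25 * 10**9      # 25B events per day (from the notes)
BYTES_PER_DAY = 25 * 10**12      # 25TB per day, assuming 1 TB = 10^12 bytes
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_event = BYTES_PER_DAY / EVENTS_PER_DAY
events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
megabytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY / 10**6

print(f"{bytes_per_event:.0f} bytes per event")
print(f"{events_per_second:,.0f} events per second")
print(f"{megabytes_per_second:,.0f} MB per second")
```

Under those assumptions the average event is about 1 KB, arriving at roughly 290 thousand events and 290 MB every second, which gives some feel for why "don't let the data touch the disk" is attractive.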

 

General Notes from Industry panel:

·         Ebay and AOL using SANs, Yahoo using NetApp, all others using DAS

·         Yahoo planning to move off NetApp

·         Ebay using Teradata with dirty read as most commonly used consistency model

o   Lots of table scans and very few indexes

o   Piggy backed scans are very important to them

·         All speakers want real time access to new data coming in. Batch load warehouses don’t work in general

o   To achieve this, few indexes can be used

·         Everyone on commodity hardware, everyone on commodity disk or trying to get there (ebay on enterprise disk)

·         All using compression and most I/O bound

·         Can’t use sampling when looking for low probability events – need full scans for needle in haystack

o   Note: Some Bigtable are based upon samples

·         “Designing for the unknown query”

o   Fast scans and low effectiveness of indexes driven by unpredictable query load and the need for real time loading

·         Monitor database usage and optimize for evolving usage patterns

·         Google comment: “We want parallel data management but not parallel SQL (too restrictive). Big vendors need to adapt or Hadoop will”

·         Stonebraker observation: Industry is spending MUCH more on data management than Academic research

·         Industry push: free software license (built on own or open source but licensed software is not practical)

·         Amazon: “we do lots of things right – why not make it available for external use?”

·         Yahoo using lots of C++ and increasingly Perl.  Google using lots of Python.

·         AT&T using C and Perl.

·         Google: data is highly non-normalized, lots of programmers, lots of C++.
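
The piggybacked scans mentioned in the eBay notes above amount to letting several queries share one pass over a table rather than each paying for its own full scan. The sketch below is a minimal illustration of that idea, not Teradata's mechanism: the `shared_scan` function, the query format, and the sample rows are all hypothetical.

```python
def shared_scan(rows, queries):
    """Make one pass over rows, feeding every row to every registered query.

    queries maps a query name to a (predicate, aggregate) pair, where
    aggregate(acc, row) folds a matching row into a running total.
    """
    results = {name: 0 for name in queries}
    for row in rows:  # the table is read exactly once
        for name, (predicate, aggregate) in queries.items():
            if predicate(row):
                results[name] = aggregate(results[name], row)
    return results

# Hypothetical sales table.
rows = [
    {"item": "book", "price": 12, "qty": 2},
    {"item": "lamp", "price": 40, "qty": 1},
    {"item": "book", "price": 15, "qty": 3},
]

# Two unrelated queries that would otherwise each need a full scan.
queries = {
    "book_units": (lambda r: r["item"] == "book",
                   lambda acc, r: acc + r["qty"]),
    "revenue_over_20": (lambda r: r["price"] * r["qty"] > 20,
                        lambda acc, r: acc + r["price"] * r["qty"]),
}

totals = shared_scan(rows, queries)
```

When the system is I/O bound and indexes are scarce, as the panel described, the cost of the scan dominates, so adding another query to a scan already in progress is close to free.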

 

Academic Panel: Stonebraker & Dewitt:

·         Sequoia project goal: make Postgres work for the scientific community.

o   Was not successful for two reasons:

1.       Technical issues (Postgres wasn’t able to do some of what was needed)

2.       Scientific community doesn’t partner very well.  Research community wanted stable, hyper-reliable production systems.  What they want doesn’t exist and the community won’t invest in the research needed to get those production systems.

·         ASAP prototype was designed to be better than Sequoia for scientific workloads

o   Array based data model

o   Native lineage tracking

o   Supports uncertainty

o   BUT…. Can’t find the user community that wants to use it …. “you folks don’t partner well”

·         If you don’t care about perf, use MySQL etc.   If you do, use vertical segment optimized products (10 to 100x faster)

·         The great debate: a complicated solution (CODASYL) vs a simple solution (relational) – Mike’s point is that the scientific use of persisted C++ is closer to the complicated solution and probably heading in the wrong direction.

·         Need to get the community together and figure out core requirements.  Need to start sharing infrastructure and building on common kernels.

 


 

Tuesday, October 30, 2007 4:31:48 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Monday, October 29, 2007

I love commodity parts and I like cars, so this one caught my interest. The Tesla Roadster (http://www.teslamotors.com/) battery pack is made up of many cells exactly the same as those in the IBM T60p that I’m typing this on. Laptop battery configurations differ dramatically, but most contain multiple 18650 form-factor batteries (http://www.molienergy.bc.ca/specs/ICR18650G.pdf). The 18650 designator comes from the cell dimensions of 18mm in diameter and 65.0mm in length, a bit larger than an AA battery. Billions are sold each year.

 

How can an automobile with a nearly 250-mile range use the same power source as a laptop computer with four hours of battery life?  Use lots of them. The Tesla uses 6,800 cells in its pack, where each cell is roughly 2,000 mAh.  These cells are combined into a 375V battery capable of delivering 53 kWh of energy or roughly 200 kW of power.
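
As a rough sanity check on the pack energy, the cell count and per-cell capacity above can be multiplied out. The one number not taken from the post is the nominal cell voltage: 3.7 V is assumed here as a typical figure for Li-ion 18650 cells, so treat the result as an estimate rather than a Tesla specification.

```python
CELLS = 6800                 # cell count quoted above
CELL_CAPACITY_AH = 2.0       # roughly 2,000 mAh per cell, quoted above
NOMINAL_CELL_VOLTAGE = 3.7   # assumption: typical nominal Li-ion voltage
PEAK_POWER_KW = 200          # rough peak power quoted above

# Energy (Wh) = cells * amp-hours per cell * volts per cell.
pack_energy_kwh = CELLS * CELL_CAPACITY_AH * NOMINAL_CELL_VOLTAGE / 1000

# How long the pack would last if it could sustain peak power throughout.
minutes_at_peak = pack_energy_kwh / PEAK_POWER_KW * 60

print(f"Estimated pack energy: {pack_energy_kwh:.1f} kWh")
print(f"Minutes at sustained peak power: {minutes_at_peak:.0f}")
```

The estimate lands at about 50 kWh, in the same range as the quoted pack energy, and shows that sustained peak output would drain the pack in about a quarter of an hour.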

 

This battery pack has a fairly high power density as does gasoline or anything else capable of storing enough power to accelerate a car from 0 to 60 MPH in under 4 seconds. High power density gets work done but, if released quickly, can be very destructive so numerous safety devices are employed.  These include an assortment of environmental sensors for conditions such as acceleration, smoke, heat, humidity, current, and moisture that actively disconnect the battery pack when anomalies are detected.  In addition to these active safety devices, an array of passive safety measures is in place as well.

 

For more details on the Tesla battery pack design: http://www.teslamotors.com/display_data/TeslaRoadsterBatterySystem.pdf.

 

Lead-acid batteries, the common choice for data center backup power, are less dense and much more maintenance intensive than Li-ion. Large arrays of 18650s might be more cost effective.  This backup power technology might even be a better choice for distributed uninterruptible power supplies (UPS).  In this configuration, rather than a large central UPS, a small UPS is installed near the servers (typically one per rack).  Li-ion cells don’t emit hydrogen gas as lead-acid cells do when charged and are more tolerant of the higher temperatures found near the servers.

 

                                                --jrh

 

Monday, October 29, 2007 4:53:36 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

All Content © 2012, James Hamilton