Wednesday, January 02, 2008

FusionIO has released specs and pricing data on their new NAND flash SSD: http://www.fusionio.com/products.html (Lintao Zhang of Microsoft Research sent it my way). They claim 100,000 IOPS, 700 MB/s sustained read, and 600 MB/s sustained write. Impressive numbers, but let's dig deeper. In what follows, I compare the specs of the FusionIO part with a "typical" SATA disk. For comparison with the 80GB FusionIO part, I'm using the following SATA disk specs: $200, 750GB, 70 MB/s sustained transfer rate, and 70 random I/O operations per second (IOPS). Specs change daily, but it's a good approximation of what we get from a commodity SATA disk these days.

 

Obviously, the sustained read and write rates that FusionIO advertises are substantial. But to be truly interesting, they have to produce higher sustained I/O rates per dollar than magnetic disks, the high-volume commodity competitor. A SATA disk runs around $200 and produces roughly 70 MB/s sustained I/O. Looking at read and normalizing for price by comparing MB/s per dollar, we see that the FusionIO part delivers roughly 0.29 MB/s/$ whereas the SATA disk delivers 0.35 MB/s/$. The disk produces slightly better sequential transfer rates per dollar. This isn't surprising, in that we know disks are actually respectable at sequential access; this workload pattern is not where flash really excels. For sequential workloads, or workloads that can be made sequential, at the current price point I wouldn't recommend flash in general or the FusionIO SSD in particular. Where they really look interesting is in workloads with highly random I/Os.

 

Looking at capacity, there are no surprises. Normalizing to dollars per GB, we see the FusionIO part at $30/GB and the SATA disk at roughly $0.27/GB. Where capacity is the deciding factor, magnetic media is considerably cheaper and will be for many years. Capacity per dollar is not where flash SSDs look best.

 

Where flash SSDs really excel, and where the FusionIO part is particularly good, is in random I/Os per second. They advertise over 100,000 random 4k IOPS (87,500 8k IOPS) whereas our SATA disk can deliver about 70. Again normalizing for cost and looking at IOPS per dollar, we see the FusionIO SSD at roughly 41 IOPS/$ whereas the SATA disk is only 0.35 IOPS/$. Flash SSDs win, and win big, on random I/O workloads like OLTP systems (which are usually random-I/O-operation bound). These workloads typically run on the smallest and fastest disks money can buy, and yet still can't use the entire disk capacity since the workload I/O rates are so high. To support these extremely hot workloads using magnetic disk, you must spread the data over a large number of disks to dilute the workload I/O rate down to what each disk can support.
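To make the normalization above concrete, here's a minimal sketch of the arithmetic in Python. The FusionIO price isn't quoted directly in their material; I'm assuming roughly $2,400 for the 80GB part, which is what the $30/GB figure above works out to, so treat the FusionIO numbers as approximations.

# Rough price/performance normalization for the two devices discussed above.
# The FusionIO price (~$2,400) is an assumption implied by the $30/GB figure;
# the SATA numbers are the ones used throughout this post.
devices = {
    "FusionIO 80GB": {"price": 2400.0, "gb": 80,  "mb_per_s": 700, "iops": 100000},
    "SATA 750GB":    {"price": 200.0,  "gb": 750, "mb_per_s": 70,  "iops": 70},
}

for name, d in devices.items():
    print(f"{name}: "
          f"{d['mb_per_s'] / d['price']:.2f} MB/s/$, "
          f"${d['price'] / d['gb']:.2f}/GB, "
          f"{d['iops'] / d['price']:.2f} IOPS/$")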

 

For workloads where the random I/O rates are prodigious and the overall database sizes fairly small, flash SSDs are an excellent choice. How do we define "fairly small"? I look at it as a question of I/O density, which I define as random IOPS per GB. The SATA disk we're using as an example can support 0.09 IOPS/GB (70/750). If the workload requires less than 0.09 IOPS/GB, it will be disk-capacity bound; if it needs more than 0.09 IOPS/GB, it's I/O bound. Assuming the workload is I/O bound, how do we decide whether SSDs are the right choice? Start by figuring out how many disks would be required to support the workload and what they would cost: take the sustained random IOPS required by the application and divide by the number of IOPS each disk can sustain (70 in the case of our example SATA drive, or 180 to 200 if using enterprise disk), then multiply by the price per disk. That's the cost of supporting the application using magnetic disk. Now figure the same number for flash SSD: divide the aggregate workload I/O rate by the sustained random IOPS the SSD under consideration can deliver to get the number of SSDs needed to support the I/O rate. Given that flash SSDs deliver very high I/O densities, you also need to ensure you have enough SSDs to store the entire database. Take the maximum of the number of SSDs required to store the database (if capacity bound) and the number required to support the I/O rate (if IOPS bound), and that's the number of SSDs needed. Compare the cost of those SSDs with the cost of the disks required to support the same workload, and see which is cheaper. For VERY hot workloads, flash SSDs will be cheaper than hard disk drives.
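Here's a small sketch of that sizing exercise. The device numbers are the same ones used above (including the same assumed ~$2,400 FusionIO price); the workload numbers are purely hypothetical, chosen only to show the mechanics.

# Sketch of the disk-vs-SSD sizing logic described above. Workload numbers
# are hypothetical; device numbers match the examples in this post.
import math

def devices_needed(db_gb, workload_iops, dev_gb, dev_iops):
    by_capacity = math.ceil(db_gb / dev_gb)        # enough devices to hold the data
    by_iops = math.ceil(workload_iops / dev_iops)  # enough devices to sustain the I/O rate
    return max(by_capacity, by_iops)

db_gb, workload_iops = 500, 20000                  # hypothetical hot OLTP database

sata = devices_needed(db_gb, workload_iops, dev_gb=750, dev_iops=70)
ssd = devices_needed(db_gb, workload_iops, dev_gb=80, dev_iops=100000)

print(f"I/O density: {workload_iops / db_gb:.2f} IOPS/GB")
print(f"SATA disks needed: {sata} (~${sata * 200:,})")
print(f"FusionIO SSDs needed: {ssd} (~${ssd * 2400:,})")

For this (made up) workload the disk farm is sized by IOPS while the SSDs are sized by capacity, and the SSDs come out well ahead, which is exactly the "very hot workload" case described above.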

 

I should point out there are many other factors potentially worth considering when deciding whether a flash SSD is the right choice for your workload, including the power consumption of the disk farm, the failure rate and cost of service, and the wear-out rate and exactly how it was computed for the SSDs. The random I/O rate is the biggest differentiator and the most important for many workloads, so I haven’t considered these other factors here.

 

Looking more closely at the FusionIO specs, we see they give random IOPS numbers but don't specify the read-to-write ratio they can support. We really need to see the number of random read and write IOPS that can be sustained. This is particularly important for flash SSDs since getting extremely high random write I/O rates out of these devices is VERY difficult, and this is typically where current-generation devices fall down. In the case of the FusionIO SSD, we still don't have the data we need and would want third-party benchmark results before making a buying decision.
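Since vendors rarely volunteer the random write number, it's worth measuring yourself. Below is a very rough sketch of a random-write IOPS measurement; it's illustrative only (a serious test would use direct I/O, queue depths greater than one, and a preconditioned device), and the test file path and sizes are assumptions.

# Minimal random-write IOPS sketch (illustrative only; a serious test would
# use O_DIRECT, queue depth > 1, and a preconditioned device).
import os, random, time

PATH = "/mnt/ssd/testfile"      # hypothetical test file on the device under test
IO_SIZE = 4096                  # 4KB writes, matching the vendor's 4k IOPS claim
FILE_SIZE = 1 << 30             # 1GB test region
DURATION = 10                   # seconds

buf = os.urandom(IO_SIZE)
fd = os.open(PATH, os.O_RDWR | os.O_CREAT)
os.ftruncate(fd, FILE_SIZE)

ops, deadline = 0, time.time() + DURATION
while time.time() < deadline:
    offset = random.randrange(0, FILE_SIZE // IO_SIZE) * IO_SIZE
    os.pwrite(fd, buf, offset)  # random 4KB write
    os.fsync(fd)                # force it to the device, not the OS cache
    ops += 1
os.close(fd)

print(f"~{ops / DURATION:.0f} random write IOPS")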

 

Another option to consider for very hot workloads is to move the workload into memory when the I/O densities are extremely high. Memory prices continue to fall and several memory-appliance start-ups have recently emerged. I suspect hybrid devices that combine very large DRAM caches with 10 to 100x larger flash stores will emerge as great choices over the next year or so. Of the memory-appliance vendors, I find Violin Memory to be the most interesting of those I've looked at (http://www.violin-memory.com/).

 

I do love the early performance numbers that FusionIO is now reporting—these are exciting results. But remember, when looking at flash SSDs, including the FusionIO part, you need to get the random write IOPS rate before making a decision.  It’s the hardest spec to get right in a flash SSD and I’ve seen results as divergent as random write IOPS at 1/100th the (typically very impressive) read rate. Ask for random write IOPS.

 

                                                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Wednesday, January 02, 2008 12:38:35 AM (Pacific Standard Time, UTC-08:00)
Hardware
 Saturday, December 22, 2007

I’m online over the holidays but everyone’s so busy there isn’t much point in blogging during this period.  More important things dominate so I won’t be posting until early January. 

 

Have a great holiday.

 

                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201

Saturday, December 22, 2007 8:59:11 AM (Pacific Standard Time, UTC-08:00)
Hardware | Process | Ramblings | Services
 Tuesday, December 18, 2007

Yesterday Nanosolar announced it has started to sell cheap solar panels at a materials cost of roughly $1/W and a finished panel cost of ~$2/W: http://www.reuters.com/article/technologyNews/idUSN1846022020071218.  As a reference point, coal powered electricity is about $1/W when considering the fully burdened cost of the plant and operations.  When a new technology is within a factor of two of an old, heavily invested technology, I get excited.  This is getting interesting.

 

Nanosolar was founded in 2002 and has plants in San Jose, CA and another near Berlin, Germany. They received $20M from Mohr Davidow Ventures.  Other investors include Sergey Brin and Larry Page.

 

More articles on Nanosolar: http://www.nanosolar.com/articles.htm.  Thanks to Will Whitted for directing me to Nanosolar.

 

                                    --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Tuesday, December 18, 2007 11:04:47 PM (Pacific Standard Time, UTC-08:00)
Hardware
 Monday, December 17, 2007

This note on the Google infrastructure was sent my way by Andrew Kadatch (Live Search) by way of Sam McKelvie (Cloud Infrastructure Services).  It’s not precise in all dimensions, but it does a good job of bringing together the few facts that have been released by Google, and it points to its references if you are interested in a deeper dive.

 

The infrastructure summary is at: http://highscalability.com/google-architecture.  No new data here, but a good summary if you haven’t read all the direct sources from which the summary was drawn.

 

                                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-C/1279, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh 

Monday, December 17, 2007 11:03:33 PM (Pacific Standard Time, UTC-08:00)
Services

Last week Google announced their Renewable Energy Cheaper than Coal (RE<C) initiative. This is a two-pronged effort combining external investment with internal Google research. The external investments include $10M in Makani Power, a startup aiming to harness high-altitude winds, which Makani explains have the highest energy density of any renewable energy source. Google also invested in eSolar as part of this program; eSolar is a startup aiming to produce utility-scale solar energy farms ranging from 25MW to 500MW.

 

More detail on:

·         Makani: http://www.google.com/corporate/green/energy/makani.pdf

·         eSolar: http://www.google.com/corporate/green/energy/esolar.pdf

 

In addition to the external investments, Google is staffing a team with a mission of assembling 1 Gigawatt of renewable energy capacity that is cheaper than coal.  More detail at: http://www.google.com/corporate/green/energy/index.html.

 

It's hard to argue with the "cheaper than coal" target. If renewable resources actually could be less expensive than coal, they would be used without social pressure, government subsidies, or other external tweaks to the market. Looking at the business side, this looks like a good marketing investment for Google as well. Google's datacenter footprint continues to grow at an unprecedented pace while, at the same time, datacenter power consumption is becoming a social concern and attracting regulatory attention. This is clearly a well-timed and well-thought-through business move.

 

                                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Monday, December 17, 2007 10:59:56 PM (Pacific Standard Time, UTC-08:00)
Hardware
 Friday, December 14, 2007

The number 1 Amazon AWS requirement just got met: structured storage.  Amazon announced SimpleDB yesterday although it’s not yet available for developers to play with.  I’m looking forward to being able to write a simple application against it – I’ve had fun with S3. But, for now, the docs will have to do.

 

In the announcement (http://aws.amazon.com/simpledb), it's explained that SimpleDB is not a relational DB. AWS notes that some customers run relational DBs in EC2, and those that need complex or strictly enforced schemas will continue to do so. Others that only need a simple structured store with much less administrative overhead will use SimpleDB. AWS explains that they will make it increasingly easy to do both.

 

The hard part of running an RDBMS in EC2 is that there is no data protection. The EC2 local disk is 100% ephemeral. Some folks are using block replicators such as DRBD (http://www.drbd.org/) to keep two MySQL systems in sync. It's a nice solution but requires some skill to set up. When AWS says "make it easier," I suspect they are considering something along these lines. A block replicator would be a wonderful EC2 addition. However, for those that really only need a simple structured store, SimpleDB is (almost) here today.

 

My first two interests are the data model and pricing. The data model is based upon domains, items, attributes, and values. A domain roughly corresponds to a database, and you are allowed up to 100 domains. All queries are within a domain. Within a domain, you create items, each with a (presumably unique) ID. You don't need to (and can't) declare the schema of an item, and any item can have any number of attributes up to 256 per item. A given item may or may not have a particular attribute, and it can have the same attribute more than once with different values. Values are simple UTF-8 strings limited to 1024 bytes. All attributes are indexed.
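To make the shape of that model concrete, here's a toy in-memory illustration of domains, items, attributes, and multi-valued attributes. This is not the SimpleDB API, just a sketch of the data model described above.

# Toy illustration of the SimpleDB data model: a domain holds items, each item
# holds attributes, and each attribute can carry multiple string values.
from collections import defaultdict

domain = defaultdict(lambda: defaultdict(list))   # item_id -> attribute -> [values]

def put(item_id, attribute, value):
    domain[item_id][attribute].append(str(value))  # values are strings (1024-byte limit in SimpleDB)

put("book-123", "title", "The Mythical Man-Month")
put("book-123", "author", "Brooks")
put("book-123", "keyword", "software")     # the same attribute...
put("book-123", "keyword", "engineering")  # ...can repeat with different values

# A query is a scan within a single domain; every attribute is effectively indexed.
matches = [i for i, attrs in domain.items() if "software" in attrs.get("keyword", [])]
print(matches)   # ['book-123']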

 

The space overhead gives a clue to the storage format (a small sketch of this billing arithmetic follows the list):

·         Raw byte size (GB) of all item IDs + 45 bytes per item +

·         Raw byte size (GB) of all attribute names + 45 bytes per attribute name +

·         Raw byte size (GB) of all attribute-value pairs + 45 bytes per attribute-value pair
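A small sketch of that billing arithmetic on a hypothetical item (I'm reading "attribute-value pair" as the value bytes here; the published formula is ambiguous on that point):

# Sketch of the SimpleDB billable-storage formula quoted above.
def billable_bytes(items):
    # items: {item_id: {attribute_name: [values]}}
    total = 0
    for item_id, attrs in items.items():
        total += len(item_id.encode("utf-8")) + 45             # item IDs + 45 bytes per item
        for name, values in attrs.items():
            total += len(name.encode("utf-8")) + 45             # attribute names + 45 bytes per name
            for value in values:
                total += len(value.encode("utf-8")) + 45        # attribute-value pairs + 45 bytes per pair
    return total

item = {"book-123": {"title": ["The Mythical Man-Month"],
                     "keyword": ["software", "engineering"]}}
print(billable_bytes(item), "billable bytes")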

 

The storage format appears to be a single ISAM index of (item, attribute, value).  It wouldn’t surprise me if the index used in SimpleDB is the same code that S3 uses for metadata lookup.

 

The query language is respectable and includes: =, !=, <, >, <=, >=, STARTS-WITH, AND, OR, NOT, INTERSECTION, and UNION. Queries are resource-limited to no more than 5 seconds of execution time.

 

The storage model, like S3, is replicated asynchronously across data centers with the good and the bad that comes with this approach: the data is stored geo-redundantly which is wonderful but it is possible to update a table and, on a subsequent request, not even see your own changes.  The consistency model is very weak and the storage reliability is very strong. I actually like the model although most folks I talk to complain that it’s confusing.  Technically this is true but S3 uses the same consistency model.  I’ve spoken to many S3 developers and never heard a complaint (admittedly some just don’t understand it).
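For callers that really do need to see their own writes, the usual coping pattern is to poll and retry for a bounded period. A hedged sketch follows; get_attributes here is just a placeholder for whatever read operation the final API exposes.

# Sketch of coping with eventual consistency: after a write, a subsequent read
# may not reflect it yet, so poll with a bounded retry. 'get_attributes' is a
# placeholder for whatever read operation the API provides.
import time

def wait_until_visible(get_attributes, item_id, attribute, expected, timeout=5.0):
    deadline = time.time() + timeout
    while time.time() < deadline:
        values = get_attributes(item_id).get(attribute, [])
        if expected in values:
            return True              # the write has propagated to the replica we read
        time.sleep(0.2)              # back off briefly and try again
    return False                     # still not visible; the caller decides what to do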

 

Pricing was my second interest. There are three charges for SimpleDB:

·         Machine Utilization: $0.14/machine hour

·         Data Transfer:

o    $0.10 per GB - all data transfer in

o    $0.18 per GB - first 10 TB / month data transfer out

o    $0.16 per GB - next 40 TB / month data transfer out

o    $0.13 per GB - data transfer out / month over 50 TB

·         Structured Storage: $1.50 GB/month

 

$1.50/GB/month, or $18/GB/year, is 10x the $1.80/GB/year charged by S3. That's fairly expensive in comparison to S3, but a bargain compared to what it costs to manage an RDBMS and the hardware that supports it.
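The arithmetic behind that comparison, for the record:

# SimpleDB vs. S3 storage cost per GB-year, using the rates quoted above.
simpledb_per_gb_year = 1.50 * 12               # $1.50/GB/month
s3_per_gb_year = 1.80                          # S3's quoted $1.80/GB/year
print(simpledb_per_gb_year / s3_per_gb_year)   # => 10.0x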

 

More data on SimpleDB from AWS: http://aws.amazon.com/simpledb.

 

Sriram Krishnan has an excellent review up at: http://www.sriramkrishnan.com/blog/2007/12/amazon-simpledb-technical-overview.html

 

Thanks to Sriram Krishnan (Developer Division) and Dare Obasanjo (Windows Live Platform Services) for sending this one my way.  I'm really looking forward to playing with SimpleDB and seeing how customers end up using it.  Overall, I find it tastefully simple and yet perfectly adequate for many structured storage tasks.

 

                                    --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Friday, December 14, 2007 10:57:22 PM (Pacific Standard Time, UTC-08:00)
Services
 Wednesday, December 12, 2007

SPECpower is the first industry-standard benchmark that evaluates the power and performance characteristics of high-volume servers. Yesterday the final spec was released (see the announcement below, thanks to Kushagra Vaid of GFS H/W Architecture & Standards).

 

I have quibbles with this benchmark but they really are mostly quibbles.  Generally, I’m thrilled to see a benchmark out there that shines a light on server power consumption.  The entire industry needs it. Benchmarks drive action and we have lots of low hanging fruit to grab when it comes to server power/performance.

 

The press release is up at: http://www.spec.org/power_ssj2008/ and early results are at http://www.spec.org/power_ssj2008/results/power_ssj2008.html.

 

I remember working on IBM DB2 years back when we first ran TPC-A. Prior to the benchmark, we had a perfectly good database management system that customers were using successfully to run some very large businesses. When we first ran TPC-A, let's just say we didn't like what we saw. We got to work and ended up improving DB2 by a full factor of 10 in that release, and then improved it by a further factor of 4 in the next release. Yes, I know the only way to improve that much is to start off REALLY needing it. That's my point. As a direct result of the benchmark giving us and customers increased visibility into OLTP performance, the product improved 40x in less than 5 years. Customers gained, the product got better, and it gave the engineering team good goals to rally around. Benchmarks help customers.

 

As benchmarks age, it's harder to find the big differentiating improvements. Once the easy changes are found and even the more difficult improvements have been worked through, benchmark specials typically begin to emerge. Companies start releasing changes that help the benchmarks but do nothing for real customer workloads and, in rare cases, can even hurt them. Eventually this almost always happens, and it causes industry veterans, myself included, to distrust benchmarks. We forget that in the early years of most benchmarks, they really did help improve the product and delivered real value to customers.

 

I’m very happy to see SPEC release a benchmark that measures power and I’m confident that it will help drive big power efficiency gains. Sure, we’ll eventually see the game playing benchmark specials but, for now, I predict many of the improvements will be real. This will help the industry evolve more quickly.  Now can someone please start working on a data center efficiency benchmark?  Huge innovations are waiting in data center design and even more in how existing designs are deployed.

 

                                                --jrh

 

 

From: osgmembers-request@spec.org [mailto:osgmembers-request@spec.org] On Behalf Of Alan Adamson
Sent: Tuesday, December 11, 2007 8:44 AM
To: osgmembers@spec.org
Subject: (osgmembers-363) SPECpower benchmark availability

 


I am delighted to announce the public availability of the initial SPECpower benchmark, SPECpower_ssj2008 - you can find the initial press release here http://www.spec.org/power_ssj2008/ , and see initial results here :
http://www.spec.org/power_ssj2008/results/power_ssj2008.html

This benchmark release is the result of a long and strenuous process (I cannot recall a committee meeting four days a week for several hours each day), needed partly because of the new ground being plowed.  

I expect that all forthcoming benchmark development groups will be thinking about how to incorporate power measurements, and that the power committee will be looking at extending the scope of its workloads, and also helping other committees learn from their experience.

And now, on to several other benchmarks in or heading for general membership review.  This is a pretty exciting time for the OSG!

My thanks and congratulations to those who worked so hard on this initial SPECpower benchmark.

Alan Adamson                                                                        905-413-5933  Tieline 969-5933   FAX: 905-413-4854
OSG Chair, SPEC                                                                  Internet:
adamson@ca.ibm.com
Java Technology Centre
IBM Toronto Lab

 

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Wednesday, December 12, 2007 10:56:15 PM (Pacific Standard Time, UTC-08:00)
Hardware
 Tuesday, December 11, 2007

Google announced a project in October 2006 to install 9,212 solar panels on the roof of its headquarters complex. Currently over 90% of these panels are installed and active, and Google expects the installation will produce 30% of the peak power requirements of the headquarters buildings. As I post this, they report 0.6 MW produced over the last 24-hour period. That roughly ½ megawatt is arguably in the noise when compared with Google's total data center power consumption, which remains a carefully guarded secret. With high tens of data centers world-wide, each consuming in the 10MW range, the 0.6 produced on the roof of the HQ building is a very small number. Nonetheless, it's MUCH better than nothing and it's great to see these resources committed to making a difference.

 

The real time power production report for this installation can be found at: http://www.google.com/corporate/solarpanels/home.  Thanks to Lewis Curtis for sending this my way.

 

                                    --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Tuesday, December 11, 2007 10:53:11 PM (Pacific Standard Time, UTC-08:00)
Hardware
 Sunday, December 09, 2007

I've long argued that the firm and clear division between development and operations common in many companies is a mistake. Development doesn't feel the pain and so doesn't understand what it takes to make their services more efficient to operate. Operations tends to hire more people to deal with the mess. It doesn't work, it isn't efficient, and it's slow. The right model has the development/ops line very blurred. The Amazon model takes this to an extreme with essentially a "you wrote it, you run it" approach. This is nimble and delivers the pain right back where it can be most efficiently solved: development. But it can be a bit tough on the engineering team. The approach I like most is a hybrid, where operations keeps a very small tier-1 support team and everything else goes back to development.

 

However, if I had to err to one extreme or the other, I would head down the Amazon path. A clear division between ops and dev leads to an over-the-wall approach that is too slow and too inefficient. What's below is more detail on the Amazon approach, not so much because they do it perfectly but because they represent an extreme and are therefore a good data point to think through. Also note the reference to test-in-production in the last paragraph. This is 100% the right approach in my opinion.

 

                                                --jrh

 

When I was at Amazon (and I don't think anything has changed) services were owned soup to nuts by the product team. There were no testers and no operations people. Datacenter operations and security were separate, centralized groups.

 

So 'development' was responsible for much of product definition and specification, as well as the development, deployment, and operation of services. This model came down directly from Jeff Bezos; his intention was to centralize responsibility and take away excuses. If something went wrong it was very clear who was responsible for fixing it. Having said that, Amazon encouraged a no-blame culture. If things went wrong the owner was expected to figure out the root cause and come up with a remediation plan.

 

This had many plusses and a few negatives. On the positive side:

·         Clear ownership of problems

·         Much shorter time to get features into production. There was no hand-off from dev to test to operations.

·         Much less requirement for documentation (both a plus and a minus)

·         Very fast response to operational issues, since the people who really knew the code were engaged up-front.

·         Significant focus by developers on reliability and operability, since they were the people responsible for running the service

·         Model works really well for new products

Negatives:

·         Developers have to carry pagers. On-call rotations require a response to a sev-1 within 15 minutes of being paged, 24 hours a day, which can lead to burn-out.

·         Ramp up curve for new developers is extremely steep because of the lack of documentation and process

·         For some teams the operations load completely dominated, making it very difficult to tackle new features or devote time to infrastructure work.

·         Coordinating large feature development across multiple teams is tough.

 

The one surprise in this is that in my opinion code quality was as good as or better than here, despite having no testers. People developed an approach of deploying to a few machines and observing their behavior for a few days before starting a large roll out. Code could be reverted extremely quickly.

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh 

Sunday, December 09, 2007 10:51:48 PM (Pacific Standard Time, UTC-08:00)
Services
 Thursday, December 06, 2007

Michael Hunter, who authors the Testing and Debugging blog at Dr. Dobb’s Journal, asked me for an interview on testing related topics some time back. I’ve long lamented that, industry-wide, there isn’t nearly enough emphasis on test and software quality assurance innovation. For large projects, test is often the least scalable part of the development process.  So, when Michael offered me a platform to discuss test more broadly, I jumped on it. 

 

Michael structures these interviews, and his subsequent blog entry, around five questions.  These ranged from where I first got involved in software testing, through the most interesting bug I’ve run into, what has most surprised me about testing, what’s the most important thing for a tester to know, and what’s the biggest challenge facing the test discipline over the next five years.

 

Michael’s interview is posted at: http://www.ddj.com/blog/debugblog/archives/2007/12/five_questions_39.html.

 

                                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh

Thursday, December 06, 2007 10:50:14 PM (Pacific Standard Time, UTC-08:00)
Software
 Wednesday, December 05, 2007

Mike Zintel (Windows Live Core) sent this one my way.  It's a short 2:45 video that is not particularly informative but it is creative: http://www.youtube.com/watch?v=fi4fzvQ6I-o.

 

                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh 

Wednesday, December 05, 2007 10:46:39 PM (Pacific Standard Time, UTC-08:00)
Ramblings
 Tuesday, December 04, 2007

Amazon doesn't release much about its inner workings, which is unfortunate in that they do some things notably well and often don't get credit. My view is that making some of these techniques more public would be a great recruiting tool for Amazon, but I understand the argument for secrecy as well.

 

Ronny Kohavi recently pointed me to this presentation on A/B testing at Amazon.  It’s three years old but still well worth reading.  Key points from my perspective:

·         Amazon is completely committed to A/B testing. In past presentations Bezos has described Amazon as a "data driven company". One of the key advantages of a service is you get to see in real time how well it's working. Any service that doesn't take advantage of this and instead makes best-guess or informed-expert decisions is missing a huge opportunity and hurting its business. The combination of A/B testing and cycling through ideas quickly does two wonderful things: 1) it makes your service better FAST, and 2) it takes the politics and influence out of new ideas. Ideas that work win, and those that don't show results don't get used, whether proposed by a VP or the most junior web designer. It's better for the service and for everyone on the team (a minimal sketch of deterministic A/B bucket assignment follows this list).

·         The infrastructure focus at Amazon. Bezos gets criticized by Wall Street analysts for over-investing in infrastructure, but the infrastructure investment gives them efficiency and pricing power, which is one of their biggest assets. The infrastructure investment also allows them to host third parties, which gives Amazon more scale in a business where scale REALLY matters, and it gives customers broader selection, which tends to attract more customers. Most important, it gives Amazon, the data driven company, more data, and this data allows them to improve their service rapidly and give customers a better experience: "customers who bought X…"

·         Negative cost of capital: Slide 8 documents how they get a product on day 0, sell it on day 20, get paid on day 23, and pay the supplier on day 44.

·         Slide 7 shows what can be done with a great infrastructure investment: respectable margins and very high inventory turn rates.
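As promised above, here's a minimal sketch of deterministic A/B bucket assignment. This is a generic illustration of the technique, not Amazon's implementation: hashing the user ID means the same user always sees the same variant, so the two experiences can be compared cleanly.

# Generic deterministic A/B bucketing (an illustration of the technique,
# not Amazon's implementation): the same user always lands in the same
# variant, so results can be compared across the two experiences.
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < treatment_pct else "control"

print(assign_variant("customer-42", "new-checkout-flow"))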

 

The presentation is posted: http://ai.stanford.edu/~ronnyk/emetricsAmazon.pdf

 

                                                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh 

 

Tuesday, December 04, 2007 7:23:10 AM (Pacific Standard Time, UTC-08:00)
Services
 Friday, November 30, 2007

Some months back I finished a paper with Joe Hellerstein and Michael Stonebraker scheduled to be published in the next issue of Foundations and Trends in Databases. The paper is aimed at describing how current generation database management systems are implemented. I'll post a reference to it here once it is published.

 

As a very small part of this paper, we cover the process model used by Oracle, DB2, MySQL, SQL Server, and PostgreSQL. A process model is how a database maps the work it's doing on behalf of multiple concurrent users onto operating system processes and/or threads. This is an important design choice in that it has a fundamental impact on the number of concurrent requests that can be supported, development costs, maintainability, and code base portability, amongst other issues.

 

These same design choices are faced by most high-scale server designers and are equally applicable to mail servers, web servers, app servers, and any other application needing to service large numbers of requests in parallel. Given the importance of the topic and its applicability to all multi-user server systems, it's worth covering separately here. I find it interesting to note that three of the leading DBMSs support more than one process model and one supports four variants. There clearly is no single right answer.

 

Summarizing the process models supported by IBM DB2, MySQL, Oracle, PostgreSQL, and Microsoft SQL Server:

 

1.      Process per DBMS Worker: This is the most straightforward process model and is still heavily used today.  DB2 defaults to process per DBMS worker on operating systems that don't support high quality, scalable OS threads and to thread per DBMS worker on those that do.  This is also the default Oracle process model, but Oracle also supports a process pool, described below, as an optional model.  PostgreSQL runs the process per DBMS worker model exclusively on all operating system ports.

2.      Thread per DBMS Worker: This is an efficient model with two major variants in use today:

a.       OS thread per DBMS Worker: IBM DB2 defaults to this model when running on systems with good OS thread support. This is the model used by MySQL as well.

b.      DBMS Thread per DBMS Worker: In this model DBMS Workers are scheduled by a lightweight thread scheduler on either OS processes or OS threads both of which are explained below. This model avoids any potential OS scheduler scaling or performance problems at the expense of high implementation costs, poor development tools and debugger support, and substantial long-standing maintenance costs.  There are two sub-categories of this model:

                                                              i.      DBMS threads scheduled on OS Processes: a lightweight thread scheduler is hosted by one or more OS processes.  Sybase uses this model and began with the thread scheduler hosted by a single OS process.  One of the challenges with this approach is that, to fully exploit shared memory multi-processors, it is necessary to have at least one process per processor.  Sybase has since moved to hosting DBMS threads over potentially multiple OS processes to avoid this limitation.  When DBMS threads run within multiple processes, there will be times when one process has the bulk of the work and other processes (and therefore processors) are idle.  To make this model work well under these circumstances, DBMSs must implement thread migration between processes. Informix did an excellent job of this starting with the Version 6.0 release.  All current generation systems supporting this model implement a DBMS thread scheduler that schedules DBMS Workers over multiple OS processes to exploit multiple processors.

                                                            ii.      DBMS threads scheduled on OS Threads:  Microsoft SQL Server supports this model as a non-default option.  By default, SQL Server runs in the DBMS Workers multiplexed over a thread pool model (described below).  This SQL Server option, called Fibers, is used in some high scale transaction processing benchmarks but, otherwise, is in very light use.

3.      Process/Thread Pool: In this model DBMS workers are multiplexed over a pool of processes.  As OS thread support has improved, a second variant of this model has emerged based upon a thread pool rather than a process pool.  In this latter model, DBMS workers are multiplexed over a pool of OS threads:

a.       DBMS workers multiplexed over a process pool: This model is much more efficient than process per DBMS worker, is easy to port to operating systems without good OS thread support, and scales very well to large numbers of users.  This is the optional model supported by Oracle and the one they recommend for systems with large numbers of concurrently-connected users.  The Oracle default model is process per DBMS worker.  Both of the options supported by Oracle are easy to support on the vast number of different operating systems they target (at one point Oracle supported over 80 target operating systems).

b.      DBMS workers multiplexed over a thread pool: Microsoft SQL Server defaults to this model and well over 99% of SQL Server installations run this way. To efficiently support tens of thousands of concurrently connected users, SQL Server optionally supports DBMS threads scheduled on OS threads. (A minimal sketch of the thread-pool multiplexing pattern follows this list.)
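Here's the minimal sketch of the thread-pool model (3.b above) promised in the list: many client requests are queued and multiplexed over a small, fixed pool of OS threads rather than each getting a dedicated process or thread. This is a generic illustration, not any particular DBMS's implementation.

# Minimal sketch of DBMS workers multiplexed over a thread pool: many client
# requests share a small, fixed set of OS threads (generic illustration only).
import queue, threading

request_queue = queue.Queue()
POOL_SIZE = 4                          # far fewer threads than connected clients

def worker_loop():
    while True:
        client_id, sql = request_queue.get()
        # ... parse, optimize, and execute 'sql' on behalf of client_id ...
        print(f"thread {threading.current_thread().name} served client {client_id}")
        request_queue.task_done()

for _ in range(POOL_SIZE):
    threading.Thread(target=worker_loop, daemon=True).start()

for client in range(100):              # 100 concurrent requests, 4 threads
    request_queue.put((client, "SELECT ..."))
request_queue.join()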

 

Most current generation commercial DBMSs support intra-query parallelism: the ability to execute all or parts of a query in parallel. Essentially, intra-query parallelism is the temporary assignment of multiple DBMS workers to execute a single SQL query. The underlying process model is not impacted by this feature other than that a single client connection may, at times, have more than a single DBMS worker.

 

Process model selection has a substantial influence on DBMS scaling and portability. As a consequence, three of the most successful commercial systems each support more than one process model across their product line. From an engineering perspective, it would clearly be much simpler to employ a single process model across all operating systems and at all scaling levels. But, due to the vast diversity of usage patterns and the non-uniformity of the target operating systems, each DBMS has elected to support multiple models.

 

                                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | Msft internal blog: msblogs/JamesRH

 

Friday, November 30, 2007 5:48:30 AM (Pacific Standard Time, UTC-08:00)
Software
 Monday, November 26, 2007

There are few things we do more important than interviewing and yet it's done very unevenly across the company.  Some are amazing interviewers and others, well, I guess they write good code or something else :-).

 

Fortunately, interviewing can be learned and, whatever you do and wherever you do it, interviewing with insight pays off. Some time back, a substantial team was merged with the development group I lead and, as part of the merger, we gained a bunch of senior folks, many of whom do As Appropriate (AA) interviews and all of whom were frequent contributors on our interview loops. The best way to get in sync on interviewing techniques and leveling is to talk about it, so I brought us together several times to talk about interviewing, to learn from each other, and to set some standards on how we're going to run our loops. In preparation for those meetings, I wrote up some notes on what I view as best practices for AAs, but these apply to all interviewers, and I typically send them out whenever I join a team.

 

Some of these are specific to Microsoft but many apply much more broadly. There is some internal Microsoft jargon used in the doc. For example, at Microsoft "AA" is short for "As Appropriate" and refers to the final decision maker on whether an offer will be made. However, most of what is here is company invariant.

 

The doc: JamesRH_AA_Interview_NotesX.doc (37 KB).

 

I also pulled some of the key points into a ppt: AA Interview NotesX.ppt (156.5 KB).

 

                                    --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | Msft internal blog: msblogs/JamesRH

 

Monday, November 26, 2007 5:44:24 AM (Pacific Standard Time, UTC-08:00)
Process
 Friday, November 23, 2007

Ted Wobber (Microsoft Research) brought together the following short list of SSD performance data.  Note the FusionIO part claiming 87,500 IOPS in a 640 GB package.  I need to run a perf test against that part and see if it's real.  It looks perfect for very hot OLTP workloads.

 

A directory of “fastest SSDs”:

http://www.storagesearch.com/ssd-fastest.html

                Note that this contains RAM SSDs as well as flash SSDs.  This list, however, seems to be ranked by bandwidth, not IOPS.

 

This manufacturer makes a very high-end database accelerator:

http://www.stec-inc.com/technology/

Among the things they most likely do: logical address re-mapping, heavy over-provisioning of free space, and highly parallel operations.

 

Then there are these guys who do the hard work in the host OS:

http://managedflash.com/home/index.htm

They clearly do logical address re-mapping, but their material is strangely devoid of mention of cleaning costs.  Perhaps they get “free” hints from the OS free block table.

 

Nevertheless, the following article is worth reading:

http://mtron.easyco.com/news/papers/easyco-flashperformance-art.pdf

 

FusionIO: http://fusionio.com/.

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | Msft internal blog: msblogs/JamesRH

 

Friday, November 23, 2007 6:21:26 AM (Pacific Standard Time, UTC-08:00)
Hardware
 Tuesday, November 20, 2007

Google has been hiring networking hardware folks so it’s been long speculated that they are building their own network switches.  This remains speculation only but the evidence is mounting:

 

From http://www.nyquistcapital.com/2007/11/16/googles-secret-10gbe-switch/ (Sent my way by James Depoy of the OEM team and Michael Nelson of SQL Server):

Through conversations with multiple carrier, equipment, and component industry sources we have confirmed that Google has designed, built, and deployed homebrewed 10GbE switches for providing server interconnect within their data centers.

We believe Google based their current switch design on Broadcom’s (BRCM) 20-port 10GE switch silicon (BCM56800) and SFP+ based interconnect. It is likely that Broadcom’s 10GbE PHY is also being employed. This would be a repeat of the same winner-take-all scenario that played out in 1GbE interconnect.

 

This article attempts to track Google consumption by tracking shipments of 20-port 10GbE silicon, which is not a bad approach. By their determination, Google is installing 5,000 ports a month. If you assume that server connections dominate over inter-switch connections, the implication is that Google is installing nearly 5k servers per month.

 

                                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | Msft internal blog: msblogs/JamesRH

Tuesday, November 20, 2007 8:38:39 AM (Pacific Standard Time, UTC-08:00)
Services
 Monday, November 19, 2007

Last week I attended and presented at the USENIX LISA conference (http://www.usenix.org/event/lisa07/). I presented Designing and Deploying Internet-Scale Applications and the slides are at: PowerPoint slides.

 

I particularly enjoyed Andrew Hume's (AT&T) talk on the storage sub-systems used at AT&T Research, the data error rates he's been seeing over the last several decades, and what he does about them. His experience exactly parallels mine, with more solid evidence, and can be summarized as: all layers in the storage hierarchy produce errors, and the only way to store data for the long haul is with redundancy coupled with end-to-end error detection. I enjoyed the presentations of Shane Knapp and Avleen Vig of Google in that they provided a small window into how Google takes care of its ~10^6 servers with a team of 30 or 40 hardware engineers world-wide, the software tools they use to manage the world's biggest fleet, and the release processes used to manage these tools. Guido Trotter, also of Google, talked about how Google IT (not the production systems) uses Xen and DRBD to build highly reliable IT systems. He used DRBD (http://www.drbd.org/download.html) to do asynchronous, block-level replication between a primary and a secondary; the workload runs in a Xen virtual machine and, on failure, is restarted on the secondary. Ken Brill, Executive Director of the Uptime Institute, made a presentation focused on power being the problem: ignore floor space cost, system density is not the issue, it's a power problem. He's right, and it's becoming increasingly clear each year.
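A minimal sketch of what "redundancy coupled with end-to-end error detection" means in practice: store a checksum with every block, verify it on every read, and repair from a redundant copy when verification fails. This is a generic illustration, not Andrew's system.

# End-to-end error detection sketch: checksum on write, verify on read,
# repair from a redundant copy when the check fails (generic illustration).
import hashlib

def write_block(data: bytes) -> bytes:
    return hashlib.sha1(data).digest() + data         # checksum travels with the data

def read_block(stored: bytes, replica: bytes) -> bytes:
    checksum, data = stored[:20], stored[20:]
    if hashlib.sha1(data).digest() == checksum:
        return data                                   # block verified end to end
    # Any layer (disk, controller, bus, filesystem) may have corrupted it;
    # fall back to the redundant copy and verify that one too.
    checksum, data = replica[:20], replica[20:]
    assert hashlib.sha1(data).digest() == checksum, "both copies corrupt"
    return data

stored = write_block(b"some payload")
print(read_block(stored, stored))                     # b'some payload'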

 

 My rough notes from the sessions I attended are at:  JamesRH_Notes_USENIXLISA2007x.docx (21.03 KB).

 

                                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | Msft internal blog: msblogs/JamesRH

 

Monday, November 19, 2007 7:06:36 AM (Pacific Standard Time, UTC-08:00)
Services
 Friday, November 16, 2007

Three weeks ago I presented at HPTS (http://www.hpts.ws/index.html). HPTS is an invitational conference, held every two years since 1985 in Asilomar, California, that brings together researchers, implementers, and users of high-scale transaction processing systems. It's one of my favorite conferences in that it attracts a very interesting group of people, is small enough that everyone can contribute, and there is lots of informal discussion in a great environment on the ocean near Monterey.

 

I presented Modular Data Center Design and Designing and Deploying Internet-Scale Services.  A highlight of this year’s session was a joint keynote address from David Patterson of Berkeley and Burton Smith of Microsoft.  Dave's slides are posted at DavidPattersonTechTrends2007.ppt (442.5 KB).  Burton's not in the office right now so I don't have access to his but will post them when I do.

 

I’m the General Chair for the 2009 HPTS which is scheduled to be October 25 through 28, 2009.  Keep the date clear and plan on submitting an interesting position paper to get invited.  If you are doing high scale data centric applications, HPTS is always fun.

 

                                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | Msft internal blog: msblogs/JamesRH

 

Friday, November 16, 2007 5:19:45 AM (Pacific Standard Time, UTC-08:00)
Software
 Monday, November 12, 2007

For the last year or so I've been collecting scaling web site war stories and posting them to my Microsoft internal blog. I collect them for two reasons: 1) scaling web site problems all center around persistent state management, and I'm a database guy, so the interest is natural, and 2) it's amazing how frequently the same trend appears: design a central DB, move to a functional partition, then move to a horizontal partition, adding caching at various levels somewhere through that cycle. Most skip the hardware evolution step of starting with scale-up servers and then moving to scale-out clusters, but even that pattern shows up remarkably frequently (e.g., eBay and Amazon).
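For reference, here's a minimal sketch of the horizontal-partitioning step in that progression: route each key to one of N database shards by hashing the partitioning key. It's a generic illustration; real systems also need shard directories, rebalancing, and cross-shard query handling.

# Horizontal partitioning sketch: hash the partitioning key to pick a shard
# (generic illustration; real systems add directories, rebalancing, and
# cross-shard query support).
import hashlib

SHARDS = [f"db{i:02d}.example.com" for i in range(8)]   # hypothetical shard hosts

def shard_for(user_id: str) -> str:
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

print(shard_for("user-8675309"))   # every request for this user goes to the same shard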

 

Scaling web site war stories:

·         Scaling Amazon: http://glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

·         Scaling Second Life: http://radar.oreilly.com/archives/2006/04/web_20_and_databases_part_1_se.html

·         Scaling Technorati: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/

·         Scaling Flickr: http://radar.oreilly.com/archives/2006/04/database_war_stories_3_flickr.html

·         Scaling Craigslist: http://radar.oreilly.com/archives/2006/04/database_war_stories_5_craigsl.html

·         Scaling Findory: http://radar.oreilly.com/archives/2006/05/database_war_stories_8_findory_1.html

·         MySpace 2006: http://sessions.visitmix.com/upperlayer.asp?event=&session=&id=1423&year=All&search=megasite&sortChoice=&stype=

·         MySpace 2007: http://sessions.visitmix.com/upperlayer.asp?event=&session=&id=1521&year=All&search=scale&sortChoice=&stype=

·         Twitter, Flickr, Live Journal, Six Apart, Bloglines, Last.fm, SlideShare, and eBay: http://poorbuthappy.com/ease/archives/2007/04/29/3616/the-top-10-presentation-on-scaling-websites-twitter-flickr-bloglines-vox-and-more

 

Thanks to Soumitra Sengupta for sending the Flickr and PoorButHappy pointer my way and to Jeremy Mazner for sending the MySpace references.

 

                                                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | Msft internal blog: msblogs/JamesRH

 

Monday, November 12, 2007 5:22:28 AM (Pacific Standard Time, UTC-08:00)
Services

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.
