Monday, December 22, 2008

I wrote this blog entry a few weeks ago before my recent job change. It’s a look at the cost of high-scale storage and how it has fallen over the last two years, based upon the annual fully burdened cost of power in a data center and industry disk cost trends. The observations made in this post are based upon understanding these driving costs and should model any efficient, high-scale bulk storage farm. But, and I need to be clear on this, it was written prior to my joining AWS and there is no information below relating to any discussions I’ve had with AWS or how the AWS team specifically designs, deploys, or manages their storage farm.

 

When Amazon released Amazon S3, I argued that it was priced below cost at $1.80/GB/year.  At that time, my estimate of their cost was $2.50/GB/year.  The Amazon charge of $1.80/GB/year for data to be stored twice in each of two data centers is impressive. It was amazing when it was released and it remains an impressive value today. 

 

Even though the storage price was originally below cost by my measure, Amazon could still make money if they were running a super-efficient operation (likely the case). How could they make money charging less than cost for storage? Customers are charged for ingress/egress on all data entering or leaving the AWS cloud. The network ingress/egress charges from AWS are reasonable, but telecom pricing strongly rewards volume purchases, so what Amazon pays is likely much less than the AWS ingress/egress charges. This potentially allows the storage business to be profitable even when operating at a storage cost loss.

 

One concern I’ve often heard is the need to model the networking costs between the data centers since there are actually two redundant copies stored in two independent data centers. Networking, like power, is usually billed at the 95th percentile over a given period. The period is usually a month but more complex billing systems exist. The constant across most of these high-scale billing systems is that the charge is based upon peaks. What that means is that adding ingress or egress at an off-peak time is essentially free. Assuming peaks are short-lived, the sync to the other data center can be delayed until the peak has passed. If the SLA doesn’t have a hard deadline on when the sync will complete (it doesn’t), then the inter-DC bandwidth is effectively without cost. I call this technique Resource Consumption Shaping and it’s one of my favorite high-scale service cost savers.
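
To make the billing argument concrete, here is a minimal Python sketch of 95th-percentile billing and of deferring replication traffic into off-peak intervals. The function names (percentile_95, shape_sync), the nearest-rank percentile method, and the sample traffic numbers are all illustrative assumptions; they are not taken from any real carrier's billing system or from any particular storage service.

```python
def percentile_95(samples):
    """Billable rate: the 95th-percentile sample (nearest-rank method, assumed)."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def shape_sync(foreground, sync_backlog):
    """Fill off-peak troughs with deferred inter-DC replication traffic.

    foreground:   per-interval customer traffic (e.g. Mbps per 5-minute sample)
    sync_backlog: total deferred replication traffic, in the same units
    Returns (combined per-interval traffic, backlog still unsent).
    """
    cap = percentile_95(foreground)  # never lift an interval above the billed rate
    shaped = []
    for load in foreground:
        headroom = max(0, cap - load)
        sent = min(headroom, sync_backlog)
        sync_backlog -= sent
        shaped.append(load + sent)
    return shaped, sync_backlog


# Hypothetical billing period: 10 quiet intervals, 9 busy ones, 1 short spike.
foreground = [20] * 10 + [80] * 9 + [100]
shaped, leftover = shape_sync(foreground, sync_backlog=500)

print("billed rate without sync:", percentile_95(foreground))
print("billed rate with shaped sync:", percentile_95(shaped))
print("replication left unsent:", leftover)
```

Because the deferred traffic only ever fills intervals up to the rate that was already being billed, the 95th-percentile charge does not move. The sketch ignores the case where the backlog is larger than the available off-peak headroom, which is where a hard sync deadline would start to cost money.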

 

What is the cost of storage today in an efficient, commodity bulk-storage service? Building upon the models in the cost of power in large-scale data centers and the annual fully burdened cost of power, here’s the model I use for cold storage with current data points:

Note that this is for cold storage and I ignore the cost of getting the data to or from the storage farm. You need to pay for the networking you use. Again, since it’s cold storage, the model assumes you can use 80% of the disk, which wouldn’t be possible for data with high I/O rates per GB. And we’re using commodity 1TB SATA disks that only consume 10W of power each. This is a cold storage model. If you are running higher I/O rates, figure out what percentage of the disk you can successfully use and update the model in the spreadsheet (ColdStorageCost.xlsx (13.86 KB)). If you are using higher-power enterprise disks, you can update the model to use roughly 15W for each.

Update: Bryan Apple found two problems with the spreadsheet that have been corrected in the linked spreadsheet above. Ironically, the resulting fully burdened cost/GB/year is unchanged. Thanks Bryan.

For administration costs, I’ve used a fixed, fairly conservative factor of a 10% uplift on all other operations and administration costs. Most large-scale services are better than this and some are more than twice as good, but I included the conservative 10% number.

 

Cold storage with 4x copies at high-scale can now be delivered at $0.80/GB/year. It’s amazing what falling server prices and rapidly increasing disk sizes have done. But it’s actually pretty hard to do, and I’ve led storage-related services that didn’t get close to this level of efficiency. I still think that Amazon S3 is a bargain.

 

Looking at the same model but plugging in numbers from about two years ago shows how fast we’re seeing storage costs plunge. Using $2,000 servers rather than $1,200, server power consumption at 250W rather than 160W, disk size at ½ TB rather than 1TB, and disk cost at $250 rather than $160 yields an amazingly different $2.40/GB/year.
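
The shape of this model can be sketched in a few lines of Python. This is a rough approximation under stated assumptions, not the linked spreadsheet: the server and disk prices, wattages, disk sizes, 80% utilization, 4x copies, and 10% administration uplift come from the post, and the 12-disk chassis comes from the comment thread below. The amortization period, the fully burdened cost of power per watt-year, and the choice to count disk power on top of server power are placeholders I have assumed, so the absolute results should not be expected to reproduce the spreadsheet's figures.

```python
from dataclasses import dataclass


@dataclass
class ColdStorageModel:
    server_cost: float = 1200.0      # $ per server (post)
    server_watts: float = 160.0      # W per server (post)
    disks_per_server: int = 12       # 12-disk chassis (comment thread)
    disk_cost: float = 160.0         # $ per disk (post)
    disk_watts: float = 10.0         # W per commodity SATA disk (post)
    disk_gb: float = 1000.0          # 1TB disks (post)
    utilization: float = 0.80        # usable fraction of each disk (post)
    copies: int = 4                  # redundant copies stored (post)
    admin_uplift: float = 0.10       # administration uplift (post)
    amortization_years: float = 3.0  # ASSUMPTION: not stated in the post
    power_cost_per_watt_year: float = 2.12  # ASSUMPTION: placeholder $/W/year

    def cost_per_gb_year(self) -> float:
        """Annual cost per GB of customer data, all copies included."""
        capital = self.server_cost + self.disks_per_server * self.disk_cost
        capital_per_year = capital / self.amortization_years
        # ASSUMPTION: disk power is drawn in addition to the server's own power.
        watts = self.server_watts + self.disks_per_server * self.disk_watts
        power_per_year = watts * self.power_cost_per_watt_year
        total_per_year = (capital_per_year + power_per_year) * (1 + self.admin_uplift)
        raw_gb = self.disks_per_server * self.disk_gb
        customer_gb = raw_gb * self.utilization / self.copies
        return total_per_year / customer_gb


now = ColdStorageModel()
two_years_ago = ColdStorageModel(
    server_cost=2000, server_watts=250, disk_cost=250, disk_gb=500
)

print(f"current inputs:       ${now.cost_per_gb_year():.2f}/GB/year")
print(f"two-year-old inputs:  ${two_years_ago.cost_per_gb_year():.2f}/GB/year")
```

The useful part is the structure rather than the exact output: cost per GB scales linearly with the number of copies and inversely with disk size and usable fraction, which is why doubling disk capacity while prices fall moves the result so quickly.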

 

Cold storage with redundancy at $0.80/GB/year and still falling. Amazing.

 

                                                --jrh

 

James Hamilton

jrh@mvdirona.com

 

Monday, December 22, 2008 7:24:50 AM (Pacific Standard Time, UTC-08:00)

Monday, December 22, 2008 8:20:52 PM (Pacific Standard Time, UTC-08:00)
Cold storage with availability requirements is an ideal candidate for Erasure Resilient Coding (ERC). I would expect the cost to fall further with ERC. It would be interesting to see how the numbers change. Perhaps I can undertake such a calculation in consultation with James.
Tuesday, December 23, 2008 6:52:11 AM (Pacific Standard Time, UTC-08:00)
I 100% agree Sudipta. Erasure coding trades CPU cycles for storage density and it’s a great trade for very cold data. Most data in the enterprise is written but never looked at and much of the rest is very cold. This sounds counter-intuitive but it’s clearly the case that backups are typically not accessed. Audit logs typically aren’t looked at. Some data rarely gets read if at all. Given that, erasure coding looks pretty interesting.

This note on file access patterns is useful: http://perspectives.mvdirona.com/2008/09/28/MeasurementAndAnalysisOfLargeScaleNetworkFileSystemWorkloads.aspx.


I’ve been interested in dynamic systems where the data is stored multi-copy mirrored if accessed frequently and erasure encoded if not. Move up if accessed recently and down to less redundant encodings if not.
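
As a rough illustration of that trade, the following Python sketch compares the raw storage consumed by full copies against a k+m erasure code, and picks an encoding from how recently an object was read. The 6+3 code parameters, the 30-day threshold, and the function names are hypothetical choices made for the example; they do not describe any deployed system.

```python
def replication_overhead(copies: int) -> float:
    """Raw bytes stored per byte of user data when keeping full copies."""
    return float(copies)


def erasure_overhead(data_fragments: int, parity_fragments: int) -> float:
    """Raw bytes stored per user byte for a k data + m parity erasure code."""
    return (data_fragments + parity_fragments) / data_fragments


def choose_encoding(days_since_access: int, cold_after_days: int = 30) -> str:
    """Keep recently read data mirrored; demote colder data to erasure coding.

    cold_after_days is a hypothetical policy threshold.
    """
    return "mirrored" if days_since_access < cold_after_days else "erasure"


print("4 full copies:", replication_overhead(4), "x raw storage")
print("6+3 erasure code:", erasure_overhead(6, 3), "x raw storage")
print("read yesterday ->", choose_encoding(1))
print("read 90 days ago ->", choose_encoding(90))
```

With a maximum-distance-separable code such as Reed-Solomon, a 6+3 layout survives the loss of any three fragments, the same number of losses that four full copies tolerate, at well under half the raw storage. The price is the CPU and I/O needed to reconstruct data on a degraded read.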

--jrh

Sunday, December 28, 2008 9:37:22 PM (Pacific Standard Time, UTC-08:00)
Dear James Hamilton:

Where do you get servers that cost $1200 and can run 12 cheap disks? I've been searching for just such a thing for the little storage startup that I work for -- allmydata.com -- and the closest I found was the Hewlett-Packard DL 185 G5, for something on the order of $2100.00. However, it turns out that Hewlett-Packard really doesn't want us to stock that one with $160.00 hard drives that we buy from Newegg or the like, and would rather sell us $750.00 hard drives, and so they won't sell us the drive sleds to slot drives into the DL 185 G5.

By the way, we do use erasure coding. I've written a short paper about the general scheme: lafs.pdf

Thanks,

Zooko Wilcox-O'Hearn
Monday, December 29, 2008 7:29:00 AM (Pacific Standard Time, UTC-08:00)
I enjoyed the erasure coding paper Zooko.

HP does have some reasonably cost effective servers. For example, the DL-360G5p is $1200. If they don’t want to support you running large direct attached disk arrays off that server, find another hardware vendor. Rackable Systems will have no trouble providing the config you want nor would Silicon Mechanics. Drop me a note if you have trouble chasing down either.

Thanks for the comment.

--jrh
jrh@mvdirona.com
Monday, December 29, 2008 7:59:49 AM (Pacific Standard Time, UTC-08:00)
James - nice work, thanks. I have 2 questions:

1) Would you please tell me why the Server Power Cost (line 21) does not include the Fully Burdened Cost of Power? This appears to be an error. Also, it seems you actually calculate Redundancy+1 by summing lines 17 and 18 into the subtotal. The fascinating thing about this is that these two errors cancel each other out, producing the correct result ($0.80/GB/year)!

2) In your model Storage Redundancy produces a “net” cost. By this I mean if you pay $0.80/GB/year you actually get to consume 4x that amount of storage. Without arguing what the correct factor should be, is this your way of internalizing unit redundancy (e.g. some RAID configuration on the server), or are you suggesting that this redundancy applies at a system level (e.g. data mirrored between physically separate data centers)? And if the latter, despite “Resource Consumption Shaping”, to be fair I think you need to include network infrastructure and transit, or at least consider the special nature of redundancy that only delivers value when properly managed (another cost) and only during non-peak operations.



Bryan Apple
Tuesday, December 30, 2008 2:29:54 PM (Pacific Standard Time, UTC-08:00)
Good catch Bryan! The net is still $0.80/GB/year but it's an important correction so I'll update the blog entry. Thanks.

On your second point, you were asking if I mirrored between servers in a data center or between different data centers. If the former, there are no additional egress/ingress charges. If the latter, you can use resource consumption shaping as you mentioned and as I noted in the third paragraph above. If you take that latter approach, there are no additional egress/ingress charges. If you need synchronous redundancy, then you need to add an inter-data center networking charge for the replication. I favor not paying for additional networking charges so they are not included.

Thanks again for the careful read.

James Hamilton
jrh@mvdirona.com
Monday, January 05, 2009 9:24:47 AM (Pacific Standard Time, UTC-08:00)
Very interesting article, why xlsx? Not worth downloading.
Anon
Monday, January 05, 2009 12:17:22 PM (Pacific Standard Time, UTC-08:00)
Anon, you asked why the xlsx? Basically, I find Excel useful for this sort of thing. If xls is easier for you, drop me a note and I'll send you the spreadsheet in that format.

--jrh
jrh@mvdirona.com
Monday, January 05, 2009 12:57:10 PM (Pacific Standard Time, UTC-08:00)
Kinda makes sense to me. Not bad dude!

jess
www.web-privacy.pro.tc
Jenny Woodson
Monday, January 05, 2009 12:58:39 PM (Pacific Standard Time, UTC-08:00)
If you are doing replication at the level of your global filestore then you don't need internal replication (RAID) within the servers and can even set up your drives as JBOD so that the spindles are totally independent (streaming will be limited to single-spindle streaming speed, but random I/O will be faster). That will cut your storage costs in half compared to RAID10 and you'll get better random I/O for variable-size reads/writes than RAID5.

And if the datacenters are hooked together with dark fiber on MAN, aren't the costs simply the lease of the fiber and you get the full bandwidth without incurring bandwidth costs? (although clearly at some point you need to buy more dark fiber).
Lamont
Monday, January 05, 2009 2:11:13 PM (Pacific Standard Time, UTC-08:00)
I agree on the RAID comment. If doing replication between independent file stores, there is no need for local replication. I’m not the biggest fan of RAID5 for most workloads.

Many redundancy configurations work well. Two I've seen in use: 1) 2 servers in different data centers with 2 copies each (mirrored x2), and 2) 3 servers in a single data center each with one copy (single x3). In the blog posting, I modeled the cost of the servers and storage but not the cost of inter-data center traffic if you choose a geo-redundant solution. See the earlier comments and answers discussing approaches to cross data center redundancy.

You asked if inter-data center traffic couldn't be made very cheap using dedicated fiber. I wish :-). Dark fiber does exist but is far from free, and lighting it up (communications equipment on both ends and local connectivity to the fiber ends) is even more expensive. Generally, long haul communications costs, whether on dedicated links or through good quality shared links, aren't cheap. The good news is that, for applications that aren't media-based (video being the usual media example), egress costs are normally less than server, infrastructure, and power costs.

--jrh
jrh@mvdirona.com
Monday, January 05, 2009 4:12:56 PM (Pacific Standard Time, UTC-08:00)
"Audit logs typically aren’t looked at."

To be PCI DSS compliant, audit logs should be closely monitored. Many enterprises are now moving in this direction which would counter the ERC argument.
mycall
Monday, January 05, 2009 6:14:24 PM (Pacific Standard Time, UTC-08:00)
Could you provide an example of a $1200 server that will take 12 drives? I'm having trouble finding even a JBOD enclosure for that price. Silicon Mechanics' Storform D53J, for example, is nearly twice that. You mention the HP, but that would require an additional expense for the enclosure, and the drive capital cost looks too low to include that.
Phil R
Tuesday, January 06, 2009 6:25:50 AM (Pacific Standard Time, UTC-08:00)
The model allocates $1,200 for a server and $1,920 for a fully populated 12-disk chassis: essentially a SAS-attached sled holding 12 commodity SATA disks. The spreadsheet is there if you want to work through different numbers or change redundancy levels.

--jrh
jrh@mvdirona.com

Tuesday, January 06, 2009 7:55:48 AM (Pacific Standard Time, UTC-08:00)
Mycall, in a comment from yesterday, you said: To be PCI DSS compliant, audit logs should be closely monitored. Many enterprises are now moving in this direction which would counter the ERC argument.

I hear you and I’m sure you are right that some audit logs actually are diligently looked at, but a remarkably large number of files of all types never get re-opened once they are closed. See “Measurement and Analysis of Large Scale Network File System Workloads” (http://perspectives.mvdirona.com/2008/09/28/MeasurementAndAnalysisOfLargeScaleNetworkFileSystemWorkloads.aspx) for more data.

--jrh, jrh@mvdirona.com
Tuesday, January 06, 2009 10:04:02 AM (Pacific Standard Time, UTC-08:00)
I guess then it's the $1920 for a populated chassis I'm having trouble with. Silicon Mechanics, for example, sells a populated 12x1TB JBOD (D53J) for $4500. Even assuming the drives are sourced from a cheaper supplier, the cost of the unit alone is $2400. Even with $100 drives, the capital cost per drive is then $300, and the overall cost/GB/year is $1.08. I would be delighted to be wrong about this.
Phil R
Friday, January 09, 2009 2:55:55 PM (Pacific Standard Time, UTC-08:00)
Those numbers are doable. Rackable Systems, for example, can do 12 1TB disks in a tray-type enclosure for $1,800 + cables complete.

--jrh
james@amazon.com
Tuesday, January 27, 2009 9:44:22 AM (Pacific Standard Time, UTC-08:00)
I am completely ignorant about storage. But why can't you host 100 commodity 1TB hard drives on one $500 gigabit-compatible commodity desktop box and then keep the drives completely offline and powered off, switching them on and off mechanically? This way, for a customer buying 1TB of the cheapest possible backup storage, it'd be something like $0.05/GB/year. Basically, you assume those 100 customers only have a combined few hundred megabit/s going through that one CPU and the gigabit LAN while their files are on that server, so the commodity 1TB hard drives are only connected and powered when they are needed. Couldn't you save 99% of power consumption and overhead this way when providing the cheapest possible low-popularity cloud storage?
Tuesday, January 27, 2009 8:39:58 PM (Pacific Standard Time, UTC-08:00)
Your design suggestion is a good one Charbax. The downside is that the time to bring the storage back online is fairly long, but it’ll still be better than most tape solutions. Perfectly fine for many cold storage applications. The only other potential issue is that folks speculate disks need to be spun up at some interval, perhaps weekly, to avoid spindle lubrication issues. But I’ve never seen data to support this so it may not be an issue.

I like your cold storage suggestion: very low cost hardware like Microslice Servers (http://perspectives.mvdirona.com/2009/01/23/MicrosliceServers.aspx) coupled with powered down disks.

--jrh

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

All Content © 2014, James Hamilton