Saturday, April 18, 2009

In Where SSDs Don't Make Sense in Server Applications, we looked at the results of an HDD to SSD comparison test done by the Microsoft Cambridge Research team. Vijay Rao of AMD recently sent me a pointer to an excellent comparison test done by AnandTech. In SSD versus Enterprise SAS and SATA disks, AnandTech compares one of my favorite SSDs, the Intel X25-E SLC 64GB, with a couple of good HDDs. The Intel SSD can deliver 7,000 random IOPS and the 64GB component is priced in the $800 range.

 

The full AnandTech comparison is worth reading but I found the pricing along with the sequential and random I/O performance data particularly interesting. I've brought this data together into the table below:

 

Drive | Capacity | Pricing | $/GB | $/Seq Read ($/MB/s) | $/Seq Write ($/MB/s) | Seq I/O Density (MB/s/GB) | $/Rdm Read ($/MB/s) | $/Rdm Write ($/MB/s) | Rdm I/O Density (MB/s/GB)
Intel X25-E SLC | 64GB | $795-$900 | $13.24 | $3.28 | $4.28 | 3.563 | $17.66 | $9.02 | 1.109
Cheetah 15k | 300GB | $270-$300 | $0.95 | $2.28 | $2.24 | 0.420 | $142.50 | $57.00 | 0.012
WD 1000FYPS | 1TB | $190-$200 | $0.20 | $2.71 | $2.50 | 0.075 | $195.00 | $65.00 | 0.002

 

Notes:

 

All I/O measurements obtained using SQLIO

Random I/O measurements using 8k pages

Sequential measurements using 64kB I/Os

I/O density is the average of read and write performance (MB/s) divided by capacity (GB)

Price calculations are based upon the average of the selling price range listed.

Source: Anandtech (http://it.anandtech.com/IT/showdoc.aspx?i=3532&p=1)

 

Looking at this data in detail, we see the Intel SSD produces extremely good random I/O rates but we should all know that raw performance is the wrong measure. We should be looking at dollars per unit of performance. By this more useful metric, the Intel SSD continues to look very good at $17.66 per MB/s on 8K random reads, whereas the HDDs come in at $142.50 and $195.00 per MB/s respectively. For hot random workloads, SSDs are a clear win.

 

What do I mean by "hot random workloads"? By hot, I mean a high number of random IOPS per GB. But, for a given storage technology, what constitutes hot? I like to look at I/O density, which is the cutoff between a given disk with a given workload being capacity bound or I/O rate bound. For example, looking at the table above we see the random I/O density for a 64GB Intel disk is 1.109 MB/s/GB. If you are storing data where you need 1.109 MB/s of 8K I/Os per GB of capacity or better, then the Intel device will be I/O bound and you won't be able to use all the capacity. If the workload requires less than this number, then it is capacity bound and you won't be able to use all the IOPS on the device. For very low access rate data, HDDs are a win. For very high access rate data, SSDs will be a better price performer.
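
As a concrete illustration, the random I/O densities in the table can be recovered from the prices and $/MB/s figures. The short Python sketch below is mine rather than AnandTech's and simply restates the table arithmetic:

# Recover random I/O density (MB/s per GB) from the table above.
# Throughput = average selling price / ($ per MB/s).
drives = [
    # name,            avg price,  $/MB/s rd,  $/MB/s wr,  capacity GB
    ("Intel X25-E SLC",   847.50,      17.66,       9.02,         64),
    ("Cheetah 15k",       285.00,     142.50,      57.00,        300),
    ("WD 1000FYPS",       195.00,     195.00,      65.00,       1000),
]

for name, price, rd_cost, wr_cost, cap_gb in drives:
    rd_mb_s = price / rd_cost                       # random 8K read MB/s
    wr_mb_s = price / wr_cost                       # random 8K write MB/s
    density = ((rd_mb_s + wr_mb_s) / 2.0) / cap_gb  # MB/s per GB
    print("%-16s %.3f MB/s per GB" % (name, density))

A workload that needs more MB/s per GB than a drive's density will be I/O bound on that drive; one that needs less will be capacity bound.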

 

As it turns out, when looking at random I/O workloads, SSDs are almost always capacity bound and HDDs are almost always IOPS bound. Understanding that, we can use a simple computation to compare HDD cost vs SSD cost on your workload. Take the HDD farm cost, which will be driven by the number of disks needed to support the I/O rate times the cost of the disk. This is the storage budget needed to support your workload on HDDs. Take the size of the database and divide by the SSD capacity to get the number of SSDs required. Multiply the number of SSDs required by the price of the SSD. This is the budget required to support your workload on SSDs. If the SSD budget is less (and it will be for hot, random workloads), then SSDs are a better choice. Otherwise, keep using HDDs for that workload.
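
Expressed as a short sketch, the comparison looks like this. The drive prices and throughput numbers are taken or derived from the table above; the workload (a 1TB database needing 100,000 random 8K IOPS) is hypothetical and only illustrates the method:

import math

# Hypothetical hot random workload.
db_size_gb = 1000.0
required_iops = 100000.0
required_mb_s = required_iops * 8 / 1024.0   # ~781 MB/s of random 8K I/O

# Cheetah 15k: ~3.5 MB/s of random 8K I/O per drive (derived from the table), ~$285.
# Intel X25-E: 64GB per drive, ~$847.50.
hdd_price, hdd_random_mb_s = 285.00, 3.5
ssd_price, ssd_capacity_gb = 847.50, 64.0

# HDDs are sized by the I/O rate; SSDs are sized by capacity.
hdd_count = int(math.ceil(required_mb_s / hdd_random_mb_s))
ssd_count = int(math.ceil(db_size_gb / ssd_capacity_gb))

print("HDD budget: $%.0f (%d drives)" % (hdd_count * hdd_price, hdd_count))
print("SSD budget: $%.0f (%d drives)" % (ssd_count * ssd_price, ssd_count))

For this particular workload the SSD farm is the cheaper of the two; drop the required IOPS or grow the database and the answer flips back to HDDs.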

 

In the sequential I/O world, we can use the same technique. Again, we look at the sequential I/O density to understand the cutoff between bandwidth bound and capacity bound for a given workload. Very hot workloads over small data sizes will be a win on SSD but, as soon as the data sizes get interesting, HDDs are a more economic solution for sequential workloads. The detailed calculation is the same. Figure out how many HDDs are required to support your workload on the basis of capacity or sequential I/O rates (depending upon which is in shortest supply for your workload on that storage technology). Figure out the HDD budget. Then do the same for SSDs and compare the numbers. What you'll find is that, for sequential workloads, SSDs are only the best value for very high I/O rates over relatively small data sizes.

 

Using these techniques and data we can see when SSDs are a win for workloads with a given access pattern. I've tested this line of thinking against many workloads and find that hot, random workloads can make sense on SSDs. Pure sequential workloads almost never do unless the access patterns are very hot or the capacity required is relatively small.

 

For specific workloads that are neither pure random nor pure sequential, we can figure out the storage budget to support the workload on HDDs and on SSDs as described above and do the comparison.  Using these techniques, we can step beyond the hype and let economics drive the decision.

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Saturday, April 18, 2009 10:19:30 AM (Pacific Standard Time, UTC-08:00)  #    Comments [5] - Trackback
Hardware
 Tuesday, April 14, 2009

My notes from an older talk done by Ryan Barrett on the Google App Engine Datastore at Google I/O last year (5/28/2008). Ryan is a co-founder of the App Engine team.

 

·         App Engine Datastore is built on BigTable.

o   Scalable structured storage

o   Not a sharded database

o   Not an RDBMS (MySQL, Oracle, etc.)

o   Not a Distributed Hash Table (DHT)

o   It IS a sharded sorted array

·         Supported operations:

o   Read

o   Write

o   Delete

o   Single row transactions (optimistic concurrency control).

o   Scans:

1.       Prefix scan

2.       Range scan

·          Primary object: Entity

o   Stored in entity table

o   Each row has a name and the row name is fully qualified /root/parent/entity/child

o   Each entity has a parent or is a root entity and may have child entities

o   Primary key is the fully qualified name and this can’t change

o   An entity can’t be reparented (it can be deleted and created with a different parent)

·         Queries:

o   Queries can be filtered on kind and Ryan says kind “is like a table” (kind can be parent, child, grandparent, …)

o   Queries can be filtered on ancestor

o   Query language is GQL (presumably Google Query Language) which is a small subset of SQL

o   All queries must be expressible as range or prefix scans (no sort, orderby, or other unbounded size operations supported); see the short example after these notes

·         Secondary index implementation:

o   Indexes are also implemented as BigTable tables

o   Kind Index:

·         Contents: (kind, key)

o   Single property index:

·         Contents: (kind, name, value)

·         Two copies of this index maintained: 1) ascending, and 2) descending

o   Composite indexes:

·         Contents: (kind, value, value)

·         Supports multi-property indexes

·         Built on programmer request but not on use (query returns an error if the required index doesn't exist)

·         Programmer can specify what composite indexes are needed in index.yaml

·         SDK creates composite index specs automatically in index.yaml as queries are run

·         Entity group

o   Supports multi-entity update

·         Defined by root entity (all entities under a root are an entity group)

·         All journaling and transactions done at root

·         Text and Blobs:

o   Not indexed. All other properties are
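
To make these notes a bit more concrete, here is a minimal sketch of how the concepts above surfaced in the Python SDK of the day. It is written from memory as an illustration rather than taken from Ryan's talk, and the model names are hypothetical:

from google.appengine.ext import db

class Guestbook(db.Model):              # root entity; defines an entity group
    pass

class Greeting(db.Model):               # child entity; kind is "Greeting"
    author = db.StringProperty()
    content = db.TextProperty()         # Text property: not indexed
    date = db.DateTimeProperty(auto_now_add=True)

book = Guestbook(key_name='default')
book.put()

# The child's key is the fully qualified path /Guestbook:default/Greeting:<id>
Greeting(parent=book, author='jrh', content='hello').put()

# GQL query filtered on kind and ancestor; it is served as a range/prefix
# scan over the index tables described above.
query = db.GqlQuery("SELECT * FROM Greeting WHERE ANCESTOR IS :1 "
                    "ORDER BY date DESC", book.key())
for greeting in query:
    print(greeting.author)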

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Tuesday, April 14, 2009 5:28:35 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Sunday, April 12, 2009

All new technologies go through an early phase when everyone is initially convinced the technology can't work. Then, for those that actually do solve interesting problems, they get adopted for some workloads and head into the next phase. In the next phase, people see the technology actually works well for some workloads and they generalize this outcome to a wider class of workloads. They get convinced the new technology is the solution for all problems. Solid State Disks (SSDs) are now clearly in this next phase.

 

Well-intentioned people are arguing emphatically that SSDs are great because they are "fast". For the most part, SSDs actually are faster than disks in random reads, random writes, and sequential I/O. I say "for the most part" since some SSDs have been incredibly bad at random writes. I've seen sequential write rates as low as ¼ that of magnetic HDDs but Gen2 SSD devices are now far better. Good devices are now delivering faster-than-HDD results across random read, write, and sequential I/O. It's no longer the case that SSDs are "only good for read-intensive workloads".

 

So, the argument that SSDs are fast is now largely true but "fast" really is a misleading measure. Performance without reference to cost has no value. What we need to look at is performance per unit cost. For example, SSD sequential access performance is slightly better than most HDDs but the cost per MB/s is considerably higher. It's cheaper to obtain sequential bandwidth from multiple disks than from a single SSD. We have to look at performance per unit cost rather than just performance. When you hear a reference to performance as a one-dimensional metric, you're not getting a useful engineering data point.

 

When do SSDs win when looking at performance per dollar on the server? Server workloads requiring very high IOPS rates per GB are more cost effective on SSDs. Online transaction systems such as reservation systems, many ecommerce systems, and anything with small, random reads and writes can run more cost effectively on SSDs. Some time back I posted When SSDs make sense in server applications and the partner post When SSDs make sense in client applications. What I was looking at is where SSDs actually do make economic sense. But, with all the excitement around SSDs, some folks are getting a bit over-exuberant and I've found myself in several arguments where smart people are arguing that SSDs make good economic sense in applications requiring sequential access to sizable databases. They don't.

 

It’s time to look at where SSDs don’t make sense in server applications.  I’ve been intending to post this for months and my sloth has been rewarded.  The Microsoft Research Cambridge team recently published Migrating Server Storage to SSDs: Analysis of Tradeoffs and the authors save me some work by taking this question on. In this paper the authors look at three large server-side workloads:

1.       5000 user Exchange email server

2.       MSN Storage backend

3.       Small corporate IT workload

 

The authors show that these workloads are far more economically hosted on HDDs and I agree with their argument.  They conclude:

 

…across a range of different server workloads, replacing disks by SSDs is not a cost effective option at today’s price. Depending on the workload, the capacity/dollar of SSDs needs to improve by a factor of 3 – 3000 for SSDs to replace disks. The benefits of SSDs as an intermediate caching tier are also limited, and the cost of provisioning such a tier was justified for fewer than 10% of the examined workloads

 

They have shown that SSDs don't make sense across a variety of server-side workloads; essentially, these workloads are more cost-effectively hosted on HDDs. I don't quite agree with the generalization of this argument that SSDs don't make sense on the server side for any workloads. They remain a win for very high IOPS OLTP databases but it's fair to say that these workloads are a tiny minority of server-side workloads. The right way to make the decision is to figure out the storage budget for the workload to be hosted on HDDs, compare that with the budget to support the workload on SSDs, and make the decision on that basis. This paper argues that the VAST majority of workloads are more economically hosted on HDDs.

 

Thanks to Zach Hill who sent this my way.

 

                                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Sunday, April 12, 2009 8:31:05 AM (Pacific Standard Time, UTC-08:00)  #    Comments [10] - Trackback
Hardware
 Thursday, April 09, 2009

Last week I attended the Data Center Efficiency Summit hosted by Google. You'll find four postings on various aspects of the summit at: http://perspectives.mvdirona.com/2009/04/05/DataCenterEfficiencySummitPosting4.aspx.

 

Two of the most interesting videos:

·         Modular Data Center Tour: http://www.youtube.com/watch?v=zRwPSFpLX8I&feature=channel

·         Data Center Water Treatment Plant: http://www.youtube.com/watch?v=nPjZvFuUKN8&feature=channel

 

A Cnet article with links to all the videos: http://news.cnet.com/8301-1001_3-10215392-92.html?tag=newsEditorsPicksArea.0.

 

The presentation I did on Data Center Efficiency Best Practices is up at: http://www.youtube.com/watch?v=m03vdyCuWS0

 

                                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Thursday, April 09, 2009 7:18:35 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Tuesday, April 07, 2009

In the talk I gave at the Efficient Data Center Summit, I note that the hottest place on earth over recorded history was Al Aziziyah, Libya in 1922 where 136F (58C) was indicated (see Data Center Efficiency Summit (Posting #4)). What's important about this observation from a data center perspective is that this most extreme temperature event ever is still less than the specified maximum temperatures for processors, disks, and memory. What that means is that, with sufficient air flow, outside air without chillers could be used to cool all components in the system. Essentially, it's a mechanical design problem. Admittedly this example is extreme but it forces us to realize that 100% free air cooling is possible. Once we understand that it's a mechanical design problem, then we can trade off the huge savings of higher temperatures against the increased power consumption (semiconductor leakage and higher fan rates) and potentially increased server mortality rates.

 

We've known for years that air-side economization (use of free air cooling) is possible and can limit the percentage of time that chillers need to be used. If we raise the set point in the data center, chiller usage falls quickly. For most places on earth, a 95F (35C) set point combined with free air cooling and evaporative cooling is sufficient to eliminate the use of chillers entirely.

 

Mitigating the risk of increased server mortality rates, we now have manufacturers beginning to warrant their equipment to run in more adverse conditions. Rackable Systems recently announced that CloudRack C2 will support a full warranty at 104F (40C): 40C (104F) in the Data Center. Ty Schmitt of Dell confirms that all Dell servers are warranted at 95F (35C) inlet temperatures.

 

I recently came across a wonderful study done by the Intel IT department (thanks to Data Center Knowledge): Reducing Data Center Cost with an Air Economizer.

 

In this study Don Atwood and John Miner of Intel IT take a data center module and divide it up into two rooms of 8 racks each. One room is run as a control with re-circulated air at their standard temperatures. The other room is run on pure outside air with the temperature allowed to range between 65F and 90F. If the outside temp falls below 65F, server heat is re-circulated to maintain 65F. If over 90F, then the air conditioning system is used to bring it back down to 90F. The servers ran silicon design simulations at an average utilization rate of 90% for 10 months.

 

 

The short summary is that the server mortality rates were marginally higher – it’s not clear if the difference is statistical noise or significant – and the savings were phenomenal. It’s only four pages and worth reading: http://www.intel.com/it/pdf/Reducing_Data_Center_Cost_with_an_Air_Economizer.pdf.

 

We all need to remember that higher temperatures mean less engineering headroom and less margin for error, so care needs to be shown when raising temperatures. However, it's very clear that it's worth investing in the control systems and processes necessary for high temperature operation. Big savings await and it's good for the environment.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Tuesday, April 07, 2009 2:27:45 PM (Pacific Standard Time, UTC-08:00)  #    Comments [8] - Trackback

 Sunday, April 05, 2009

Last week, Google hosted the Data Center Efficiency Summit.  While there, I posted a couple of short blog entries with my rough notes:

·         Data Center Efficiency Summit

·         Rough Notes: Data Center Efficiency Summit

·         Rough Notes: Data Center Efficiency Summit (posting #3)

 

In what follows, I summarize the session I presented and go into more depth on some of what I saw in sessions over the course of the day.

 

I presented Data Center Efficiency Best Practices at the 1pm session.  My basic point was that PUEs in the 1.35 range are possible and attainable without substantial complexity and without innovation.  Good solid design, using current techniques, with careful execution is sufficient to achieve this level of efficiency.

 

In the talk, I went through power distribution from high voltage at the property line to 1.2V at the CPU and showed cooling from the component level to release into the atmosphere. For electrical systems, the talk covered an ordered list of rules to increase power distribution efficiency:

1.       Avoid conversions (Less transformer steps & efficient or no UPS)

2.       Increase efficiency of conversions

3.       High voltage as close to load as possible

4.       Size voltage regulators (VRM/VRDs) to load & use efficient parts

5.       DC distribution potentially a small win (regulatory issues)

Looking at mechanical systems, the talk pointed out the gains to be had by carefully moving to higher data center temperatures.  Many server manufacturers including Dell and Rackable will fully stand behind their systems at inlet temperatures as high as 95F. Big gains are possible via elevated data center temperatures. The ordered list of mechanical systems optimizations recommended:

1.       Raise data center temperatures

2.       Tight airflow control, short paths, & large impellers

3.       Cooling towers rather than chillers

4.       Air-side economization & evaporative cooling

 

The slides from the session I presented are posted at: http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_Google2009.pdf.

 

Workshop Summary:

The overall workshop was excellent. Google showed the details behind 1) the modular data center they did 4 years ago, showing both the container design and that of the building that houses them, 2) the river water cooling system employed in their Belgium data center, and 3) the custom Google-specific server design.

 

Modular DC: The modular data center was a 45-container design where each container was 222KW (roughly 780W/sq ft). The containers were housed in a fairly conventional two-floor facility. Overall, it was nicely executed but all Google data centers built since this one have been non-modular and each subsequent design has been more efficient than this one. The fact that Google has clearly turned away from modular designs is interesting. My read is that the design we were shown missed many opportunities to remove cost and optimize for the application of containers. The design chosen essentially built a well executed but otherwise conventional data center shell using standard power distribution systems and standard mechanical systems. No part of the building itself was optimized for containers. Even though it was a two-level design, rather than just stacking containers, a two-floor shell was built. A 220-ton gantry crane further drove up costs but the crane was not fully exploited by packing the containers in tight and stacking them.

 

For a containerized model to work economically, the attributes of the container need to be exploited rather than merely installing them in a standard data center shell. Rather than building an entire facility with multiple floors, we would need to use a much cheaper shell if any at all. The ideal would be a design where just enough concrete is poured to mount four container mounting bolts so they can be tied down to avoid wind damage. I believe the combination of not building a full shell, the use of free air cooling, and the elimination of the central mechanical system would allow containerized designs to be very cost effective. What we learn from the Google experiment is that the combination of a conventional data center shell and mechanical systems with containers works well (their efficiency data shows it to be very good) but isn't notably better than similar design techniques used with non-containerized designs.

 

River water cooling: The Belgium river water cooled data center caught my interest when it was first discussed a year ago. The Google team went through the design in detail. Overall, it's beautiful work but it included a full water treatment plant to treat the water before using it. I like the design in that it's 100% better both economically and environmentally to clean and use river water rather than to take fresh water from the local utility. But, the treatment plant itself represents a substantial capital expense and requires energy for operation. It's clearly an innovative way to reduce fresh water consumption. However, I slightly prefer designs that depend more deeply on free air cooling and avoid the capital and operational expense of the water treatment plant.

 

Custom Server: The server design Google showed was clearly a previous generation. It's a 2005 board and I strongly suspect there exist subsequent designs at Google that haven't yet been shown publicly. I fully support this and think showing the previous generation design publicly is a great way to drive innovation inside a company while contributing to the industry as a whole. I think it's a great approach and the server that was shown last Wednesday was a very nice design.

 

The board is a 12-volt-only design. This has become more common of late with IBM, Rackable, Dell and others all doing it. However, when the board was first designed, this was considerably less common. 12V-only supplies are simpler, distributing the single voltage on-board is simpler and more efficient, and distribution losses are lower at 12V than at either 3.3V or 5V for a given sized trace. Nice work.

 

Perhaps the most innovative aspect of the board design is the use of a distributed UPS. Each board has a 12V VRLA battery that can keep the server running  for 2 to 3 minutes during power failures. This is plenty of time to ride through the vast majority of power failures and is long enough to allow the generators to start, come on line, and sync.  The most important benefit of this design is it avoids the expensive central UPS system. And, it also avoids the losses of the central UPS (94% to 96% efficient UPSs are very good and most are considerably worse). Google reported their distributed UPS was 99.7% efficient. I like the design.

 

The motherboard was otherwise fairly conventional with a small level of depopulation. The second Ethernet port was deleted as was USB and other components. I like the Google approach to server design.

 

The server was designed to be rapidly serviced with the power supply, disk drives, and battery all being Velcro attached and easy to change quickly.  The board itself looks difficult to change but I suspect their newer designs will address that shortcoming.

 

Hats off to Google for organizing this conference to get high efficiency data center and server design techniques more broadly available across the industry. Both the board and the data center designs shown in detail were not Google's very newest but all were excellent and well worth seeing. I like the approach of showing the previous generation technology to the industry while pushing ahead with newer work. This technique allows a company to reap the potential competitive advantages of its R&D investment while at the same time being more open with the previous generation.

 

It was a fun event and we saw lots of great work. Well done Google.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Sunday, April 05, 2009 3:37:12 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services

The HotPower '09 workshop will be held on October 10th at the same venue as, and right before, the Symposium on Operating Systems Principles (SOSP 2009) at Big Sky Resort, Montana. HotPower recognizes that power is becoming a central issue in the design of all systems from embedded systems to servers for high-scale data centers.

From http://hotpower09.stanford.edu/:

Power is increasingly becoming a central issue in designing systems, from embedded systems to data centers. We do not understand energy and its tradeoff with performance and other metrics very well. This limits our ability to further extend the performance envelope without violating physical constraints related to batteries, power, heat generation, or cooling.

HotPower hopes to provide a forum in which to present the latest research and to debate directions, challenges, and novel ideas about building energy-efficient computing systems. In addition, researchers coming to these issues from fields such as computer architecture, systems and networking, measurement and modeling, language and compiler design, and embedded systems will gain the opportunity to interact with and learn from one another.

If you are interested in submitting a paper to HotPower: http://hotpower09.stanford.edu/cfp.html.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Sunday, April 05, 2009 7:18:20 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings
 Wednesday, April 01, 2009

Previous “rough notes” posting: Rough Notes: Data Center Efficiency Summit.

 

Containers Based Data Center

·         Speaker: Jimmy Clidaras

·         45 containers (222KW each/max is 250KW – 780W/sq ft)

·         Showed pictures of containerized data centers

·         300x250’ of container hanger

·         10MW facility

·         Water side economizer

·         Chiller bypass

o   Limit chiller hours via raised temp inside

·         High efficiency transformers: 99.5%

·         27C (81F) cold aisle

·         Distributed UPS (each server has a lead-acid battery).

Jimmy showed videos of the containerized data center, showing the layout of the entire facility and the detail behind the container design. PUE is in the 1.25 range. This data center is listed as "Data Center A" in the Google PUE publications.

 

Overall it was a great presentation and it’s great to see this level of detail being contributed to the industry. The day continues to be super interesting.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Wednesday, April 01, 2009 1:22:44 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback

My rough notes from the first two sessions at the Data Center Efficiency Summit at Google Mountain View earlier today:

 

Data Center Energy Going Forward

·         Speaker: John Tuccillo, APC

·         Green Grid:

o   Data Collection  & Analysis

o   Data Center Technology & Strategy

o   Data Center Operations

o   Data Center Metrics & Measurements

·         Metrics team:

o   PUE & DCiE

o   DCP: Data Center Productivity

 

Insights in Google’s PUE Results

·         Speakers: Chris Malone & Ben Jai, Google

·         Chris started off by reviewing existing data from 6 data centers, averaged quarterly and published for a year (on the web):

o   All less than 1.3

o   Best at 1.16 (Google DC ‘E’)

·         Inclusion in external published data:

o   5MW or bigger and operating for more than 6 months

·         Typical PUE ~1.7

·         Google DC E

o   Mechanical: (didn't get data point)

o   Power Distribution: 4.9%

·         Achieved by rigorous application of best practices:

o   Air-side economization

o   Water-side economization

o   Close coupled cooling

o   99.9% UPS efficiency

·         99.9% UPS Efficiency (Ben Jai presenting)

o   Distributed on-board UPS

o   Single voltage motherboard (12v)

o   Motherboard provides 5v to disk and all step downs needed by on board requirements

o   Installed a lead-acid distributed UPS to ride through power sags

o   Avoids double conversion of many central UPS

o   Only enough power in UPS to allow generators to start or to switch to other A/C supply

·         Google Measurement of PUE (Chris Malone):

o   Average DC around PUE of 2.0 in 2006

o   State of the art data center around 1.2 using exotic techniques

o   2 of 6 DC report daily, 4 of 6 report continuously

o   Measure at sub-station and extrapolate to utility input at substation

o   Most measurements on the server side taken at PDUs.  On newer servers, it’s measured at PDUs (more precise).

o   Accuracy of PUE measurement at +/-2%

·         Best Google facility on quarterly basis: PUE => 1.19

o   The problem with non-annual numbers is they are skewed by the impacts of changing weather conditions.  Need to annualize to gain full insight.

o   They showed some impacts on PUE of weather factors and DC maintenance

o   Showed utilization at different facilities:

§  Ranged from clusters around 30% to clusters on the high end at 75% (amazingly high by industry standards).

 

Chris and Ben presented great material in this last section. Super interesting, very nice designs, and well presented.  The PUE measurement techniques look credible and the results are excellent.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Wednesday, April 01, 2009 10:04:28 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services

Google is hosting the Efficient Data Center Summit today at their Mountain View facility. It looks like it's going to be a great event and I fully expect we'll see more detail than ever on how high scale operators run their facilities. But, in addition, one of the goals of the event is to talk about what the industry as a whole can do to increase data center efficiency. It looks like a good event.

 

9:00 am | Registration
9:30 am | Welcome | Urs Hoelzle, Google
9:45 am | Standards from The Green Grid | John Tuccillo, The Green Grid
10:30 am | Insights Into Google's PUE | Jimmy Clidaras & Chris Malone, Google
11:15 am | What's Next for the Data Center Industry | Andrew Fanara, EPA
12:00 pm | Lunch
1:00 pm | Best Practices | James Hamilton, Amazon Web Services
1:45 pm | Google Data Center Video Tour | Jimmy Clidaras & Chris Malone, Google
2:15 pm | Best Practices Q&A | Luiz Barroso, Moderator, Google; Ken Brill, Uptime Institute; James Hamilton, Amazon Web Services; Olivier Sanche, eBay
3:00 pm | Break
3:15 pm | Sustainable Data Centers & Water Management | Joe Kava, Google
4:00 pm | Wrap-Up

 

I just flew back from China a day ago and spending more than a day in an airplane has left me with an upper respiratory issue. But I'll work through that and, barring my voice going away entirely, I'll be presenting at 1pm. I expect I'll also blog interesting points over the course of the day as well.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Wednesday, April 01, 2009 7:10:52 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Sunday, March 29, 2009

I participated in the Self Managing Database Systems closing panel titled Grand Challenges in Database Self-Management. Also on the panel were:

Ken Salem of the University of Waterloo (my alma mater) organized the panel and asked each of us to: "identify one substantial open problem related to self-managing databases - something that people interested in this area should be working on. Feel free to define "database" broadly."

 

The panel was organized to give us each 10 min to present our grand challenge followed by audience Q&A. As my topic, I chose: RDBMS Losing Workloads in the Cloud. The basic premise is that very high-scale service workloads actually often do use RDBMSs but they use them as simple ISAMs and full RDBMS functionality is rarely exploited. And, many data management tasks in new domains are done by MapReduce, Memcached, or other solutions. Basically, RDBMSs are heavily used in the cloud but only a tiny percentage of their features are exercised, and many new workloads aren't using an RDBMS at all.

 

The call to action is to focus on cost. Go where the user pain is (services optimize for cost). And, as a test, if the very first thing the largest users do is shut off auto-management, the feature isn't yet right. We should be implementing auto-management systems that the very biggest users actually choose to use. These very large customers prioritize stability over the last few percentage points of optimization. They don't want to get called in the middle of the night when a plan changes. My recommendation is to adopt a do-no-harm mantra and, failing that, detect and correct harm before it has broad impact. Be able to revert a failed optimization fast. Focus on the problems where human optimization is not possible. For example, resource allocation is extremely dynamic. The correct amount of buffer pool, sort heap, and hash join space varies with the workload and can't be effectively set by a human. This type of problem is perfect for auto-management.

 

Focus on optimizations that are 1) stable (do no harm) or 2) dynamic, where you can do better than a static, human-chosen setting.

 

I would also like to see the community define "database" to be all persistent data management rather than just applications written to relational interfaces. The problem is far larger.

 

My slides are at: http://mvdirona.com/jrh/talksAndPapers/JamesHamilton_SMDB_Panel.pdf.

 

                                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Sunday, March 29, 2009 11:52:48 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Saturday, March 28, 2009

Today, I'm at the Self Managing Database Systems workshop which is part of the International Conference on Data Engineering in Shanghai. At last year's ICDE, I participated in a panel: International Conference on Data Engineering 2008. Earlier today, I did the SMDB keynote where I presented: Cloud Computing Economies of Scale.

 

The key points I attempted to make were:

·         Utility (Cloud) computing will be a big part of the future of server-side systems. This is a lasting and fast growing economy with clear economic gains. These workloads are already substantial and growing incredibly fast. And, it’s a new frontier where there are many new tough problems to be solved. Reminiscent of the RDBMS world 20 years ago.

·         High-scale service workloads are very different from enterprise workloads. Enterprise workloads typically have people as the number 1 cost.  Utility computing affords greater scale, a deeper investment in automation and, as a consequence, people costs are actually very low. H/W costs are dominant and power and functionally related costs are soon to take over.  The optimizations affordable in the utility computing world are much different from the enterprise computing world and the cost equations and drivers are very different.

·         The Recovery Oriented Computing model is an incredibly powerful management technique that doesn't eliminate human administration but reduces it by a factor of 10, leaving only the interesting and tough problems. I argue that administrators that are working only on tough problems not amenable to automation are more effective, more valuable, and make fewer mistakes. Drudgery and repetition drive errors.

·         If workloads are partitioned, synchronously redundant, and well monitored, they can be managed by ROC techniques with a savings of over 10x possible. This is how the best services are managed and it is a technique that will (slowly) spread to the enterprise.

·         I walked through a variety of interesting management & optimization problems in the service world and pointed out that the current solutions are nowhere close to as good as they could be. Huge improvements will be made over the next decade. It's a great research area and a great area in which to be working.

 

The slides I presented are up at: http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_SMDB2009.pdf.

 

                                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Saturday, March 28, 2009 8:41:46 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Services
 Friday, March 27, 2009

There has been lots of speculation about the new name for Microsoft Search. The most prevalent speculation is that Live.com will be branded Kumo: Microsoft to Rebrand Search. Will it be Kumo?

 

Confirming that the Kumo brand is definitely the name being tested internally at Microsoft, I've noticed over the last week that the search engine referral URL www.kumo.com has been showing up frequently as the source for searches that find this blog. I suppose the brand could be changed yet again as the Microsoft internal bits are released externally. But, having been through the hassle of a brand change and knowing how much testing it really does require, I suspect we're looking at the final answer with this one.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Friday, March 27, 2009 5:18:10 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Ramblings
 Thursday, March 26, 2009

Over the last couple of years, I've been getting more interested in Erlang as a high-scale services implementation language originally designed at Ericsson. Back in May of last year I posted: Erlang and High-Scale System Software.

 

The Erlang model of spawning many lightweight processes that communicate via message passing is typically less efficient than the more common shared memory and locks approach, but it is much easier to get a correct implementation using this model. Erlang also encourages a "fail fast" programming model. Years ago I became convinced that this design pattern is one of the best ways to get high-scale systems software correct (Designing and Deploying Internet-Scale Services).

Chris Newcombe of Amazon recently presented an excellent talk on Erlang at the Berkeley RAD Lab.  The first part of Chris’ Berkeley talk on Erlang is posted here: Erlang: Productivity and Performance (ChrisNewcombe_ErlangProductivityPerformance.pdf (298.21 KB)). The second half of Chris’ talk is posted at: http://ulf.wiger.net/weblog/wp-content/uploads/2009/01/damp09-erlang-multicore.pdf (unfortunately this link is down at the time of this posting). Update: Ulf Wiger offers a live URL for his excellent slides: http://www.cse.unsw.edu.au/~pls/damp09/damp09-wiger-keynote.pdf.

In this talk Chris gives an overview of Erlang, talks about some of the advantages of the language, and then goes through some of the performance strengths and weaknesses of Erlang.

 

                                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

Thursday, March 26, 2009 6:42:25 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Software
 Friday, March 20, 2009

From Data Center Knowledge yesterday: Rackable Turns up the Heat, we see the beginnings of the next class of server innovations. This one is going to be important and have lasting impact. The industry will save millions of dollars and megawatts of power, even ignoring the capital expense reductions possible. Hats off to Rackable Systems for being the first to deliver. Yesterday they announced the CloudRack C2. CloudRack is very similar to the MicroSlice offering I mentioned in the Microslice Servers posting. These are very low cost, high efficiency, and high density server offerings targeting high-scale services.

 

What makes the CloudRack C2 particularly notable is they have raised the standard operating temperature range to a full 40C (104F).  Data center mechanical systems consume roughly 1/3 of all power brought into the data center:

       Data center power consumption:

      IT load (servers): 1/1.7=> 59%

      Distribution Losses: 8%

      Mechanical load (cooling): 33%

From: Where Does the Power Go?
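
The arithmetic behind that breakdown is straightforward. A quick sketch, assuming a facility PUE of 1.7 and the 8% distribution loss estimate above:

# Where the power goes, assuming PUE = 1.7 and 8% distribution losses.
pue = 1.7
it_load = 1.0 / pue                                  # ~59% reaches the servers
distribution_losses = 0.08                           # power distribution losses
mechanical = 1.0 - it_load - distribution_losses     # ~33% goes to cooling
print("IT %.0f%%, distribution %.0f%%, mechanical %.0f%%"
      % (100 * it_load, 100 * distribution_losses, 100 * mechanical))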

 

The best way to make cooling more efficient is to stop doing so much of it. I've been asking all server producers, including Rackable, to commit to full warranty coverage for servers operating at 35C (95F) inlet temperatures. Some think I'm nuts but a few innovators like Rackable and Dell fully understand the savings possible. Higher data center temperatures conserve energy and reduce costs. It's good for the industry and good for the environment.

 

To fully realize these industry-wide savings we need all data center IT equipment certified for high-temperature operation, particularly top-of-rack and aggregation switches.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Friday, March 20, 2009 6:25:39 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
Hardware
 Thursday, March 19, 2009

HotCloud ’09 is a workshop that will be held at the same time as USENIX ’09 (June 14 through 19, 2009). The CFP:

 

Join us in San Diego, CA, June 15, 2009, for the Workshop on Hot Topics in Cloud Computing. HotCloud '09 seeks to discuss challenges in the Cloud Computing paradigm including the design, implementation, and deployment of virtualized clouds. The workshop provides a forum for academics as well as practitioners in the field to share their experience, leverage each other's perspectives, and identify new and emerging "hot" trends in this area.

HotCloud '09 will be co-located with the 2009 USENIX Annual Technical Conference (USENIX '09), which will take place June 14–19, 2009. The exact date of the workshop will be set soon.

The call for papers is at: http://www.usenix.org/events/hotcloud09/cfp/.

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Thursday, March 19, 2009 4:22:14 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings
 Wednesday, March 18, 2009

This is the third posting in the series on heterogeneous computing. The first two were:

1.       Heterogeneous Computing using GPGPUs and FPGAs

2.       Heterogeneous Computing using GPGPUs:  NVidia GT200

 

This post looks more deeply at the AMD/ATI RV770.

 

The latest GPU from AMD/ATI is the RV770 architecture.  The processor contains 10 SIMD cores, each with 16 streaming processor (SP) units.   The SIMD cores are similar to NVidia’s Texture Processor Cluster (TPC) units (the NVidia GT200 also has 10 of these), and the 10*16 = 160 SPs are “execution thread granularity” similar to NVidia’s SP units (GT200 has 240 of these).  Unlike NVidia’s design which executes 1 instruction per thread, each SP on the RV770 executes packed 5-wide VLIW-style instructions.  For graphics and visualization workloads, floating point intensity is high enough to average about 4.2 useful operations  per cycle.  On dense data parallel operations (ex. dense matrix multiply), all 5 ALUs can easily be used.

 

The ALUs in each SP are named x, y, z, w and t.  x, y, z and w are symmetric, and capable of retiring a single precision floating point multiply-add per cycle.  The t unit is a Special Function Unit (SFU) capable of everything an xyzw ALU can do, plus transcendental functions like sin, cos, etc.  There is also a branch unit in each SP to deal with shader program branches.

 

From this information, we can see that when people are talking about 800 “shader cores” or “threads” or “streaming processors”, they are actually referring to the 10*16*5 = 800 xyzwt ALUs.  This can be confusing, because there are really only 160 simultaneous instruction pipelines.  Also, both NVidia and AMD use symmetric single issue streaming multiprocessor architectures, so branches are handled very differently from CPUs. 

 

The RV770 is used in the desktop Radeon 4850 and 4870 video cards, and evidently the “workstation” FireStream 9250 and FirePro V8700.  The Radeon 48x0 X2 “enthusiast desktop” cards have two RV770s on the same card. Like NVidia Quadro cards, the typical difference between the “desktop” and “workstation” cards is that the workstation card has anti-aliased (AA) line capability enabled (primarily for the CAD market) and it costs 5-10 times as much.    

 

[The computing cores always have AA line capability, so it’s probably more accurate to say that the desktop cards have this capability disabled.  Theoretically, foundry binning could sort processors with hard faults in the “anti-aliased line hardware” as “desktop” processors.  However, this probably never really happens since this is just a tiny bit of instruction decode logic or microcode that sends “lines” to shared setup logic that triangles are computed on.  Likewise, the NVidia Tesla boards are just GT200 processors with potentially some extra compliance testing and more (non-ECC) board memory.  Arguably, these artificially maintained high margin product lines are what keep these companies profitable; industrial design subsidizes gamers!]

 

Double precision floating point is accomplished by fusing the xyzw ALUs within an SP into two pairs.  These two double units can perform either multiply or add (but not both) each cycle.  The t unit is unaffected by this fused mode, and ALU/transcendental operations can be co-scheduled alongside the doubles just like with single precision-only VLIW issue.

 

Local card memory is 512MB of GDDR3 for the 4850 and 1GB of GDDR5 for the 4870.  Both use a 256 bit wide bus, but GDDR3 is 2 channel while GDDR5 is 4 channel.

 

Let's look at peak performance numbers for the Radeon 4870, clocked at the reference 750MHz. Keep in mind that all of the ALUs are capable of multiply-add instructions (2 flops/cycle):

= 750MHz * 10 SIMD cores * 16 SPs/SIMD core * 5 ALUs/SP * 2 flops/cycle per ALU

= 1,200,000 Mflop/s = 1.2 TFlop/s

For double precision:

= 750MHz * 10 * 16 * 2 "double FPUs" * 1 flop/cycle per "double FPU"

= 240 GFlop/s double precision + 240 GFlop/s single precision on the 160 t SFUs

Reference memory frequency is 900MHz:

= 900MHz * 4 channels * 256 bits/channel = 115 GB/s

 

Here are peak performance numbers for some RV770 cards:

Card | Single (GFlop/s) | Double (GFlop/s) | Bandwidth (GB/s) | TDP Power (W) | Cost
Radeon 4850 | 1000 | 200 | 64 | 180 | $130
Radeon 4870 | 1200 | 240 | 115 | 200 | $180
4850 X2 | 2000 | 400 | 127 | 230 | $255
4870 X2 | 2400 | 480 | 230 | 285 | $420
FireStream 9250 | 1000 | 200 | 64 | 180 | $790 (same as 4850)
FirePro V8700 | 1200 | 240 | 115 | 200 | $1130 (same as 4870)

 

The Radeon 4850 X2 is the cheapest compute capability per retail dollar available outside of DSPs and fixed function ASICs. However, its bandwidth is very low compared to its floating point horsepower: if it executes fewer than 63 floating point instructions for every F32 piece of data that must be fetched from memory, then memory bandwidth will be the bottleneck! The 4870 is better balanced, with a computational intensity breakpoint of 42. However, NVidia's cards are applicable to a wider range of workloads; the GTX 285 has a breakpoint of 27 instructions (less compute power, more bandwidth). For reference, a Core i7 is about 16, and CPU caches are much bigger than GPU "caches" so there is more opportunity to reuse data before fetching off-chip.
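
The same arithmetic is easy to restate in a few lines of Python. This is my own sketch of the calculations above (clock rates and bus widths are the reference values quoted earlier), not anything from AMD's documentation:

def peak_sp_gflops(clock_ghz, simd_cores, sps_per_core, alus_per_sp=5, flops_per_alu=2):
    # Radeon 4870: 0.75 * 10 * 16 * 5 * 2 = 1200 GFlop/s
    return clock_ghz * simd_cores * sps_per_core * alus_per_sp * flops_per_alu

def bandwidth_gb_s(mem_clock_ghz, channels, bus_bits):
    # Radeon 4870: 0.9 * 4 * 256 / 8 = 115.2 GB/s
    return mem_clock_ghz * channels * bus_bits / 8.0

def breakpoint(gflops, gb_s):
    # Flops available per 4-byte (F32) operand fetched from memory.
    return gflops / (gb_s / 4.0)

gflops_4870 = peak_sp_gflops(0.75, 10, 16)    # 1200 GFlop/s
gbs_4870 = bandwidth_gb_s(0.9, 4, 256)        # 115.2 GB/s
print("4870 breakpoint: %.1f flops per F32 fetched" % breakpoint(gflops_4870, gbs_4870))  # ~42
print("4850 breakpoint: %.1f" % breakpoint(1000.0, 64.0))                                 # ~63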

 

Thanks to Mike Marr for the research and the detailed write-up above. Errors or omissions are mine.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Wednesday, March 18, 2009 4:09:07 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Monday, March 16, 2009

In the last posting, Heterogeneous Computing using GPGPUs: NVidia GT200, I promised the next post would be a follow-on look at the AMD/ATI RV770. However, over the weekend, Niraj Tolia of HP Labs sent this my way as a follow-up on the set of articles on GPGPU programming. Prior to reading this note, I hadn't really been interested in virtualizing GPUs but the paper caught my interest and I'm posting my notes on it just ahead of the RV770 architectural review that I'll get up later in the week.

 

The paper GViM: GPU-accelerated Virtual Machines tackles the problem of implementing GPGPU programming in a virtual machine environment. The basic problem is this: if you are running N virtual machines, each of which is running 1 or more GPGPU jobs, and you have fewer than N GPGPUs physically attached to the server, then you need to virtualize the GPGPU. As covered in the last two postings, GPUs are large, very high state devices and, consequently, hard to virtualize efficiently.

 

The approaches discussed in this paper extend a trick that I first saw used in Virtual Interface Adapter communications and that is also supported by InfiniBand. I'm sure this model appeared elsewhere earlier but these are two good examples. In this networking interface model, the cost of each send and receive passing through the operating system communication path is avoided, without giving up security, by first making operating system calls to set up a communication path and to register buffers and doorbells. The doorbell is a memory location that, when written to, will cause the adapter to send the contents of the send buffer. At this point, the communications channel is set up, and all sends and receives can now be done directly in user space without further operating system interactions. It's a nice, secure implementation of Remote Direct Memory Access (RDMA).

 

This technique of virtualizing part of a communications adapter and mapping it into the address space of the application program can be played out in the GPGPU world as well to allow efficient sharing of GPUs between guest operating systems in a virtual machine environment.

 

The approach to this problem proposed in the paper is based upon three observations: 1) GPU calls are coarse-grained with considerable work done between each call, so overhead on the calls themselves doesn't dominate, 2) data transfer in and out of the device is very important and can dominate if not done efficiently, and 3) high level API access to GPUs is common. Building on the third observation, they chose to virtualize at the CUDA API level and implement CUDA over what is called, in the virtual machine world, a split driver model. In the split driver model a front end, or client, device driver is loaded into the guest O/S and it makes calls to the management domain (called dom0 in Xen). In dom0, the other half of the driver is implemented. This other half of the driver makes standard CUDA calls against the physical GPU device(s).

 

The approach taken by this paper is to implement all calls to CUDA via an interposer library that makes calls to the guest O/S driver, which makes calls to the dom0 component, which makes calls to the GPU. This effectively virtualizes the GPU device but the required call path is very inefficient. The authors note that calls to CUDA are coarse-grained and do considerable work, so the per-call inefficiency actually does get amortized out nicely as long as the data is brought to and from the device efficiently. This latter point is the tough one and this is where the memory mapping tricks I introduced above are used.

 

The authors proposed three solutions to getting data to and from the GPU:

1.       2-copy: user program allocates memory in the guest O/S using malloc.  Memory transferred to GPGPU must be first copied to host O/S kernel, then dom0 writes to the GPU.

2.       1-copy: user program and the device driver in the guest O/S kernel address space share a mapped memory space to avoid one copy of the two above.

3.       Bypass: Exploit the fact that the GPU is 100% managed by the dom0 component of the device driver and have it call cudaMallocHost() to map all GPU memory at start-up time. This maps all GPU memory into its address space. Then employ the mapping trick of point 2 above to selectively map this space into the guest application space. This has the upside of avoiding copies but the downside of statically partitioning the GPU memory space. Each app gets access to only a portion of it. Less copying and less cost on context switch but much less memory is available for each application program.

 

Summary: By choosing to virtualize at the API layer rather than at the hardware layer, the task of virtualization was made easier, with the downside that only one API is supported in this model. The authors use the split driver model to implement this level of virtualization easily on Xen, exploiting the fact that there is considerable work done per CUDA call. Finally, they manage memory efficiently using the three techniques described above.

 

If you are interested in virtualization and GPGPU programming, it’s a good read with a simple and practical approach to virtualizing GPUs: http://www.cc.gatech.edu/~vishakha/files/GViM.pdf.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Monday, March 16, 2009 6:32:50 PM (Pacific Standard Time, UTC-08:00)
 Sunday, March 15, 2009

In Heterogeneous Computing using GPGPUs and FPGAs I looked at heterogeneous computing, the application of multiple instruction set architectures within a single application program under direct programmer control. Heterogeneous computing has been around for years but usage has been restricted to fairly small niches. I’m predicting that we’re going to see abrupt and steep growth over the next couple of years. The combination of delivering results for many workloads cheaper, faster, and more power-efficiently, coupled with improved programming tools, is going to vault GPGPU programming into being a much more common technique available to everyone.

 

Following on from the previous posting, Heterogeneous Computing using GPGPUs and FPGAs, in this one we’ll take a detailed look at the NVidia GT200 GPU architecture and, in the next, the AMD/ATI RV770.

 

The latest NVidia GPU is called the GT200 (“GT” stands for Graphics Tesla). The processor contains 10 Texture/Processor Clusters (TPC), each with 3 Single Program Multiple Data (SPMD) computing cores, which NVidia calls Streaming Multiprocessors (SM). Each SM has two instruction issue ports (I’ll call them Port 0 and Port 1):

·         Port 0 can issue instructions to 1 of 3 groupings of functional units on any given cycle:

o   “SIMT” (Single Instruction Multiple Thread) instructions to 8 single precision floating point units, marketed as “Stream Processors” (SP), a.k.a. thread processors or shader cores

o   a double precision floating point unit

o   an 8-way branch unit that manages state for the SIMT execution (basically, it deals with branch instructions in shader programs)

·         Port 1 can issue instructions to two Special Function Units (SFU), each of which can process packed 4-wide vectors. The SFUs perform transcendental operations like sin and cos, or single precision multiplies (like the Intel SSE instruction MULPS)

 

From this information, you can derive some common marketing numbers for this hardware:

·         “240 stream processors” are the 10*3*8 = 240 single precision FPUs on Port 0.

·         “30 double precision pipelines” are the 10*3*1 = 30 double precision FPUs on Port 0.

·         “dual-issue” is the fact that you can (essentially) co-issue instructions to both Port 0 and Port 1.

 

The GT200 is used in the line of “GeForce GTX 2xx” commodity video cards (ex. GeForce GTX 280) and the Tesla C1060 [there will also be a Quadro NVS part].  The Tesla S1070 is a PCI bridge that packages four Tesla C1060s into a 1U rack unit – since it is just a bridge, it still requires a host rack unit to drive the GPUs.  The GeForce GTX 295 packages two GT200 processors on the same card (similar to AMD Radeon 48xx X2 cards).

 

Total transistor count is 1.4B, about twice that of an Intel quad-core Core i7 or an AMD RV770. The GeForce GTX 2x5 parts (ex. GeForce GTX 285) are die-shrunk versions of the original core: 55nm vs. 65nm. On the original 65nm process, the GT200 was 583.2 mm2, or about 6 times the surface area of a dual-core Penryn. A 300mm wafer produced only 94 processors (where 45nm Atom processors would yield about 2500).

 

Local card memory is GDDR3 configured as 2 channels with a bus width of 512 bits – typically 1GB.

 

The original GTX 260 was a GT200 with 2 of the 10 TPC units disabled (for a total of 24 SMs or 192 SPs), presumably to deal with manufacturing hard faults in some of the cores. It also disables part of the memory bus: 448 bits instead of 512, and consequently local memory is only 896MB. [Disabling parts of a chip is now a common manufacturing strategy to more fully monetize die yields on modular circuit designs; Intel has been doing this for years with L2 caches.] As the fab process improved, NVidia started shipping the GTX 260-216, which disables only 1 of the TPCs and is apparently the only GTX 260 part actually being manufactured nowadays (216 = 9*3*8, the number of shader cores with 9 TPCs enabled).

 

Let’s look at peak performance numbers for the GTX 280, reference clocked at 1296 MHz. Notice that Port 0 instructions can be multiply-adds (2 flops/cycle) and Port 1 instructions are just multiplies (1 flop/cycle):

1296 MHz * 30 SM * (8 SP/SM * 2 flops/cycle per SP + 2 SFU * 4 FPU/SFU * 1 flop/cycle per FPU)

= Port 0 throughput + Port 1 throughput = 622,080 MFlop/s + 311,040 MFlop/s = 933 GFlop/s single precision

For double precision:

                1296 MHz * 30 SM * 1 double precision FPU * 2 flops/cycle = 78 GFlop/s

The Port 1 units can be co-issued with double precision instructions, so the chip can also process 311 GFlop/s of single precision multiplies while doing double precision multiply-adds. [That’s probably not terribly useful without single precision adds though.]

 

Reference memory frequency is 1107 MHz:

                1107 MHz * 2 channels * 512 bits/channel = 142 GB/s
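
As a sanity check, the short program below just re-does the peak-rate arithmetic from the formulas above for the reference-clocked GTX 280; the clock rates and unit counts are the ones quoted in this post.

  #include <stdio.h>

  int main(void)
  {
      const double shader_mhz = 1296.0;   /* reference shader clock */
      const int    sm         = 30;       /* 10 TPC * 3 SM */
      const int    sp_per_sm  = 8;        /* single precision FPUs on Port 0 */
      const int    sfu_lanes  = 2 * 4;    /* 2 SFUs * 4-wide vectors on Port 1 */

      double port0 = shader_mhz * sm * sp_per_sm * 2.0;  /* multiply-add: 2 flops */
      double port1 = shader_mhz * sm * sfu_lanes * 1.0;  /* multiply: 1 flop */
      double dp    = shader_mhz * sm * 1 * 2.0;          /* 1 DP FPU per SM */

      const double mem_mhz = 1107.0;
      double gb_per_s = mem_mhz * 2.0 * 512.0 / 8.0 / 1000.0;

      printf("single precision: %.0f GFlop/s\n", (port0 + port1) / 1000.0);  /* 933 */
      printf("double precision: %.0f GFlop/s\n", dp / 1000.0);               /* 78  */
      printf("memory bandwidth: %.0f GB/s\n", gb_per_s);                     /* 142 */
      return 0;
  }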

 

Here are the peak performance numbers for various parts:

                                                Single Precision (GFlop/s)    Double Precision (GFlop/s)    Bandwidth (GB/s)

·         GTX 260-216:      805                                         67                                           112

·         GTX 280:              933                                         78                                           142

·         GTX 285:              1062                                       89                                           159

·         GTX 295:              1789                                       149                                         224

·         Tesla C1060:       933                                         78                                           102

Notice the GTX 285 breaks the 1 Teraflop/s single precision barrier on a single GPU. The Tesla card has the lowest bandwidth; this is presumably because it carries 4GB of local memory instead of just 1GB as on the GTX 285 (more memory typically requires a lower bus clock rate). Finally, notice that even the GTX 285 gets less than twice the double precision throughput of an AMD Phenom II 940 or Intel Core i7, both of which deliver about 50 GFlop/s in double precision and don’t require sophisticated latency-hiding data transfers or a complex programming model.

 

Thanks to Mike Marr for the research and the detailed write-up above. Errors or omissions are mine.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com 

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Sunday, March 15, 2009 5:18:58 AM (Pacific Standard Time, UTC-08:00)
 Saturday, March 14, 2009

It’s not at all uncommon to have several different instruction sets employed in a single computer. Decades ago, IBM mainframes had I/O processing systems (channel processors). Most client systems have dedicated graphics processors. Many networking cards offload the transport stack (TCP/IP offload). These are all examples of special purpose processors used to support general computation. The application programmer doesn’t directly write code for them.

 

I define heterogeneous computing as the application of processors with different instruction set architectures (ISAs) under direct application programmer control. Even this form of heterogeneous processing has been around for years, in that application programs have long had access to dedicated floating point coprocessors with instructions not found on the main CPU. FPUs were first shipped as coprocessors but have since been integrated on-chip with the general CPU. FPU complexity has usually been hidden behind compilers that generate FPU instructions when needed or by math libraries that can be called directly by the application program.

 

It’s difficult enough to program symmetric multi-processors (SMPs), where the application program runs over many identical processors in parallel. Heterogeneous processing typically also employs more than one processor, but these processors don’t all share the same ISA. Why would anyone want to accept this complexity? Speed and efficiency. General purpose processors are, well, general. And as a rule, general purpose processors are easy to program but considerably less efficient than specialized processors at some operations. Graphics can be several orders of magnitude more efficient in silicon than in software and, as a consequence, almost all graphics is done on graphics processors. Network processing is another example of a very repetitive task where in-silicon implementations are at least an order of magnitude faster. As a consequence, it’s not unusual to see network switches where the control plane is implemented on a general purpose processor but the data plane is all done in an Application Specific Integrated Circuit (ASIC).

 

Looking at still more general systems that employ heterogeneous processing, newer supercomputers like RoadRunner, which took the top spot in the supercomputer Top500 list last June, are good examples. RoadRunner is a massive cluster of 6,562 x86 dual-core processors and 12,241 IBM Cell processors. The Cell processor was originally designed by Sony, Toshiba, and IBM and was first commercially used in the Sony PlayStation 3. The Cell processors themselves are heterogeneous components made up of 9 processors: 1 control processor called a Power Processing Element (PPE) and 8 Synergistic Processing Elements (SPEs). The bulk of the application performance comes from the SPEs, but they can’t run without the PPE, which hosts the operating system and manages the SPEs. Although RoadRunner consumes a prodigious 2.35MW (more than a small power plant), it is actually much more efficient than comparably performing systems not using heterogeneous processing.

 

Hardware specialization can be cheaper, faster, and far more power efficient.  Traits that are hard to ignore.  Heterogeneous systems are beginning to look pretty interesting for some very important commercial workloads.  Over the last 9 months I’ve been interested in two classes of heterogeneous systems and their application to commercial workloads:

·         GPGPU: General Purpose computation on Graphics Processing Units (GPUs)

·         FPGA: Field Programmable Gate Array (FPGA) coprocessors

 

I’ve seen both techniques used experimentally in petroleum exploration (seismic analysis) and in hedge fund analysis clusters (financial calculations). GPGPUs are being used commercially in rendering farms. Research work is active across the board. Programming tools are emerging to make these systems easier to program.

 

Heterogeneous computing is being used commercially and usage is spreading rapidly. In the next two articles I’ll post guest blog entries from Mike Marr describing the hardware architecture of two GPUs, the NVidia GT200 and the AMD RV770. In a subsequent article I’ll look more closely at a couple of FPGA options available for mainstream heterogeneous programming.

 

                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Saturday, March 14, 2009 4:52:39 PM (Pacific Standard Time, UTC-08:00)

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.
