James Hamilton's Blog
 Friday, July 04, 2008

Recent results from two academic researchers in Japan may prove significant to the NAND Flash market: http://www.electronicsweekly.com/Articles/Article.aspx?liArticleID=44028&PrinterFriendly=true.  Clearly the trip from laboratory to volume production is often longer than early estimates suggest, but these results look important.

 

Back in 2006, Jim Gray argued in Tape is Dead, Disk is Tape, Flash is Disk, & Ram Locality is King that we need a new layer in the storage hierarchy between memory and disk, and that NAND Flash was an excellent candidate.  Early NAND Flash-based SSDs could sustain read rates well beyond 10x disk random IO rates, but the write rates were terrible. Some were as bad as 1/5 the rate of magnetic disk. Second generation devices are solving the random write problem as expected.  Costs continue to plunge, overall performance continues to improve, and many very high scale server workloads have been deploying flash devices over the past year. A success by most measures, but two issues remain. The first is that one important metric is heading in the wrong direction: endurance.  NAND flash can only support a limited number of erase cycles before failing.  The second is that many don’t expect feature size to be reduced below 32 nm. If that limit holds, the improvement rate will slow dramatically.

 

When I first got interested in single level cell (SLC) NAND Flash, most published endurance numbers were in the 10^6 cycle range. Most current devices are in the 10^5 range and many see as low as 10^4 cycles on the horizon.  A million cycles is fine and will not restrict the life of the device.  100,000 cycles is closer to the line, but my back-of-envelope numbers suggest 100k will (barely) be acceptable.  10k cycles is a problem and will restrict the longevity of the device.
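
A rough sketch of this kind of back-of-envelope estimate follows. The 64 GB capacity, 20 MB/s sustained write rate, and write amplification of 2 are illustrative assumptions rather than figures for any real device, and ideal wear leveling is assumed throughout.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def device_life_years(capacity_gb, erase_cycles, host_write_mb_per_s,
                      write_amplification=1.0):
    """Years until every block hits its erase limit, assuming ideal wear leveling."""
    # Total data the flash can absorb: every block erased `erase_cycles` times.
    total_flash_mb = capacity_gb * 1024 * erase_cycles
    # Each host write costs `write_amplification` writes inside the device.
    flash_mb_per_s = host_write_mb_per_s * write_amplification
    return total_flash_mb / flash_mb_per_s / SECONDS_PER_YEAR


# Hypothetical device and workload: 64 GB, 20 MB/s of sustained writes
# around the clock, write amplification of 2.
for cycles in (10**6, 10**5, 10**4):
    years = device_life_years(64, cycles, 20, write_amplification=2.0)
    print(f"{cycles:>9,} erase cycles: about {years:.1f} years")
```

Under these assumptions the three endurance levels work out to roughly 52, 5, and 0.5 years, which lines up with the fine, barely acceptable, and problematic split described above. A lighter write load stretches all three proportionally.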

 

In this research work, Shigeki Sakai and Ken Takeuchi show how Ferroelectric gate Field Effect Transistors can dramatically improve durability, reduce the required programming voltage, improve performance, and support further generational reductions in feature size.  The device prototype they demonstrated uses 6V to program rather than 20V, which may slightly reduce the cost or increase the speed of devices.  What’s most important in the demonstrated results is estimated endurance in the 10^8 cycle range, which is at least three orders of magnitude better than most current generation NAND parts.  That would take endurance completely off the NAND Flash worry list.

 

Potential feature size reduction is the other improvement of interest in this result.  Feature size reduction is the engine of Moore’s law and drives the semiconductor economics we’ve all become used to.  Many experts don’t expect to be able to reduce NAND flash feature size below roughly 30nm.  The Fe-NAND result shows potential for two more generational feature size reductions, down to the 10nm range. This is important in that it drives cost reductions, and we all want them to continue.

 

Fe-NAND looks extremely interesting and, if the research can be confirmed and is manufacturable, we have a very significant technology that can address the two major concerns with current generation NAND flash: 1) rapidly falling endurance, and 2) the expected inability to drop below 32nm feature size.  Flash continues to build industry momentum.

 

                                --jrh

 

Thanks to Jack Creasey for sending this my way.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Friday, July 04, 2008 5:55:51 AM (Pacific Standard Time, UTC-08:00)
Hardware
 Wednesday, July 02, 2008

Updated below with additional implementation details.

 

Last week Spansion made an interesting announcement: EcoRAM, a NOR Flash based storage part in a Dual In-line Memory Module (DIMM) package. 

 

NOR Flash technology growth has been fueled by the NOR support for Execute in Place (XIP).  Unlike the NAND Flash interface, where entire memory pages need to be shifted into memory to be operated upon, NOR flash is directly addressable.  This direct addressability allows instructions to be read and executed directly from the memory.  There is no need to shift pages out one at a time. Byte addressability and support for XIP make NOR ideal for boot loaders, ROMs, and the control program store for consumer devices. For example, the iPod Nano uses a Silicon Storage Technology 39WF800A 8-Megabit NOR boot flash (eeTimes).

 

Since NAND flash is not byte addressable, providing only a block mode interface, it is typically attached to PCs and servers as an I/O device.  The NOR support for direct byte addressability makes it a candidate for attachment as memory rather than as a block mode I/O device, and when I first read the press release I thought this was what Spansion had done in partnership with Virident Systems. It’s clear they have NOR Flash memory in a memory (DIMM) package and they refer to it throughout the press release as “memory extension”. However, upon closer inspection, it appears to require that the NOR memory DIMM packages all be installed in a separate gateway server they refer to as a “Green Gateway”.  It looks like the design has all the NOR flash in this separate server and a device driver on the host to virtualize memory on the NOR flash server. Essentially, it may still be accessed via an I/O interface, which is to say it’s not clear why you couldn’t do the same thing with NAND Flash.  And it’s not immediately clear what protocol is used, what operating systems are supported, or what the exact performance is but, overall, it still looks interesting.

 

Update: In conversations with Virident, it appears this part is potentially more interesting than I initially speculated. Rather than hosting the memory in an independent server as I speculated, it’s an in-server design but does require some BIOS engineering. From Virident: The current interconnect is the HTx bus for AMD servers.  Will be QPI for Intel.  We are doing AMD first.  You should be able to install on a standard two socket board.  DRAM sits behind the processor, and EcoRAM sits behind the controller.  Of course, the BIOS for the system must support the extended memory – we have HP systems up and running as a proof of concept, and Dell should work fairly soon.

 

HTX and QPI open up big opportunities for hardware startups to innovate.  I know of many startups heading down this path. More innovation coming.

 

EcoRAM looks like it’s worth investigating in more detail.

 

                                --jrh

 

A (slightly) more detailed presentation is available at: http://www.spansion.com/about/news/events/Transforming_the_Internet_Data_Center.pdf.  Some interesting speeds and feeds from the press release and the presentation:

 

·         1/8th the power of DRAM at a given capacity,

·         an estimate that the 8x power-per-capacity advantage over DRAM will grow to a full 16x by 2012,

·         10x the reliability of DRAM,

·         smaller die area per bit,

·         much closer to DRAM access times (a bit vague on this one).

 

The Achilles heel of NOR Flash has been the poor write speed.  The press release claims 2x to 10x better than traditional NOR Memories but this is still considerably slower than DRAM.

 

We need a lot more technical data and repeatable performance measures but, with what has been published so far, it would appear that the sweet spot for this device is very high random IO rate, read-mostly workloads.  Potentially fairly interesting.

 

                                                --jrh

Thanks to Son VoBa of the Windows Virtualization team for sending this my way.

 


 

Wednesday, July 02, 2008 5:14:17 AM (Pacific Standard Time, UTC-08:00)
Hardware
 Wednesday, June 11, 2008

Jeff Dean gave a great talk at Google IO this year. Some key points from Steve Garrity (msft pm) and some notes from the excellent write-up at Google spotlights data center inner workings:

·         Prefers many unreliable servers to fewer high-cost servers

·         Single search query touches 700 to 1,000 machines in < 0.25 sec

·         36 data centers containing > 800K servers

o   40 servers/rack

·         Typical H/W failures: Install 1000 machines and in 1 year you’ll see: 1000+ HD failures, 20 mini switch failures, 5 full switch failures, 1 PDU failure

·         There are more than 200 Google File System clusters

·         The largest BigTable instance manages about 6 petabytes of data spread across thousands of machines

·         MapReduce is increasingly used within Google.

o   29,000 jobs in August 2004 and 2.2 million in September 2007

o   Average time to complete a job has dropped from 634 seconds to 395 seconds

o   Output of MapReduce tasks has risen from 193 terabytes to 14,018 terabytes

·         Typical day will run about 100,000 MapReduce jobs

o   each occupies about 400 servers

o   takes about 5 to 10 minutes to finish
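
To put the daily MapReduce figures above in perspective, here is a small sketch of how many servers that load would keep busy on average. It assumes each job holds its roughly 400 servers for its full 5 to 10 minute run and that jobs never share machines; both are simplifications, not details from the talk.

```python
MINUTES_PER_DAY = 24 * 60


def average_busy_servers(jobs_per_day, servers_per_job, minutes_per_job):
    """Average number of servers occupied around the clock by the job stream."""
    server_minutes = jobs_per_day * servers_per_job * minutes_per_job
    return server_minutes / MINUTES_PER_DAY


# Figures from the notes above: 100,000 jobs/day, ~400 servers each,
# 5 to 10 minutes per job.
low = average_busy_servers(100_000, 400, 5)
high = average_busy_servers(100_000, 400, 10)
print(f"roughly {low:,.0f} to {high:,.0f} servers busy on average")
```

Under these assumptions the MapReduce load alone is equivalent to somewhere between about 139,000 and 278,000 servers running flat out all day.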

 

More detail on the typical failures during the first year of a cluster from Jeff:

·         ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)

·         ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)

·         ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)

·         ~1 network rewiring (rolling ~5% of machines down over 2-day span)

·         ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)

·         ~5 racks go wonky (40-80 machines see 50% packetloss)

·         ~8 network maintenances (4 might cause ~30-minute random connectivity losses)

·         ~12 router reloads (takes out DNS and external vips for a couple minutes)

·         ~3 router failures (have to immediately pull traffic for an hour)

·         ~dozens of minor 30-second blips for dns

·         ~1000 individual machine failures

·         ~thousands of hard drive failures
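
As a rough illustration of what a few of these events cost, the sketch below converts three of the entries above into machine-hours lost per year. It treats the approximate figures as exact and simply pairs the low ends and the high ends of each range, so the results are loose bounds and nothing more.

```python
# (events per year, (min, max) machines affected, (min, max) hours to recover)
# Values are taken from the list above, with the "~" treated as exact.
EVENTS = {
    "PDU failure": (1, (500, 1000), (6, 6)),
    "rack move": (1, (500, 1000), (6, 6)),
    "rack failure": (20, (40, 80), (1, 6)),
}


def machine_hours_lost(count, machines, hours):
    """Return (low, high) bounds on machine-hours lost per year."""
    low = count * machines[0] * hours[0]
    high = count * machines[1] * hours[1]
    return low, high


for name, event in EVENTS.items():
    low, high = machine_hours_lost(*event)
    print(f"{name:<13} {low:>6,} to {high:>6,} machine-hours per year")
```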

 

A pictorial history of Google hardware, from Jeff’s talk at the 2007 Seattle Scalability Conference, starting with the current generation of servers and working backwards:

[Photos: Current Generation Google Servers; Google Servers 2001; Google Servers 2000; Google Servers 1999; Google Servers 1997]

My general rule on hardware is that, if you have a viewing window into the data center, you are probably spending too much on servers. The Google model of cheap servers with software redundancy is the only economic solution at scale.

 

Other notes from Google IO:

·         http://perspectives.mvdirona.com/2008/05/29/RoughNotesFromSelectedSessionsAtGoogleIODay1.aspx

·         http://perspectives.mvdirona.com/2008/05/29/IO2008RoughNotesFromMarissaMayerDay2KeynoteAtGoogleIO.aspx

·         http://perspectives.mvdirona.com/2008/05/30/IO2008RoughNotesFromSelectedSessionsAtGoogleIODay2.aspx

 

All pictures above courtesy of Jeff Dean.

 

                                                --jrh

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

 

Wednesday, June 11, 2008 4:57:56 AM (Pacific Standard Time, UTC-08:00)
Hardware
 Friday, June 06, 2008

There was an interesting talk earlier today at Microsoft Research by Jason Cong of the UCLA Computer Science Department on compiling design specifications in C/C++/SystemC, together with user constraints, into ASIC and FPGA designs. The advantages of compiler-based approaches include higher productivity from working at a higher level, automated verification, automated optimization, and rapid experimentation at different frequencies and with different optimization goals (performance vs power, for example). As design complexity increases, higher level languages and optimization have excellent potential.  Essentially the same thing is happening in hardware design as happened 30 years ago in operating systems implementation languages: high level implementation languages replace lower level ones as complexity climbs.  For example, the Intel Core 2 Duo processor is a 1B transistor implementation whereas the 386 was 1/1000th of that complexity at 1M.

 

Also super interesting was the example of a financial institution that takes the hottest parts of a software-based stock analysis system and compiles them to FPGA implementations. The result: 30x faster at 1/10th the power. Very cool.
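
Speed and power compound when you look at energy per unit of work. The sketch below uses the 30x speed-up and the 6W vs 68W figures from the notes further down. It assumes those are average power draws for the accelerated kernel alone and ignores the rest of the host system.

```python
def energy_ratio(software_watts, fpga_watts, speedup):
    """How many times more energy the software version uses per unit of work.

    Energy = power x time, and the FPGA finishes the same work in
    1/speedup of the time.
    """
    return (software_watts * speedup) / fpga_watts


# 68W software implementation vs 6W FPGA implementation, 30x speed-up.
print(f"software uses about {energy_ratio(68, 6, 30):.0f}x the energy per computation")
```

Under those assumptions the FPGA version needs roughly 1/340th of the energy for the same computation.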

 

Now that AMD supports HyperTransport, it is possible to implement custom processors with excellent overall system performance.  Intel has opened up the FSB and is also expected to offer a non-compatible HyperTransport-like implementation in the future.

 

My rough notes from the talk follow:

 

·         Speaker: Jason Cong, UCLA Computer Science (cong@cs.ucla.edu)

·         Working on on-chip interconnects & communications

o   3D IC design

o   RF-interconnects

§  Note that power restrictions limit processors to ~5 GHz

·         But communications lines can scale to 100s of GHz

·         Dividing communications link into 10 or more “channels” that operate at different frequencies

·         This talk focused on  ESL SystemC to FPGA compiler

·         Why?

o   700,000 lines of RTL for a 10M gate design is too much

o   Allows executable specification

o   Verification requires executable design

o   Accelerated computing or reconfigurable computing also need C/C++ based compilation/synthesis to FPGAs

§  CPUs coupled with FPGA to support common functions at high performance and lower power

·         Note that performance limited by communications (getting data to the CPU)

o   Long wires that have to be traversed in a single clock are the limiting factor

o   This research focuses on supporting multi-cycle communications

·         xPilot: Behavior-to-RTL (Register Transfer Level design) synthesis flow

o   takes behavior spec in C/SystemC to front-end compiler to SSDM

o   SSDM is optimized using standard compiler optimization (loop unrolling, strength reduction, scheduling, etc.)

o   SSDM is compiled to:

§  Verilog/VHDL/SystemC

§  FPGAs: Altera, Xilinx

§  ASICs: Magma, Synopsys

o   UPS: Uniform Power Specification

o   During final compilation, optimize for power: shut off compute units that are not being used, and shut off those that are being used during their idle periods (even a busy disk controller is frequently waiting on the mechanicals and does not need to execute instructions)

§  Can’t shut off FPGAs but can with ASICs

§  The only solution to power leakage is to shut the component off

o   Allows faster experimentation than hand coding.  You can try different frequencies and different power optimizations (too complex for most humans)

o   Scheduling (allocation of operations to compute logic and specific clock cycles) is NP-complete and automated techniques can exceed quality of expert designs

·         Example:

o   Schedule the behavior to RTL using the following characterization, cycle time, constraints, and objectives:

§  Platform characterization: adder (2ns) & multiplier (5ns)

§  Target cycle time (10ns)

§  Resource constraint: only one multiplier is available

§  Objective: high performance or lower power as examples

·         Note that as optimization reduces component counts, less space is required, which can allow faster clocking

·         Investigating compilation for Reconfigurable Accelerated Computing

o   Take the GCC 3DES implementation and synthesize an FPGA RTL description

o   Example took 3DES from a C level implementation to an FPGA (Xilinx Virtex-5)

·         Investment bank is using this tool to compile financial optimizations from S/W implementation to FPGA accelerators (Black-Scholes S/W kernel)

o   30x speed-up over software implementation and 1/10 the power (6 vs 68W)
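
The scheduling example in the notes above (a 2ns adder, a 5ns multiplier, a 10ns target cycle, and a single multiplier) can be sketched with a simple greedy list scheduler that chains operations within a cycle. This is a minimal illustration, not xPilot's actual algorithm. The expression being scheduled and all names in the code are hypothetical.

```python
def schedule(ops, delays, cycle_ns, multipliers=1):
    """Greedy list scheduling with operation chaining.

    `ops` maps a name to (kind, dependencies) and must be listed in
    dependency order. Returns name -> (cycle, ns into that cycle at
    which the result is ready).
    """
    if multipliers < 1 or max(delays.values()) > cycle_ns:
        raise ValueError("no feasible single-cycle placement for some operation")
    finish = {}
    mults_in_cycle = {}
    for name, (kind, deps) in ops.items():
        # Earliest point at which all inputs are available.
        cycle, offset = max((finish[d] for d in deps), default=(0, 0))
        # Slip to the start of the next cycle until the operation both fits
        # in the remaining cycle time and finds a free multiplier.
        while (offset + delays[kind] > cycle_ns or
               (kind == "mul" and mults_in_cycle.get(cycle, 0) >= multipliers)):
            cycle, offset = cycle + 1, 0
        if kind == "mul":
            mults_in_cycle[cycle] = mults_in_cycle.get(cycle, 0) + 1
        finish[name] = (cycle, offset + delays[kind])
    return finish


DELAYS = {"add": 2, "mul": 5}  # ns, from the platform characterization
# Hypothetical behavior to schedule: y = a*b + c*d + e
OPS = {
    "m1": ("mul", []),
    "m2": ("mul", []),
    "s1": ("add", ["m1", "m2"]),
    "y": ("add", ["s1"]),
}

one_mult = schedule(OPS, DELAYS, cycle_ns=10, multipliers=1)
two_mult = schedule(OPS, DELAYS, cycle_ns=10, multipliers=2)
print("cycles with one multiplier:", one_mult["y"][0] + 1)
print("cycles with two multipliers:", two_mult["y"][0] + 1)
```

With a single multiplier the two multiplies land in different cycles and the expression takes two cycles. Allowing a second multiplier lets everything chain into one 10ns cycle (5 + 2 + 2 = 9ns), which is the kind of performance versus resource trade-off the tool explores automatically.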

 

A related presentation: http://cadlab.cs.ucla.edu/~cong/slides/fpt05_xpilot_final.pdf.

 


 

 

Friday, June 06, 2008 10:52:35 AM (Pacific Standard Time, UTC-08:00)
Hardware
 Thursday, May 01, 2008

The years of Moore’s law growth without regard to power consumption are now over. On the data center side, power isn’t close to the largest cost of running a large service but it is one of the largest controllable costs and it has been in the press frequently of late.  On the client side, battery power is the limiting factor. 

 

It is worth understanding what devices consume the most power since most laptops provide some form of user control.   Most systems allow LCD backlight dimming, the CPU power consumption can be lowered (a combination of factors including reducing clock speed and voltage), wireless radios can be switched off, and disk activity can be curtailed or eliminated.  Where does the power go?

 

The data below was measured by Mahesri and Vardhan with a ThinkPad R40 as the system under test:

 

Device | Standby | Minimum | Maximum
CPU | - | 11.3W | 25.5W
CD-R/RW, DVD | 0.0W | 2.8W | 5.3W
LCD Backlight | - | 0.6W | 3.5W
Wireless (802.11) | 0.1W | 1.0W | 3.1W
HDD (40GB@4,200RPM) | 0.2W | 0.6W | 2.8W
LCD | - | 0.9W | 1.0W

 

 Data from: http://www.crhc.uiuc.edu/~mahesri/classes/project_report_cs497yyz.pdf.

 

The dominant consumer by a significant factor is the CPU.   This power consumption is, of course, very load dependent particularly in multi-core systems where the spread between minimum and maximum power dissipation is even higher. The second largest consumer is the LCD backlight, which isn’t surprising.  Two LCD-related findings that I did find surprising: 1) the LCD without backlight is a very light consumer of power, and 2) there is a perceptible difference in power consumption between mostly black and mostly white backgrounds.   The hard disk drive power consumption was notably less than I expected with only 2.8W dissipated during active reading.
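
To make the comparison concrete, the sketch below computes each device's share of the total using the largest figure reported for each device in the table above. It assumes that figure is the device's maximum draw, and it covers only the listed devices (no memory, chipset, or power conversion losses), so these are not whole-system shares.

```python
# Largest reported figure per device, in watts, from the table above.
MAX_WATTS = {
    "CPU": 25.5,
    "CD-R/RW, DVD": 5.3,
    "LCD backlight": 3.5,
    "Wireless (802.11)": 3.1,
    "HDD": 2.8,
    "LCD": 1.0,
}


def power_shares(watts_by_device):
    """Fraction of the summed power attributable to each device."""
    total = sum(watts_by_device.values())
    return {device: watts / total for device, watts in watts_by_device.items()}


shares = power_shares(MAX_WATTS)
for device, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{device:<18} {share:6.1%}")
```

On these numbers the CPU accounts for roughly 62% of the listed devices' combined maximum draw, several times more than any other single device.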

 

I wrote up more detail in: ClientSidePower6_External.doc (130 KB).

 

                                                 --jrh

 


 

Thursday, May 01, 2008 4:49:55 AM (Pacific Standard Time, UTC-08:00)
Hardware
 Friday, April 25, 2008

Flash SSDs in laptops have generated considerable excitement over the last year and are in use at both extremes of the laptop market.  At the very low end, where only very small storage amounts can be funded, NAND Flash is below the disk price floor.  Mechanical disks, with all their complexity, are very difficult to manufacture for less than $30 each.  What this means is that for very small storage quantities, NAND Flash storage can actually be cheaper than mechanical disk drives even though the price per GB for Flash is higher. That’s why the One Laptop Per Child project uses NAND flash for persistent storage.  At the high end of the market, NAND flash is considerably more expensive than disk but, for the premium price, offers much higher performance, more resilience to shock and high G handling, and longer battery life.
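
The price floor argument can be sketched in a few lines. The $30 floor comes from the paragraph above. The per-GB prices ($3.00 for flash, $0.25 for disk) are hypothetical values chosen only to show the shape of the crossover.

```python
def cheaper_medium(capacity_gb, flash_usd_per_gb, disk_usd_per_gb,
                   disk_floor_usd=30.0):
    """Return (winner, flash_cost, disk_cost) for a given capacity."""
    flash_cost = capacity_gb * flash_usd_per_gb
    # A disk cannot be built for less than the floor, however small it is.
    disk_cost = max(disk_floor_usd, capacity_gb * disk_usd_per_gb)
    winner = "flash" if flash_cost < disk_cost else "disk"
    return winner, flash_cost, disk_cost


# Hypothetical prices: flash at $3.00/GB, disk at $0.25/GB.
for gb in (2, 4, 8, 16, 64):
    winner, flash_cost, disk_cost = cheaper_medium(gb, 3.00, 0.25)
    print(f"{gb:>3} GB: flash ${flash_cost:6.2f}, disk ${disk_cost:6.2f} -> {winner}")
```

With these assumed prices flash wins below 10 GB, because the disk never gets cheaper than its floor, and disk wins above it.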

 

Recently there have been many reports of high-end SSD laptop performance problems.  Digging deeper, this is driven by two factors: 1) gen 1 SSDs produce very good read performance but aren’t particularly good on random write workloads, and 2) performance degradation over time.  The first factor can be seen clearly in this performance study using SQLIO: http://blogs.mssqltips.com/blogs/chadboyd/archive/2008/03/16/ssd-and-sql-sqlio-performance.aspx.  The poor random write performance issue is very solvable using better Flash wear leveling algorithms, reserving more space (more on this later), and capacitor backed DRAM staging areas. In fact, STEC ZeusIOPS is producing great performance numbers today, Fusion IO is reporting great numbers, and many others are coming.  The first problem, that of poor random write performance, can be solved, and these solutions will migrate down to the commodity drives.

 

The second problem, the performance degradation issue, is more interesting.  There have been many reports of laptop dissatisfaction and very high return rates (Returns, technical problems high with flash-based notebooks). Dell has refuted these claims (Dell: Flash notebooks are working fine), but there are lingering anecdotal complaints of degrading performance. I’ve heard it enough myself that I decided to dig deeper.  I chatted off the record with an industry insider on why SSDs appear to degrade over time.  Here’s what I learned (released with their permission):

 

On a pristine NAND SSD made of quality silicon to ensure write amplification remaining at 1 [jrh: write amplification refers to the additional writes that are caused by a single write due to wear leveling and the Flash erase block sizes being considerably larger than the write page size – the goal is to get this as close to 1 as possible where 1 is no write amplification], given a not-so-primitive controller and reasonable over-provisioning (greater than 25%), a sparsely used volume (less than half full at any time) will not start showing perceptible degraded performance for a long time (perhaps as long as 5 years, the projected warranty period to be given to these SSD products).

 

If any of the above conditions is changed, the write amplification will quickly degrade ranging from 2 to 5, or even higher.  That contributes to the early start of perceptible degraded write performance.  That is, on a fairly full SSD you’d start having perceptible write performance problems more quickly, and so on.

 

Inexpensive (cheap?) SSD made of low-quality silicon will likely have more read errors.  Error correction techniques will still guarantee correct information being returned on reads.  However, each time a read error is detected, the whole “block” of data will have to be relocated elsewhere on the device.  A not-so-well designed controller firmware will worsen the read delay, due to poorly implemented algorithms and ill-conceived space layout that take longer to search for available space for the relocated data, away from the read error area.

 

If the read-error-data-relocation happens to collide with the negative conditions that plague the write performance above, you’d start seeing overall degraded performance very quickly.

 

Chkdsk may have contributed to the forced relocation of the data away from where read errors occurred, hence improving the SSD performance (for a while) until the above collisions happen.  Perhaps the same when Defrag is used.

 

In short, performance degradation over time is unavoidable with SSD devices.  It’s a matter of how soon it kicks in and how bad it gets; and it varies across designs.

 

We expect the enterprise class SSD devices to be as much as 100% over-provisioned (e.g., a 64GB SSD actually holds 128GB of flash silicon). 

 

Summary: there are two factors in play. The first is that SSD random write performance is not great on low end parts, so ensure you understand the random write I/O specification before spending on an SSD. The second is more insidious in that, in this failure mode, performance just degrades slowly over time.  The best way to avoid this phenomenon is to 2x over-provision.  If you buy N bytes of SSD, don’t use more than ½N, and consider either running chkdsk or copying the data off, bulk erasing, and sequentially copying it back on. We know over-provisioning is effective. The latter techniques are unproven but seem likely to work. I’ll report supporting performance studies or vendor reports when either surfaces.
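
The two quantities discussed above are easy to pin down in code. In this sketch, over-provisioning is expressed as spare flash relative to usable capacity, which matches the 64GB/128GB example, and write amplification is flash bytes written divided by host bytes written. The page size and the relocation count in the example are hypothetical.

```python
def over_provisioning(physical_gb, usable_gb):
    """Spare flash as a fraction of usable capacity (1.0 means 100%)."""
    return (physical_gb - usable_gb) / usable_gb


def write_amplification(flash_bytes_written, host_bytes_written):
    """Bytes the device actually writes per byte the host asked it to write."""
    return flash_bytes_written / host_bytes_written


# Enterprise example from the text: 64GB usable on 128GB of flash.
print(f"enterprise over-provisioning: {over_provisioning(128, 64):.0%}")

# The 2x rule of thumb: buy N, use no more than N/2.
print(f"half-full 64GB drive: {over_provisioning(64, 32):.0%}")

# Hypothetical case: the host writes one 4 KB page, and the controller must
# relocate four other live pages to free up the erase block.
page = 4096
print(f"write amplification: {write_amplification(5 * page, page):.1f}")
```

Using only half of a drive gives the controller the same 100% headroom as the enterprise design, which is why the 2x rule of thumb is a reasonable stand-in on commodity parts.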

 

                                                --jrh

 


 

 

Friday, April 25, 2008 4:15:29 AM (Pacific Standard Time, UTC-08:00)
Hardware
 Saturday, April 12, 2008

The only thing worse than no backups is restoring bad backups. A database guy should get these things right.  But, I didn’t, and earlier today I made some major site-wide changes and, as a side effect, this blog was restored to December 4th, 2007.  I’m working on recovering the content and will come up with something over the next 24 hours. However it’s very likely that comments between Dec 4th and earlier today will be lost.  My apologies.

 

U