Disks, Lies, and Damn Disks

How do you ensure that data written to disk is REALLY on disk? Yeah, I know, this shouldn’t be hard, but the I/O stack is deep, everyone is looking for performance, and everyone is caching along the way, so it’s more interesting than you might like. If you are writing code that needs reliable write-through semantics, such as write-ahead logging, then you need to ensure you are writing through to the media. If you are writing to a SAN or a SCSI disk, it’s pretty straightforward, but if you are using EIDE or SATA, things get a bit more interesting. What follows is Windows-specific, but you need to be aware of these issues on non-Windows systems as well.

If it’s a SCSI disk (not SATA or EIDE), then setting FILE_FLAG_WRITE_THROUGH and FILE_FLAG_NO_BUFFERING is sufficient. FILE_FLAG_WRITE_THROUGH forces all data written to the file to be written through the cache directly to disk, so all writes go to the media. FILE_FLAG_NO_BUFFERING ensures that all reads come directly from the media as well, by preventing any read-ahead and disk caching. What’s happening behind the scenes when these flags are specified on CreateFile() is that the filesystem and memory manager do not cache, and Force Unit Access (FUA) is sent to the device on writes to ensure they go directly to the media rather than sitting in the device cache.
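To make the flag combination concrete, here is a minimal sketch in Python using ctypes to call CreateFileW. The flag values are the ones defined in the Windows SDK headers; the helper names (write_through_flags, round_up_to_sector, open_write_through) and the 4096-byte default sector size are assumptions made for illustration, and real code should query the volume's sector size rather than hard-code it. The rounding helper is there because FILE_FLAG_NO_BUFFERING requires transfer sizes and file offsets to be multiples of the volume sector size.

```python
import ctypes
import sys

# Win32 constants as defined in the Windows SDK headers.
GENERIC_WRITE = 0x40000000
OPEN_ALWAYS = 4
FILE_FLAG_WRITE_THROUGH = 0x80000000
FILE_FLAG_NO_BUFFERING = 0x20000000
INVALID_HANDLE_VALUE = ctypes.c_void_p(-1).value


def write_through_flags():
    """Flags asking for uncached, write-through I/O on a file handle."""
    return FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING


def round_up_to_sector(nbytes, sector_size=4096):
    """Round a transfer size up to a whole number of sectors.

    The 4096 default is an assumption; query the real volume in production.
    """
    return -(-nbytes // sector_size) * sector_size


def open_write_through(path):
    """Open (or create) `path` for writing with both flags set. Windows only."""
    if sys.platform != "win32":
        raise OSError("CreateFileW is only available on Windows")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.CreateFileW.restype = ctypes.c_void_p
    kernel32.CreateFileW.argtypes = [
        ctypes.c_wchar_p,  # lpFileName
        ctypes.c_uint32,   # dwDesiredAccess
        ctypes.c_uint32,   # dwShareMode
        ctypes.c_void_p,   # lpSecurityAttributes
        ctypes.c_uint32,   # dwCreationDisposition
        ctypes.c_uint32,   # dwFlagsAndAttributes
        ctypes.c_void_p,   # hTemplateFile
    ]
    handle = kernel32.CreateFileW(
        path, GENERIC_WRITE, 0, None, OPEN_ALWAYS, write_through_flags(), None
    )
    if handle == INVALID_HANDLE_VALUE:
        raise ctypes.WinError(ctypes.get_last_error())
    return handle
```

Writes through a handle opened this way also need sector-aligned buffers, which this sketch does not show.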

The reason the above is not typically sufficient with EIDE and SATA drives is that FUA is dropped by the standard SATA and EIDE miniport driver. The filesystem and memory manager will respect the parameters but the device will likely still cache writes without FUA.

FUA is dropped for performance reasons, since SATA and EIDE can only process one command at a time and the full flush required by FUA is slow. SCSI can process multiple commands in parallel, and the flush is less expensive. Is Native Command Queuing (NCQ) the solution to the performance problem? Unfortunately, no. NCQ allows multiple commands to be sent to the drive and gives the drive flexibility in what order to execute them, but the restriction of only one command executing at a time remains.

What’s the solution for getting reliable writes when you are using commodity disks and need guaranteed writes? The simple answer is to set the registry flag that turns off the discarding of FUA. This solves the correctness problem, but at considerable performance expense. Essentially, this will be semantically correct but slow, due to the SATA single-command limitation and the length of time it takes to go directly to the media. Shutting off Write Cache Enable (WCE) on a per-drive basis is another option.

Another option is FlushFileBuffers(), a call that is fully honored by all device types. FlushFileBuffers() takes a file handle argument, flushes the filesystem/memory manager cache for that handle, and flushes the entire volume that holds that file. This again works, but it is broader than required in that the entire device cache will get flushed. I’m told that you can also use FLUSH_CACHE on the device as an alternative to FlushFileBuffers() on a handle. A paper that shows the use of FLUSH_CACHE to achieve correct write-ahead logging semantics is up at: Enforcing Database Recoverability on Disks that Lack Write-Through. In this paper, using SQL Server running a mini-TPC-C as a test case, the authors measure performance degradation of as little as 2% using FLUSH_CACHE calls to the device as needed. A small price to pay for correctness.
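The flush-before-acknowledge pattern can be sketched in a few lines of Python. As far as I know, os.fsync() on Windows goes through the C runtime's _commit(), which in turn calls FlushFileBuffers(), so the same code illustrates the idea on both Windows and other systems. The record layout (a length and CRC32 header) and the function names are hypothetical, chosen only to make the example self-contained; this is not the scheme used in the paper.

```python
import os
import struct
import zlib


def open_log(path):
    """Open a log file for appending; O_BINARY only exists on Windows."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | getattr(os, "O_BINARY", 0)
    return os.open(path, flags, 0o644)


def append_log_record(fd, payload):
    """Append one record and flush it before reporting success.

    Returns the number of bytes appended (8-byte header plus payload).
    """
    header = struct.pack("<II", len(payload), zlib.crc32(payload))
    record = header + payload
    written = 0
    while written < len(record):
        written += os.write(fd, record[written:])
    # Flush OS caches and ask the device to flush; note this is broader
    # than one record, since the whole device cache may be flushed.
    os.fsync(fd)
    return len(record)


def read_log_records(path):
    """Read records back, stopping at the first torn or corrupt one."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            length, crc = struct.unpack("<II", header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                break
            records.append(payload)
    return records
```

A caller would acknowledge a transaction commit only after append_log_record() returns. Batching several records before each flush is the usual way to amortize the cost.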

–jrh

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh | Msft internal blog: msblogs/JamesRH
