An interesting file system study is at this year’s USENIX Annual Technical Conference. The paper Measurement and Analysis of Large-Scale Network File System Workloads looks at CIFS remote file system access patterns from two populations. The first a large file store of 19TB serving 500 software developers and the second a medium sized file store of 3TB used by 1,000 marketing, sales, and finance users.
The authors found that file access patterns have changed since previous studies and offer 10 observations:
· Both workloads are more write-heavy than workloads studied previously
· Read-write [rather than pure read or pure write] access patterns are much frequent compared to past studies
· Bytes are transferred in much longer sequential runs than in previous studies [the lengths of sequential runs is increasing but note that the percentage of random access is increasing]
· Bytes are transferred from much larger files than previous studies [files are getting bigger]
· Files live an order of magnitude longer than in previous studies
· Most files are not re-opened once they are closed
· If a file is re-opened, it is temporally related to the previous close
· A small fraction of the clients account for a large fraction of the activity
· Files are infrequently accessed by more than one client
· Files sharing is rarely concurrent and mostly read-only
· Most file types do not have a single pattern of access
The comments in brackets above are mine. Some of the important points that spring out for me: the percentage of random access is increasing; for those accesses that are sequential, the runs are longer; file sizes are increasing, data is getting colder; file lifetimes are increasing; and client usage has very high skew.
Overall, file data has been getting colder and the write to read ratio has been increasing. The authors conclude that substantial increases in the client file caches are unlikely to help significantly based upon this data. But, since file metadata requests make up roughly 50% of all operations, larger metadata caches could be very beneficial. Log Structured File systems look increasingly like the write answer. Increasingly random access patterns make NAND flash an interesting approach. The authors didn’t directly mention it but log structured block stores (below the filesystem) is also interesting in that, like LFS, it’s a write optimized organization. And, in addition, a log structured block store tends to sequentialize writes while randomizing reads which is ideal for NAND Flash.
Thanks to Vlad Sadovsky for sending this paper my way.
James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | JamesRH@microsoft.com
H:mvdirona.com | W:research.microsoft.com/~jamesrh | blog:http://perspectives.mvdirona.com