When a computer tells you your data is saved, you expect it to be there, even if the power fails. It turns out this doesn't always happen because there are all sorts of caches, queues and buffers between your application and the physical platters. However, when a database tells you your data is saved, you (usually) want a strong guarantee that it is going to survive a failure (durability). Since this is hard and I've spent a fair bit of time figuring this stuff out, here are my notes on how to make writes durable on Linux.
The first step is to ensure that the application is telling the operating system to really write the data, and not just to write it to some cache somewhere. There are many ways to do this on Linux:
fsync
: When this returns, Linux has ensured that any data that has been written to the file descriptor is on disk, as is the metadata. However, according to the man page, there is no guarantee that the directory entry is on disk. This means you can lose data when creating a file. Consider creating a file called "bankaccount.txt" with your precious bank transactions, then calling fsync. After it returns, the power fails. Linux has guaranteed that the bits in "bankaccount.txt" are in fact on disk, but the directory entry that lets you find that file may not have been written, so your data is lost. Thus, to be extra paranoid, after creating a file you should call fsync on the directory as well.
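As a minimal sketch of that pattern, durably creating a small file and then syncing its containing directory might look like the following; the file name is made up and error handling is abbreviated:

    /* Sketch: durably create "bankaccount.txt" in the current directory.
       Error handling is abbreviated; a real program should check every call. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int save_durably(const char *data) {
        int fd = open("bankaccount.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        if (write(fd, data, strlen(data)) < 0) { close(fd); return -1; }
        if (fsync(fd) != 0) { close(fd); return -1; }  /* data + metadata on disk */
        close(fd);

        /* The new directory entry may still only be in memory: fsync the
           containing directory as well. */
        int dir = open(".", O_RDONLY);
        if (dir < 0) return -1;
        int result = fsync(dir);
        close(dir);
        return result;
    }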
fdatasync
: This makes similar guarantees to fsync, except that the file's metadata may not be updated, unless the metadata is needed to access the data. This basically means it won't update the last modified time, but will make sure that all the blocks of the file can be found. Since most applications don't care about this metadata, this should be the same as fsync, but with better performance in some cases.
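For example, a rough sketch of appending a record to a log and making it durable (the helper name is made up):

    /* Sketch: append a record and make it durable with fdatasync. If the
       append grows the file, the new size is metadata needed to read the
       data back, so fdatasync still writes it. */
    #include <string.h>
    #include <unistd.h>

    int append_record(int fd, const char *record) {
        if (write(fd, record, strlen(record)) < 0) return -1;
        return fdatasync(fd);  /* like fsync, minus e.g. the mtime update */
    }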
msync(..., MS_SYNC)
: If you are using a memory mapped file, you can call msync with the MS_SYNC flag. This effectively calls fsync on the file (you can see the call in the kernel source; mm/msync.c). However, I wouldn't recommend using memory mapped IO for this. What happens if there is a disk error because a disk fails? Your process gets some sort of signal, probably SIGBUS. You might be able to install a signal handler to handle this correctly, but it seems difficult. If you are using Java, the JVM crashes if there is an IO error on a MappedByteBuffer.
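If you go the memory mapped route anyway, a minimal sketch looks like this; the 4096-byte mapping size is arbitrary and the file is assumed to already be at least that long:

    /* Sketch: modify a memory mapped file and flush it with msync(MS_SYNC).
       Assumes the file exists and is at least 4096 bytes long. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int update_mapped(const char *path) {
        int fd = open(path, O_RDWR);
        if (fd < 0) return -1;
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { close(fd); return -1; }

        memcpy(p, "hello", 5);
        /* MS_SYNC blocks until the dirty pages in the range are written out. */
        int result = msync(p, 4096, MS_SYNC);

        munmap(p, 4096);
        close(fd);
        return result;
    }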
open(..., ...|O_DSYNC, ...)
: Opening the file with the O_DSYNC flag means that effectively every write operation is followed by fdatasync. This might save you a system call, but in my experience it performs about the same.
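A sketch of what that looks like (the file name and permissions are placeholders):

    /* Sketch: with O_DSYNC, write() does not return until the data has been
       transferred to the device, as if each write were followed by fdatasync. */
    #define _POSIX_C_SOURCE 200809L
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int write_dsync(const char *path, const char *data) {
        int fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0644);
        if (fd < 0) return -1;
        ssize_t n = write(fd, data, strlen(data));  /* no separate sync call */
        close(fd);
        return n < 0 ? -1 : 0;
    }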
sync_file_range
: In theory, this should work as well, but for the life of me I can't understand the man page well enough to figure out how it could be better than fdatasync.

As a side note, I've found that I get better performance on ext3 and ext4 if I pre-fill the entire file and then overwrite it. I suspect this avoids extra block allocation. This is even true on ext4 when calling fallocate, which supposedly pre-allocates space on disk. I suspect this will not be true on filesystems like btrfs or ZFS, which are copy-on-write anyway.
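If you want to try this, a rough sketch of pre-filling a file with zeros follows; the 4 KiB block size is arbitrary:

    /* Sketch: pre-fill a file with zeros so later overwrites hit blocks that
       are already allocated. Block size and error handling are simplified. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int prefill(const char *path, size_t size) {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;

        char block[4096];
        memset(block, 0, sizeof(block));
        for (size_t written = 0; written < size; written += sizeof(block)) {
            if (write(fd, block, sizeof(block)) < 0) { close(fd); return -1; }
        }

        /* Make the data and the new file size durable before relying on it. */
        int result = fsync(fd);
        close(fd);
        return result;
    }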
For durable writes to actually be durable, the operating system must use a write barrier to inform the disk that the data really should be flushed out of its cache and written to disk. For ext3, you must mount the file system with the barrier=1 option to enable write barriers. For ext4, this is enabled by default, so you must make sure you do not specify barrier=0.
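For example, an /etc/fstab entry for an ext3 data partition with barriers explicitly enabled might look like this; the device and mount point are placeholders:

    # hypothetical /etc/fstab entry: ext3 with write barriers enabled
    /dev/sdb1  /data  ext3  defaults,barrier=1  0  2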
Finally, you must make sure that your disks are working correctly. I've tested about 4 "modern" magnetic disks, both SATA and SAS, and I've found that they all work correctly, even with their write caches enabled. However, I've found that cheap flash disks aren't crash safe, and you'll need to buy more expensive "enterprise" class disks. If you have a RAID controller, you must ensure that it is also configured correctly. You'll need to read the manual: there are too many options for me to describe them here, but typically you'll want to read about things like the battery backup refresh cycle, and whether write caches are automatically disabled on the disks attached to the controller.
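If you don't trust a SATA disk's write cache, one thing you can do is query or disable it with hdparm; the device name below is a placeholder, and disabling the cache will cost write performance:

    # query the drive's write cache setting (placeholder device name)
    hdparm -W /dev/sda
    # disable the write cache if you don't trust the drive to flush it
    hdparm -W0 /dev/sda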