As part of investigating the durability provided by cloud systems, I wanted to make sure I understood the basics. I started by reading the NVMe specification, to understand the guarantees provided by disks. The summary is that you should assume your data is corrupt from the time a write is issued until after a flush or Force Unit Access (FUA) write completes. However, most programs use system calls to write data. This article looks at the guarantees provided by the Linux file APIs. It seems like this should be simple: a program calls write(), and after it completes, the data is durable. However, write() only copies data from the application into the kernel's cache in memory. To force the data to be durable you need to use some additional mechanism. This article is a messy collection of notes about what I've learned. (The really brief summary: use fdatasync or open with O_DSYNC.) For a better and clearer overview, see LWN's Ensuring data reaches disk, which walks from application code through to the disk.
The write system call is defined in the IEEE POSIX standard as attempting to write data to a file descriptor. After it successfully returns, reads are required to return the bytes that were written, even when read or written by other processes or threads (POSIX standard write(); Rationale). There is an additional note under Thread Interactions with Regular File Operations that says "If two threads each call one of these functions, each call shall either see all of the specified effects of the other call, or none of them." This suggests that all file I/O must effectively hold a lock.
Does that mean write is atomic? Technically, yes: future reads must return the entire contents of the write, or none of it. However, write is not required to be complete, and is allowed to transfer only part of the data. For example, we could have two threads, each appending 1024 bytes to a single file descriptor. It would be acceptable for each of the two writes to transfer only a single byte. This is still "atomic", but also results in undesirable interleaved output. There is a great StackOverflow answer with more details.
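To make the partial-write behavior concrete, here is a minimal sketch of the usual workaround: loop until the whole buffer has been transferred. (The helper name write_all is my own; it is not a standard API.)

```c
#include <errno.h>
#include <unistd.h>

/* Loop until the whole buffer is written, since write() may transfer
 * fewer bytes than requested. */
static ssize_t write_all(int fd, const void *buf, size_t count) {
    const char *p = buf;
    size_t written = 0;
    while (written < count) {
        ssize_t n = write(fd, p + written, count - written);
        if (n < 0) {
            if (errno == EINTR) continue;  /* interrupted; retry */
            return -1;                     /* real error; errno is set */
        }
        written += (size_t)n;
    }
    return (ssize_t)written;
}
```

Note that even with this loop, two threads appending through the same routine can still interleave their output: the atomicity guarantee applies to each individual write() call, not to the loop as a whole.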
The most straightforward way to get your data on disk is to call fsync(). It requests the operating system to transfer all modified blocks in cache to disk, along with all file metadata (e.g., access time, modification time). In my opinion, that metadata is rarely useful, so you should use fdatasync unless you know you need the metadata. The fdatasync man page says it is required to flush as much metadata as necessary "for a subsequent data read to be handled correctly", which is what most applications care about.
One issue is this is not guaranteed to ensure you can find the file again. In particular, when you first create a file, you need to call fsync on the directory that contains it, otherwise it might not exist after a failure. The reason is basically that in UNIX, a file can exist in multiple directories due to hard links, so when you call fsync on a file, there is no way to tell which directories should be written out (more details). It appears that ext4 may actually fsync the directory automatically, but that might not be true for other filesystems.
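Putting the last two paragraphs together, a minimal sketch of durably creating a file might look like the following. (The paths dir and dir/myfile are made up for illustration, and error handling is abbreviated.)

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    const char data[] = "hello durable world\n";

    /* Write the data and flush it with fdatasync. */
    int fd = open("dir/myfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open file"); return 1; }
    if (write(fd, data, sizeof(data) - 1) != (ssize_t)(sizeof(data) - 1)) {
        perror("write"); return 1;  /* a real program should loop on short writes */
    }
    if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }
    close(fd);

    /* fsync the containing directory so the new directory entry is durable too. */
    int dirfd = open("dir", O_RDONLY | O_DIRECTORY);
    if (dirfd < 0) { perror("open dir"); return 1; }
    if (fsync(dirfd) != 0) { perror("fsync dir"); return 1; }
    close(dirfd);
    return 0;
}
```

The final fsync on the directory is what persists the directory entry itself; without it, the data blocks may survive a crash while the file name does not.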
The way this is implemented varies depending on the file system. I used blktrace to examine what disk operations ext4 and xfs use. They both issue normal disk writes for the file data and the file system journal, followed by a cache flush, then finish with a FUA write to the journal, probably to indicate the operation has committed. On disks that do not support FUA, this involves two cache flushes. My experiments show that fdatasync is slightly faster than fsync, and blktrace shows fdatasync tends to write a bit less data (ext4: 20 KiB for fsync vs 16 KiB for fdatasync). My experiments also show that xfs is slightly faster than ext4, and again blktrace shows it tends to flush less data (xfs: 4 KiB for fdatasync).
In my professional career, I can remember three fsync-related controversies. The first, in 2008, was that Firefox 3's UI would hang when lots of files were being written. The problem was that the UI used the SQLite database to save state, which provides strong durability guarantees by calling fsync after each commit. On the ext3 filesystem of the time, fsync wrote out all dirty pages on the system, rather than just the relevant file. This meant that clicking a button in Firefox could wait for many megabytes of data to be written to magnetic disks, which could take seconds. The solution, as I understand it from a blog post, was to move many database commits to asynchronous background tasks. This means Firefox was previously using stronger durability guarantees than it needed, although the problem was made much worse by the ext3 filesystem.
The second controversy, in 2009, was that after a system crash, users of the new ext4 filesystem found many recently created files would have zero length, which did not happen with the older ext3 filesystem. In the previous controversy, ext3 was flushing too much data, which caused really slow fsync calls. To fix it, ext4 flushes only the relevant dirty pages to disk. For other files, it keeps them in memory for much longer to improve performance (defaulting to 30 seconds, configured with dirty_expire_centisecs; note). This means that after a crash, lots of data might be missing. The solution is to add fsyncs to applications that want to ensure data will survive crashes, since fsyncs are much more efficient with ext4. The downside is this still makes operations like installing software slower. For more details, see LWN's article, or Ted Ts'o's explanation.
The third controversy, in 2018, was that Postgres discovered that when fsync encounters an error, it can mark dirty pages as "clean", so future calls to fsync do nothing. This leaves modified pages in memory that are never written to disk. This is pretty catastrophic, since the application thinks some data has been written, but it has not. There is very little an application can do in the rare case when fsync fails. Postgres and many other applications now crash when it happens. A paper titled Can Applications Recover from fsync Failures?, published at USENIX ATC 2020, investigates the issue in detail. The best solution at the moment is to use Direct I/O with O_SYNC or O_DSYNC, which will report errors on specific write operations, but requires the application to manage buffers itself. For more details, see the LWN article or the Postgres wiki page about fsync errors.
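Given those findings, the conservative policy is to treat any fsync/fdatasync failure as fatal rather than retrying. A sketch of that idea (not Postgres's actual code; the helper name is my own) might look like:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Treat a failed fdatasync as unrecoverable: the kernel may have marked
 * the dirty pages clean, so a retry can falsely succeed while the data
 * was never written. Abort and recover from a log on restart instead. */
static void checked_fdatasync(int fd) {
    if (fdatasync(fd) != 0) {
        perror("fdatasync failed; data may be lost");
        abort();  /* do not retry */
    }
}
```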
Back to system calls for durability. Another option is to use the O_SYNC or O_DSYNC options with the open() system call. This causes every write to have the same semantics as a write followed by fsync/fdatasync, respectively. The POSIX specification calls this Synchronized I/O File Integrity Completion and Data Integrity Completion. The main advantage of this approach is that you need a single system call, instead of write followed by fdatasync. The biggest disadvantage is that all writes using that file descriptor will be synchronized, which may limit how the application code is structured.
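As a sketch, opening a file with O_DSYNC looks like this (the file name myfile is illustrative):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* O_DSYNC: each write() returns only after the data (and the metadata
     * needed to read it back) reaches stable storage, as if the write were
     * followed by fdatasync(). Use O_SYNC for the fsync-equivalent. */
    int fd = open("myfile", O_WRONLY | O_CREAT | O_DSYNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char data[] = "synchronized write\n";
    if (write(fd, data, sizeof(data) - 1) < 0) { perror("write"); return 1; }
    /* No separate fdatasync call is needed here. */

    close(fd);
    return 0;
}
```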
The open() system call has an O_DIRECT option which is intended to bypass the operating system's cache, and instead do I/O directly with the disk. This means in many cases, an application's write call will translate directly into a disk command. However, in general this is not a replacement for fsync or fdatasync, since the disk itself is free to delay or cache those writes. Even worse, there are edge cases that mean O_DIRECT I/O falls back to traditional buffered I/O. The easiest solution is to also use the O_DSYNC option to open, which means each write is effectively followed by fdatasync.
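Here is a sketch of combining O_DIRECT with O_DSYNC. Note that O_DIRECT is a Linux extension (it requires _GNU_SOURCE with glibc) and typically requires the buffer, file offset, and transfer size to be aligned to the device's logical block size; 4096 bytes is a common safe choice, though the exact requirement varies by kernel and file system.

```c
#define _GNU_SOURCE  /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* O_DIRECT bypasses the page cache; O_DSYNC still forces the disk
     * itself to make each write durable. */
    int fd = open("myfile", O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT needs an aligned buffer; 4096 is assumed here. */
    const size_t align = 4096;
    void *buf;
    if (posix_memalign(&buf, align, align) != 0) { return 1; }
    memset(buf, 'x', align);

    /* Write one aligned block; unaligned I/O would fail with EINVAL. */
    if (write(fd, buf, align) < 0) { perror("write"); return 1; }

    free(buf);
    close(fd);
    return 0;
}
```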
It turns out that XFS somewhat recently added a "fast path" for O_DIRECT|O_DSYNC writes. If you are overwriting blocks with O_DIRECT|O_DSYNC, XFS will issue a FUA write if the device supports it, rather than using a cache flush. I used blktrace to confirm this happens on my Linux 5.4/Ubuntu 20.04 system. This should be more efficient, since it writes the minimum amount of data to disk, and uses a single operation instead of a write followed by a cache flush. I found a link to the kernel patch that implemented this in 2018, which has some discussion about implementing this optimization for other filesystems, but as far as I know XFS is the only one that does.
Linux also has sync_file_range, which allows flushing part of a file to disk, rather than the entire file, and triggering an asynchronous flush rather than waiting for it. However, the man page states that it is "extremely dangerous" and discourages its use. The best description of some of the differences and dangers of sync_file_range is Yoshinori Matsunobu's post about how it works. Notably, it seems that RocksDB uses this to control when the kernel flushes dirty data to disk, while still using fdatasync to ensure durability. It has some interesting comments in its source code. For example, it appears that with zfs, the sync_file_range call does not actually flush data. Given my experience that code that is rarely used probably has bugs, I would recommend avoiding this system call unless you have a very good reason.
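For completeness, here is a sketch of the pattern described above, loosely modeled on my reading of what RocksDB does: start asynchronous writeback with sync_file_range to smooth out dirty-page buildup, and still call fdatasync at commit time for actual durability. The file name and range are illustrative.

```c
#define _GNU_SOURCE  /* for sync_file_range */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("myfile", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* ... many buffered writes happen here ... */

    /* Kick off writeback of the first 1 MiB without waiting for it. */
    if (sync_file_range(fd, 0, 1 << 20, SYNC_FILE_RANGE_WRITE) != 0)
        perror("sync_file_range");  /* advisory; not a durability guarantee */

    /* Durability still requires a real sync at commit time. */
    if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

    close(fd);
    return 0;
}
```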
My conclusion is there are basically three approaches for durable I/O. All of them require you to call fsync() on the containing directory when you first create a file.
I have not measured these carefully, and many of these differences are very small, which means they could be wrong, or could easily change. These are roughly ordered from largest to smallest effect.