I just spent a couple of days trying to understand the performance of a small test program that constantly overwrites a file, treating it like a circular buffer. The key is understanding how Linux caches dirty pages in memory and writes them back. The brief summary is that Linux attempts to write dirty pages out in the background, without blocking the process doing the writing. However, once there are enough dirty pages, the writing process itself is blocked and forced to write pages out to disk. The details are described elsewhere, but I will briefly summarize what I learned here. This may not be completely correct, because I didn't look at the code in any detail.
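For concreteness, here is a minimal sketch of the kind of test program I mean (not my actual benchmark; the file name, block size, and file size are arbitrary placeholders). It just keeps overwriting the same fixed-size file from the beginning, so dirty pages accumulate as fast as the process can write:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define FILE_SIZE (256 * 1024 * 1024)  /* 256 MiB "circular buffer" (arbitrary) */
#define BLOCK_SIZE (64 * 1024)         /* write 64 kiB at a time (arbitrary) */

int main(void) {
    char *block = malloc(BLOCK_SIZE);
    if (block == NULL) { perror("malloc"); return 1; }
    memset(block, 'x', BLOCK_SIZE);

    int fd = open("testfile", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    off_t offset = 0;
    for (;;) {  /* run until killed */
        ssize_t n = pwrite(fd, block, BLOCK_SIZE, offset);
        if (n < 0) { perror("pwrite"); return 1; }
        offset += n;
        if (offset >= FILE_SIZE) {
            offset = 0;  /* wrap around: keep overwriting the same file */
        }
    }
    /* not reached */
}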
Note: all the variables mentioned below are files in /proc/sys/vm and can be adjusted by writing to them. As far as I can tell, the policy works like this:

- If the amount of dirty data is below dirty_background_ratio (default: 10% on my system), dirty pages stay in memory until they are older than dirty_expire_centisecs (default: 30 seconds). The pdflush kernel process wakes up every dirty_writeback_centisecs to flush these expired pages out.
- If the amount of dirty data exceeds dirty_background_ratio, the kernel proactively wakes pdflush to start writing data out in the background.
- If the amount of dirty data exceeds dirty_ratio (default: 20% on my system), then the writing process itself will synchronously write pages out to disk. This puts the process in "uninterruptible sleep" (indicated by a D in top). The CPU will be shown in the "iowait" state. This is actually idle time: if there were processes that needed CPU, they would be scheduled to run.

These percentages are measured against the memory that is allowed to hold dirty pages (roughly free memory plus the page cache; see /proc/meminfo). On a 32-bit system, the "high memory" region is excluded if vm_highmem_is_dirtyable is 0 (default). The kernel function that implements this policy is balance_dirty_pages().
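A quick way to watch this in action is to read the thresholds and the kernel's current dirty-page counters straight out of /proc. Here is a small sketch that prints the tunables above along with the Dirty and Writeback lines from /proc/meminfo (the /proc paths are real; the program itself is just for illustration):

#include <stdio.h>
#include <string.h>

/* Print one /proc/sys/vm tunable. */
static void print_tunable(const char *name) {
    char path[256];
    snprintf(path, sizeof(path), "/proc/sys/vm/%s", name);
    FILE *f = fopen(path, "r");
    if (f == NULL) { perror(path); return; }
    char value[64] = "";
    if (fgets(value, sizeof(value), f) != NULL) {
        printf("%s = %s", name, value);
    }
    fclose(f);
}

int main(void) {
    print_tunable("dirty_background_ratio");
    print_tunable("dirty_ratio");
    print_tunable("dirty_expire_centisecs");
    print_tunable("dirty_writeback_centisecs");

    /* Show how much dirty data the kernel is currently tracking. */
    FILE *f = fopen("/proc/meminfo", "r");
    if (f == NULL) { perror("/proc/meminfo"); return 1; }
    char line[256];
    while (fgets(line, sizeof(line), f) != NULL) {
        if (strncmp(line, "Dirty:", 6) == 0 ||
            strncmp(line, "Writeback:", 10) == 0) {
            fputs(line, stdout);
        }
    }
    fclose(f);
    return 0;
}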
With the ext3 file system, the journal commit interval (default: 5 seconds) also causes dirty pages to be written out, by kjournald. Passing the commit=(seconds) option to mount adjusts this interval. This does not appear to be a problem with ext4.
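If those periodic commits matter for your workload, the interval can be changed when mounting or remounting the file system. Below is a sketch of doing the remount with the mount(2) system call; the mount point /mnt/data and the 60-second interval are placeholders, and the shell equivalent would be mount -o remount,commit=60 /mnt/data:

#include <stdio.h>
#include <sys/mount.h>

int main(void) {
    /* Stretch the ext3 journal commit interval to 60 seconds on an
     * already-mounted file system. Requires root. The source and
     * filesystemtype arguments are ignored for MS_REMOUNT. */
    if (mount(NULL, "/mnt/data", NULL, MS_REMOUNT, "commit=60") != 0) {
        perror("mount");
        return 1;
    }
    return 0;
}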
Writes to a single file are serialized by a mutex on its inode (i_mutex). This means at most one thread can be in the write() or fsync() system calls for a given file at a time. I suspect this means there is limited utility to having more than one writer thread per file, but it is possible the lock gets dropped somewhere that I didn't find. See this article about fsync on a different thread being useless; a commenter there shows the kernel source code where this lock is acquired. It appears that using O_DIRECT IO to a block device directly (e.g. a raw disk partition) avoids this limit.
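To illustrate that last point, here is a sketch of opening a block device with O_DIRECT and issuing an aligned write. The device path /dev/sdb1 is a placeholder (writing to a real partition destroys its contents), and 4096 bytes is assumed to be an acceptable alignment for the device:

#define _GNU_SOURCE  /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGNMENT 4096  /* O_DIRECT needs aligned buffers, offsets, and sizes */

int main(void) {
    /* WARNING: placeholder device; opening it for writing requires root
     * and overwrites whatever is stored there. */
    int fd = open("/dev/sdb1", O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, ALIGNMENT, ALIGNMENT) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 'x', ALIGNMENT);

    /* This write bypasses the page cache, so it is not subject to the
     * dirty-page thresholds described above. */
    ssize_t n = pwrite(fd, buf, ALIGNMENT, 0);
    if (n < 0) { perror("pwrite"); return 1; }

    close(fd);
    free(buf);
    return 0;
}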