I just spent a couple of days trying to understand the performance of a small test program that constantly overwrites a file, treating it like a circular buffer. The key is understanding how Linux caches dirty pages in memory and writes them back. The brief summary is that Linux attempts to write dirty pages out in the background, without blocking the process doing the writing. However, once there are enough dirty pages, the writing process itself is blocked and forced to write pages out to disk. The details are described elsewhere, but I will briefly summarize what I learned here. This may not be completely correct, because I didn't look at the code in any detail.
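For concreteness, here is a minimal sketch of the kind of test program I mean (not my actual benchmark; the file name, block size, and file size are arbitrary placeholders). It just keeps overwriting the same fixed-size file from the beginning, so dirty pages accumulate as fast as the process can write:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define FILE_SIZE (256 * 1024 * 1024)  /* 256 MiB "circular buffer" (arbitrary) */
#define BLOCK_SIZE (64 * 1024)         /* write 64 kiB at a time (arbitrary) */

int main(void) {
    char *block = malloc(BLOCK_SIZE);
    if (block == NULL) { perror("malloc"); return 1; }
    memset(block, 'x', BLOCK_SIZE);

    int fd = open("testfile", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    off_t offset = 0;
    for (;;) {  /* run until killed */
        ssize_t n = pwrite(fd, block, BLOCK_SIZE, offset);
        if (n < 0) { perror("pwrite"); return 1; }
        offset += n;
        if (offset >= FILE_SIZE) {
            offset = 0;  /* wrap around: keep overwriting the same file */
        }
    }
    /* not reached */
}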
Note: all the variables mentioned below are files in /proc/sys/vm and can be adjusted by writing to them. As far as I can tell, the policy works like this:

- If the amount of dirty data is below dirty_background_ratio (default: 10% on my system), dirty pages stay in memory until they are older than dirty_expire_centisecs (default: 30 seconds). The pdflush kernel process wakes up every dirty_writeback_centisecs to flush these expired pages out.
- If the amount of dirty data exceeds dirty_background_ratio, the kernel proactively wakes pdflush to start writing data out in the background.
- If the amount of dirty data exceeds dirty_ratio (default: 20% on my system), then the writing process itself will synchronously write pages out to disk. This puts the process in "uninterruptible sleep" (indicated by a D in top). The CPU will be shown in the "iowait" state. This is actually idle time: if there were processes that needed CPU, they would be scheduled to run.

These percentages are measured against the memory that is allowed to hold dirty pages (roughly free memory plus the page cache; see /proc/meminfo). On a 32-bit system, the "high memory" region is excluded if vm_highmem_is_dirtyable is 0 (default). The kernel function that implements this policy is balance_dirty_pages().
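A quick way to watch this in action is to read the thresholds and the kernel's current dirty-page counters straight out of /proc. Here is a small sketch that prints the tunables above along with the Dirty and Writeback lines from /proc/meminfo (the /proc paths are real; the program itself is just for illustration):

#include <stdio.h>
#include <string.h>

/* Print one /proc/sys/vm tunable. */
static void print_tunable(const char *name) {
    char path[256];
    snprintf(path, sizeof(path), "/proc/sys/vm/%s", name);
    FILE *f = fopen(path, "r");
    if (f == NULL) { perror(path); return; }
    char value[64] = "";
    if (fgets(value, sizeof(value), f) != NULL) {
        printf("%s = %s", name, value);
    }
    fclose(f);
}

int main(void) {
    print_tunable("dirty_background_ratio");
    print_tunable("dirty_ratio");
    print_tunable("dirty_expire_centisecs");
    print_tunable("dirty_writeback_centisecs");

    /* Show how much dirty data the kernel is currently tracking. */
    FILE *f = fopen("/proc/meminfo", "r");
    if (f == NULL) { perror("/proc/meminfo"); return 1; }
    char line[256];
    while (fgets(line, sizeof(line), f) != NULL) {
        if (strncmp(line, "Dirty:", 6) == 0 ||
            strncmp(line, "Writeback:", 10) == 0) {
            fputs(line, stdout);
        }
    }
    fclose(f);
    return 0;
}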
With the ext3 file system, the journal commit interval (default: 5 seconds) also causes dirty pages to be written out, by kjournald. Passing the commit=(seconds) option to mount adjusts this interval. This does not appear to be a problem with ext4.
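If those periodic commits matter for your workload, the interval can be changed when mounting or remounting the file system. Below is a sketch of doing the remount with the mount(2) system call; the mount point /mnt/data and the 60-second interval are placeholders, and the shell equivalent would be mount -o remount,commit=60 /mnt/data:

#include <stdio.h>
#include <sys/mount.h>

int main(void) {
    /* Stretch the ext3 journal commit interval to 60 seconds on an
     * already-mounted file system. Requires root. The source and
     * filesystemtype arguments are ignored for MS_REMOUNT. */
    if (mount(NULL, "/mnt/data", NULL, MS_REMOUNT, "commit=60") != 0) {
        perror("mount");
        return 1;
    }
    return 0;
}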
Writes to a single file are serialized by a mutex on its inode (i_mutex). This means at most one thread can be in the write() or fsync() system calls for a given file at a time. I suspect this means there is limited utility to having more than one writer thread per file, but it is possible the lock gets dropped somewhere that I didn't find. See this article about fsync on a different thread being useless; a commenter there shows the kernel source code where this lock is acquired. It appears that using O_DIRECT IO to a block device directly (e.g. a raw disk partition) avoids this limit.
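To illustrate that last point, here is a sketch of opening a block device with O_DIRECT and issuing an aligned write. The device path /dev/sdb1 is a placeholder (writing to a real partition destroys its contents), and 4096 bytes is assumed to be an acceptable alignment for the device:

#define _GNU_SOURCE  /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGNMENT 4096  /* O_DIRECT needs aligned buffers, offsets, and sizes */

int main(void) {
    /* WARNING: placeholder device; opening it for writing requires root
     * and overwrites whatever is stored there. */
    int fd = open("/dev/sdb1", O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, ALIGNMENT, ALIGNMENT) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 'x', ALIGNMENT);

    /* This write bypasses the page cache, so it is not subject to the
     * dirty-page thresholds described above. */
    ssize_t n = pwrite(fd, buf, ALIGNMENT, 0);
    if (n < 0) { perror("pwrite"); return 1; }

    close(fd);
    free(buf);
    return 0;
}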