I have been testing one of Intel's "consumer" SSDs (X25-M G2) to see if it stores data durably, meaning that if the disk claims the data has been written, it actually survives a power failure. This is important because you want your airline ticket to stay purchased after you buy it, even if the system crashes. The conclusion is that this SSD can lose data in power failures, even with the write cache disabled. This means that if you are using it for a database, committed transactions could be lost (meaning lost airline tickets, forgotten bank deposits, etc.). I believe this is a "bug" in this device, as this is not how disks are supposed to work. The good news is that I was able to test Intel's "enterprise" SSD (X25-E), and it seems to work as expected. Unfortunately, other SSDs have similar problems. In this article, I'll describe the bug, the real-world impact, and how I tested it. I'm still contacting some experts to see if this observation agrees with what they have found, so I'll update this article if I discover new information.
Intel's X25-M G2 disk loses data approximately 25% of the time when the disk loses power, even when the disk is instructed to flush its cache (write barriers on the ext4 file system), or with the write cache disabled. While disabling the cache doesn't solve the problem, it does make it rarer; I was only able to get it to happen with writes that are 16 kB or larger. I tested three magnetic disks and Intel's X25-E, and only the X25-M G2 lost data during this test.
This really only matters if you are using this disk for a database where the data is critical. If your server loses power (either due to the power failing, or your UPS failing), you could lose data. You can reduce the chance of data loss by disabling the write cache:
hdparm -W 0 /dev/sdX
This makes it less likely that you will lose data.

I tested three devices: Intel's X25-M G2 SSD (80 GB), Intel's X25-E (64 GB), and a Western Digital SATA magnetic disk (WD3200AAKS, 320 GB). Only the Intel X25-M G2 lost data, even with the write cache disabled:
hdparm -W 0 /dev/sdb
It lost data in 2 out of 6 trials with 16 kB writes. Both failed attempts dropped writes (acked: 1977, disk: 1947 = 120 kB lost; acked: 479, disk: 449 = 120 kB lost). One of the "successful" trials resulted in Linux reporting a media error when attempting to overwrite the file, although it seemed to recover from this error.

A program runs that writes a sequence number to disk, then reports the number. While it is running, you "crash" the system, reboot, and check what data exists on disk. If sequence number x was reported as written, then the last value written should be x, x+1, or some partial version of x+1. If the last complete record is less than x, then data has been lost. I did the following test five times for each configuration:
Start logfilecrashserver on a workstation:
./logfilecrashserver 12345
Start minlogcrash on the system under test:
./minlogcrash tmp workstation 12345 131072
Crash the system by cutting its power, then reboot and examine the file with hexdump. The output of hexdump should show that the file contains at least the last record reported by logfilecrashserver.
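To make the procedure concrete, here is a minimal sketch of what a writer like minlogcrash could look like. This is not the actual minlogcrash source: the UDP reporting protocol, the hard-coded record size, and overwriting a single record in place are assumptions made for illustration. The important property is that a sequence number is only reported to the remote logger after fdatasync returns, so every reported number should still be readable from the file after the crash.

/* writer.c: hypothetical sketch of the write-then-report loop (not the
 * real minlogcrash). It repeatedly overwrites one record at the start of
 * the file, flushes it with fdatasync(), and only then reports the
 * sequence number to a remote logger over UDP. */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define RECORD_SIZE 131072  /* assumed record size; matches the 131072-byte writes above */

int main(int argc, char **argv) {
    if (argc != 4) {
        fprintf(stderr, "usage: %s <file> <server-ip> <port>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }
    struct sockaddr_in server;
    memset(&server, 0, sizeof(server));
    server.sin_family = AF_INET;
    server.sin_port = htons((uint16_t) atoi(argv[3]));
    if (inet_pton(AF_INET, argv[2], &server.sin_addr) != 1) {
        fprintf(stderr, "bad server address: %s\n", argv[2]);
        return 1;
    }

    static char record[RECORD_SIZE];
    for (uint64_t seq = 1; ; seq++) {
        /* Put the sequence number at the start of the record so hexdump
         * can show which record survived the crash. */
        memset(record, 0, sizeof(record));
        snprintf(record, sizeof(record), "seq %llu\n", (unsigned long long) seq);

        /* Overwrite the same region of the file on every iteration. */
        if (lseek(fd, 0, SEEK_SET) == (off_t) -1) { perror("lseek"); return 1; }
        if (write(fd, record, sizeof(record)) != (ssize_t) sizeof(record)) {
            perror("write"); return 1;
        }

        /* With write barriers, ext4 turns this into a CACHE FLUSH command;
         * a correct disk only acknowledges it once the data is durable. */
        if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

        /* Report the sequence number only after the flush succeeded. */
        if (sendto(sock, &seq, sizeof(seq), 0,
                   (struct sockaddr *) &server, sizeof(server)) < 0) {
            perror("sendto"); return 1;
        }
    }
}

UDP is a convenient choice for the report in this sketch because a dropped packet only lowers the server's last-known number, making the check more lenient; it can never produce a false failure.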
In order for a write to actually be durable, all the layers in the stack need to cooperate, so that "save to disk" actually means "I mean it: make sure all the stuff is on the disk so that I can read it back if something bad happens." On Linux with the ext3 and ext4 file systems, this works when write barriers are enabled (the default on ext4; they must be manually enabled on ext3). This feature causes the operating system to issue a CACHE FLUSH command to the disk when an application calls fsync or fdatasync. This is exactly what databases, and my test program, do. If the disk works correctly, it waits until all the data is actually written to the disk, then it acknowledges that the flush operation has completed.
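To make that contract concrete, a database's commit path boils down to something like the following sketch (hypothetical file name and record contents; error handling trimmed): success is only reported after the flush returns, and it is the disk's job to make that guarantee real.

/* Sketch of the durable-write pattern that databases rely on
 * (hypothetical; "commit.log" and the record contents are made up). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Append one record and return 0 only once it should be safe on disk. */
static int durable_append(const char *path, const char *rec, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return -1;
    int ok = (write(fd, rec, len) == (ssize_t) len)
          && (fsync(fd) == 0);   /* should force a CACHE FLUSH to the disk */
    close(fd);
    return ok ? 0 : -1;
}

int main(void) {
    const char *rec = "ticket 42 purchased\n";
    if (durable_append("commit.log", rec, strlen(rec)) != 0) {
        perror("durable_append");
        return 1;
    }
    /* If the disk lies about the flush, this message can be a lie too. */
    printf("committed\n");
    return 0;
}

If the disk acknowledges the flush before the data is actually durable, neither this sketch nor a real database can detect it, which is exactly the failure mode described in this article.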
I am not the only person to observe these problems with Intel SSDs. Others have found that the X25-E and the X25-M G2 both lose data with the write cache enabled. However, it appears that I am the first to report this type of problem with the write cache disabled, perhaps because I am the first to test writes larger than 4 kB. I've reported this to the ext4 developers to make sure I'm not making an error. Similar issues have been reported for magnetic disks in the past, although with write barriers enabled (the default for ext4; use the barrier=1 mount option for ext3) this should not be a problem. However, it is still worth testing your configuration. My test program was inspired by Brad's diskchecker.
The test file system was ext4, created with the default options:
mkfs.ext4 /dev/sdb1
Kernel log from the trial where Linux reported a media error:
[ 19.465585] ata3.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 19.465787] ata3.01: BMDMA stat 0x24
[ 19.467408] ata3.01: failed command: READ DMA
[ 19.469170] ata3.01: cmd c8/00:00:e0:37:0a/00:00:00:00:00/f0 tag 0 dma 131072 in
[ 19.469171]          res 51/40:00:48:38:0a/00:00:00:00:00/f0 Emask 0x9 (media error)
[ 19.478607] ata3.01: status: { DRDY ERR }
[ 19.481652] ata3.01: error: { UNC }
Kernel log showing a write command timing out, followed by a link reset:
[ 70.050052] ata3.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 70.052523] ata3.01: failed command: WRITE DMA EXT
[ 70.054965] ata3.01: cmd 35/00:00:00:48:04/00:04:00:00:00/f0 tag 0 dma 524288 out
[ 70.054967]          res 40/00:ff:00:00:00/00:00:00:00:00/50 Emask 0x4 (timeout)
[ 70.059888] ata3.01: status: { DRDY }
[ 75.110019] ata3: link is slow to respond, please be patient (ready=0)
[ 80.090020] ata3: device not ready (errno=-16), forcing hardreset
[ 80.090028] ata3: soft resetting link