I have been testing one of Intel's "consumer" SSDs (X25-M G2) to see if it stores data durably, meaning that if the disk claims the data has been written, it actually survives a power failure. This is important because you want your airline ticket to stay purchased after you buy it, even if the system crashes. The conclusion is that this SSD can lose data in power failures, even with the write cache disabled. This means that if you are using it for a database, committed transactions could be lost (meaning lost airline tickets, forgotten bank deposits, etc.). I believe this is a "bug" in this device, as this is not how disks are supposed to work. The good news is that I was able to test Intel's "enterprise" SSD (X25-E), and it seems to work as expected. Unfortunately, other SSDs have similar problems. In this article, I'll describe the bug, the real-world impact, and how I tested it. I'm still contacting some experts to see if this observation agrees with what they have found, so I'll update this article if I discover new information.
Intel's X25-M G2 disk loses data approximately 25% of the time when the disk loses power, even when the disk is instructed to flush its cache (write barriers on the ext4 file system), or with the write cache disabled. While disabling the cache doesn't solve the problem, it does make it rarer; I was only able to get it to happen with writes that are 16 kB or larger. I tested three magnetic disks and Intel's X25-E, and only the X25-M G2 lost data during this test.
This really only matters if you are using this disk for a database where the data is critical. If your server loses power (either due to the power failing, or your UPS failing), you could lose data. You can reduce the chance of data loss by disabling the write cache:
hdparm -W 0 /dev/sdX
This makes it less likely that you will lose data.

I tested three devices: Intel's X25-M G2 SSD (80 GB), Intel's X25-E (64 GB), and a Western Digital SATA magnetic disk (WD3200AAKS, 320 GB). Only the Intel X25-M G2 lost data, even with the write cache disabled:
hdparm -W 0 /dev/sdb
It lost data in 2 out of 6 trials with 16 kB writes. Both failed attempts dropped writes (acked: 1977, disk: 1947 = 120 kB lost; acked: 479, disk: 449 = 120 kB lost). One of the "successful" trials resulted in Linux reporting a media error when attempting to overwrite the file, although it seemed to recover from this error.

A program runs that writes a sequence number to disk, then reports the number. While it is running, you "crash" the system, reboot, and check what data exists on disk. If sequence number x was reported as written, then the last value written should be x, x+1, or some partial version of x+1. If the last complete record is less than x, then data has been lost. I did the following test five times for each configuration:
Start logfilecrashserver on a workstation:
./logfilecrashserver 12345
Start minlogcrash on the system under test:
./minlogcrash tmp workstation 12345 131072
Crash the system by cutting its power, then reboot and examine the file with hexdump. The output of hexdump should show that the file contains at least the last record reported by logfilecrashserver.
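To make the procedure concrete, here is a minimal sketch of what a writer like minlogcrash could look like. This is not the actual minlogcrash source: the UDP reporting protocol, the hard-coded record size, and overwriting a single record in place are assumptions made for illustration. The important property is that a sequence number is only reported to the remote logger after fdatasync returns, so every reported number should still be readable from the file after the crash.

/* writer.c: hypothetical sketch of the write-then-report loop (not the
 * real minlogcrash). It repeatedly overwrites one record at the start of
 * the file, flushes it with fdatasync(), and only then reports the
 * sequence number to a remote logger over UDP. */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define RECORD_SIZE 131072  /* assumed record size; matches the 131072-byte writes above */

int main(int argc, char **argv) {
    if (argc != 4) {
        fprintf(stderr, "usage: %s <file> <server-ip> <port>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }
    struct sockaddr_in server;
    memset(&server, 0, sizeof(server));
    server.sin_family = AF_INET;
    server.sin_port = htons((uint16_t) atoi(argv[3]));
    if (inet_pton(AF_INET, argv[2], &server.sin_addr) != 1) {
        fprintf(stderr, "bad server address: %s\n", argv[2]);
        return 1;
    }

    static char record[RECORD_SIZE];
    for (uint64_t seq = 1; ; seq++) {
        /* Put the sequence number at the start of the record so hexdump
         * can show which record survived the crash. */
        memset(record, 0, sizeof(record));
        snprintf(record, sizeof(record), "seq %llu\n", (unsigned long long) seq);

        /* Overwrite the same region of the file on every iteration. */
        if (lseek(fd, 0, SEEK_SET) == (off_t) -1) { perror("lseek"); return 1; }
        if (write(fd, record, sizeof(record)) != (ssize_t) sizeof(record)) {
            perror("write"); return 1;
        }

        /* With write barriers, ext4 turns this into a CACHE FLUSH command;
         * a correct disk only acknowledges it once the data is durable. */
        if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

        /* Report the sequence number only after the flush succeeded. */
        if (sendto(sock, &seq, sizeof(seq), 0,
                   (struct sockaddr *) &server, sizeof(server)) < 0) {
            perror("sendto"); return 1;
        }
    }
}

UDP is a convenient choice for the report in this sketch because a dropped packet only lowers the server's last-known number, making the check more lenient; it can never produce a false failure.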
In order for a write to actually be durable, all the layers in the stack need to cooperate, so that "save to disk" actually means "I mean it: make sure all the stuff is on the disk so that I can read it back if something bad happens." On Linux with the ext3 and ext4 file systems, this works when write barriers are enabled (the default on ext4; they must be manually enabled on ext3). This feature causes the operating system to issue a CACHE FLUSH command to the disk when an application calls fsync or fdatasync. This is exactly what databases, and my test program, do. If the disk works correctly, it waits until all the data is actually written to the disk, then it acknowledges that the flush operation has completed.
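To make that contract concrete, a database's commit path boils down to something like the following sketch (hypothetical file name and record contents; error handling trimmed): success is only reported after the flush returns, and it is the disk's job to make that guarantee real.

/* Sketch of the durable-write pattern that databases rely on
 * (hypothetical; "commit.log" and the record contents are made up). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Append one record and return 0 only once it should be safe on disk. */
static int durable_append(const char *path, const char *rec, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return -1;
    int ok = (write(fd, rec, len) == (ssize_t) len)
          && (fsync(fd) == 0);   /* should force a CACHE FLUSH to the disk */
    close(fd);
    return ok ? 0 : -1;
}

int main(void) {
    const char *rec = "ticket 42 purchased\n";
    if (durable_append("commit.log", rec, strlen(rec)) != 0) {
        perror("durable_append");
        return 1;
    }
    /* If the disk lies about the flush, this message can be a lie too. */
    printf("committed\n");
    return 0;
}

If the disk acknowledges the flush before the data is actually durable, neither this sketch nor a real database can detect it, which is exactly the failure mode described in this article.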
I am not the only person to observe these problems with Intel SSDs. Others have found that the X25-E and the X25-M G2 both lose data with the write cache enabled. However, it appears that I am the first to report this type of problem with the write cache disabled, perhaps because I am the first to test writes larger than 4 kB. I've reported this to the ext4 developers to make sure I'm not making an error. Similar issues have been reported for magnetic disks in the past, although with write barriers enabled (the default for ext4; use the barrier=1 mount option for ext3) this should not be a problem. However, it is still worth testing your configuration. My test program was inspired by Brad's diskchecker.
The test file system was ext4, created with the default options:
mkfs.ext4 /dev/sdb1
Kernel log from the trial where Linux reported a media error:
[ 19.465585] ata3.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 19.465787] ata3.01: BMDMA stat 0x24
[ 19.467408] ata3.01: failed command: READ DMA
[ 19.469170] ata3.01: cmd c8/00:00:e0:37:0a/00:00:00:00:00/f0 tag 0 dma 131072 in
[ 19.469171]          res 51/40:00:48:38:0a/00:00:00:00:00/f0 Emask 0x9 (media error)
[ 19.478607] ata3.01: status: { DRDY ERR }
[ 19.481652] ata3.01: error: { UNC }
Kernel log showing a write command timing out, followed by a link reset:
[ 70.050052] ata3.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 70.052523] ata3.01: failed command: WRITE DMA EXT
[ 70.054965] ata3.01: cmd 35/00:00:00:48:04/00:04:00:00:00/f0 tag 0 dma 524288 out
[ 70.054967]          res 40/00:ff:00:00:00/00:00:00:00:00/50 Emask 0x4 (timeout)
[ 70.059888] ata3.01: status: { DRDY }
[ 75.110019] ata3: link is slow to respond, please be patient (ready=0)
[ 80.090020] ata3: device not ready (errno=-16), forcing hardreset
[ 80.090028] ata3: soft resetting link