Experiments on Disk Write Back Caches
Write back caches are implemented on hard disks to enhance write performance.
ATA drives, in particular, rely on write back caches to make up for their
longer seek times and lower RPM compared to their SCSI / Fibre
Channel counterparts. Some RAID controllers further implement a write cache on the
controller to enhance the overall performance of the system.
With write back caching turned on by default, an ATA drive can signal the
completion of writes more quickly than if it had to wait until the data was
completely transferred to the disk media. However, in the event of a failure
(such as power failure, hardware failure, etc.), data corruption may happen if
the data in the disk cache has not been flushed out to the disk media. Another
problem with ATA write back cache is that data may be flushed out to disk
out of order, i.e., if block A arrives in the cache before block B, block B may be
flushed out to disk before block A. While turning the write back cache off for
ATA disks will avoid the data corruption problem, performance will degrade. In
addition, the drives will be used in a less reliable mode, since ATA vendors do
not certify the recovery of drives that deactivate write-back caching.
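The out-of-order hazard can be illustrated with a toy model. This is a minimal sketch, not a model of any real drive firmware: the function names and parameters are hypothetical, and the flush order is supplied by the caller to stand in for whatever order a drive might choose.

```python
def simulate_power_loss(arrival_order, flush_order, flushed_before_failure):
    """Return the set of blocks that reached the media before power was lost.

    arrival_order: blocks in the order the host wrote them (all acknowledged).
    flush_order: order in which the drive flushes them (may differ).
    flushed_before_failure: number of flushes completed before the failure.
    """
    if sorted(arrival_order) != sorted(flush_order):
        raise ValueError("flush_order must be a permutation of arrival_order")
    return set(flush_order[:flushed_before_failure])


def ordering_violations(arrival_order, on_media):
    """List (earlier, later) pairs where the later block survived but the earlier was lost."""
    violations = []
    for i, earlier in enumerate(arrival_order):
        for later in arrival_order[i + 1:]:
            if later in on_media and earlier not in on_media:
                violations.append((earlier, later))
    return violations


if __name__ == "__main__":
    arrival = ["A", "B"]   # host wrote A, then B; both were acknowledged
    flush = ["B", "A"]     # the drive happens to flush B first
    media = simulate_power_loss(arrival, flush, flushed_before_failure=1)
    print(ordering_violations(arrival, media))
```

If A is a journal commit record and B is the metadata it protects, a violation of this kind is what leaves the file system inconsistent after the power returns.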
The chance of data corruption increases with a RAID system that leaves the write
back cache on. A RAID system writes stripes that span multiple disks. Since there
is no guarantee that all data blocks in a stripe will be flushed out to disk
media before a failure, the stripe may not be consistent. In the ATA world, hardware RAID vendors
typically leave disk write back cache on by default. Some provide options to
turn write back cache off. For example, the user manual for the 3ware RAID
controller warns users that "there may be instances when you always want the
computer to wait for the drive to write all the data to disk before going on to its
next task ... you must disable the write cache" (pages 54-55 of the 3ware RAID
controller user manual).
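For RAID5, stripe inconsistency can be made concrete with XOR parity. The following is a minimal sketch under simplifying assumptions (one stripe, tiny blocks, hypothetical function names); it is not taken from any controller implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, as RAID5 parity does."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity_block):
    """A stripe is consistent when the parity equals the XOR of its data blocks."""
    return xor_blocks(data_blocks) == parity_block


if __name__ == "__main__":
    old_data = [b"\x01\x01", b"\x02\x02", b"\x04\x04"]
    old_parity = xor_blocks(old_data)

    # The host rewrites the first data block and the parity; both writes are
    # acknowledged from cache.
    new_data = [b"\xff\xff", old_data[1], old_data[2]]

    # Power fails after the data disk flushed its block but before the parity
    # disk did, so the media holds new data with old parity.
    print(stripe_is_consistent(old_data, old_parity))
    print(stripe_is_consistent(new_data, old_parity))
```

A stripe left in this state looks valid to the controller, but rebuilding any of its blocks from the stale parity would produce wrong data.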
The questions are: what are such instances, and are they common? Also, what
happens when the cache has not been flushed properly and the power goes out?
In this section, we report the experiments we conducted and our findings.
Experimental Setup
We use test servers with drives connected in different configurations:
- single disk with disk cache turned on
- multiple disks connected in RAID5 / RAID10 with disk cache turned on
- multiple disks connected to RAID5 / RAID10 controllers with both controller
  cache and disk cache turned on (some RAID controllers come with a further write
  cache on the controller to enhance performance)
We tested ATA RAID as well as SCSI RAID controllers. We used EXT3 and
ReiserFS as the test file systems.
The test servers are connected to a Network Power Switch, which can be
programmed to automatically turn power supplies on or off. The servers run the
test code automatically upon boot. We leave the power on for each server for
about 5 minutes during each run before powering the server down and then up
again.
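The power-cycling schedule can be sketched as a simple loop. This is an illustration only: the `power_on` and `power_off` callables are hypothetical stand-ins for whatever interface the power switch exposes, and the 10-second off period is an assumption (only the roughly 5-minute on period comes from our setup).

```python
import time


def power_cycle(power_on, power_off, cycles,
                on_seconds=300, off_seconds=10, sleep=time.sleep):
    """Run `cycles` power cycles, leaving the server on for `on_seconds` each time."""
    for _ in range(cycles):
        power_on()
        sleep(on_seconds)    # server boots and runs the test code
        power_off()          # abrupt power loss, no clean shutdown
        sleep(off_seconds)   # assumed pause before the next cycle
    return cycles


if __name__ == "__main__":
    # Dry run with stub callables and no real waiting.
    power_cycle(lambda: print("power on"), lambda: print("power off"),
                cycles=2, sleep=lambda seconds: None)
```

Passing `sleep` as a parameter keeps the loop testable without waiting for real time to elapse.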
The test code consists of file system operations that are metadata intensive,
i.e., the operations constantly update the file system state. We run two
sets of programs simultaneously: a dbench session with 64 clients and a simple
script that constantly creates directories. These operations put a load on the
file system similar to that of a file server.
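A directory-creating workload of this kind could look like the following. This is a minimal sketch, not the script used in our tests; the function name, naming scheme, and bounded count are assumptions made so the example terminates.

```python
import os
import tempfile


def create_directories(root, count):
    """Create `count` directories under `root`; each mkdir is a metadata update."""
    created = []
    for i in range(count):
        path = os.path.join(root, f"dir{i:08d}")
        os.mkdir(path)
        created.append(path)
    return created


if __name__ == "__main__":
    # Bounded demo in a scratch directory; a stress run would instead loop on
    # the file system under test until the power is cut.
    with tempfile.TemporaryDirectory() as scratch:
        print(len(create_directories(scratch, 100)))
```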
We watch out for possible errors from the file systems. Write cache problems
will typically cause the metadata of the file system to be in a corrupted or
inconsistent state.
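Watching for errors can be partly automated by scanning the kernel log. The sketch below is illustrative: the patterns are drawn from the sample messages shown in the figures of this report and are not an exhaustive list, and the function name is hypothetical.

```python
import re

# Patterns drawn from the sample messages in this report; not exhaustive.
ERROR_PATTERNS = [
    re.compile(r"EXT3-fs error"),
    re.compile(r"\b(?:vs|PAP)-\d+: reiserfs_"),
    re.compile(r"attempt to access beyond end of device"),
    re.compile(r"Assertion failure"),
    re.compile(r"kernel BUG"),
]


def find_fs_errors(lines):
    """Return the log lines that match any known file system error pattern."""
    return [line for line in lines if any(p.search(line) for p in ERROR_PATTERNS)]


if __name__ == "__main__":
    sample = [
        "Jan 28 11:04:40 localhost kernel: attempt to access beyond end of device",
        "Jan 25 00:20:20 localhost last message repeated 6 times",
        "Jan 24 14:00:15 localhost kernel: kernel BUG at transaction.c:1208!",
    ]
    for line in find_fs_errors(sample):
        print(line)
```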
Experimental Results
The results are very consistent. Write back cache on disks or controllers
generated file system errors that rendered the file systems either corrupted or
inconsistent. Many times, the file systems could no longer be mounted. Typically,
we observed problems within 50 power cycles. The problems appeared sooner
(typically in fewer than 10 power cycles) when the cache was bigger, such as
when there was also a write-back cache on the RAID controller.
Below we show some file system error messages we have observed:
Jan 28 11:04:40 localhost kernel: attempt to access beyond end of device
Jan 28 11:04:40 localhost kernel: 09:00: rw=0, want=0, limit=156301312
Jan 28 11:04:40 localhost kernel: EXT3-fs error (device md(9,0)):
read_inode_bitmap: Cannot read inode bitmap - block_group = 576,
inode_bitmap = 4294967295
Jan 28 11:04:40 localhost kernel: EXT3-fs error (device md(9,0)) in
ext3_new_inode: IO failure
Jan 28 11:04:40 localhost kernel: attempt to access beyond end of device
Jan 28 11:04:40 localhost kernel: 09:00: rw=0, want=0, limit=156301312
Jan 28 11:04:40 localhost kernel: EXT3-fs error (device md(9,0)):
read_inode_bitmap: Cannot read inode bitmap - block_group = 576,
inode_bitmap = 4294967295
Jan 28 11:04:40 localhost kernel: EXT3-fs error (device md(9,0)) in
ext3_new_inode: IO failure
Jan 28 11:04:40 localhost kernel: attempt to access beyond end of device
Jan 28 11:04:40 localhost kernel: 09:00: rw=0, want=0, limit=156301312
Jan 28 11:04:40 localhost kernel: EXT3-fs error (device md(9,0)):
read_inode_bitmap: Cannot read inode bitmap - block_group = 576,
inode_bitmap = 4294967295
Jan 28 11:04:40 localhost kernel: EXT3-fs error (device md(9,0)) in
ext3_new_inode: IO failure
Jan 28 11:04:40 localhost kernel: attempt to access beyond end of device
Jan 28 11:04:40 localhost kernel: 09:00: rw=0, want=0, limit=156301312
Jan 28 11:04:40 localhost kernel: EXT3-fs error (device md(9,0)):
read_inode_bitmap: Cannot read inode bitmap - block_group = 576,
inode_bitmap = 4294967295
Figure 1: Sample errors from EXT3 file system
Jan 25 00:20:20 localhost last message repeated 6 times
Jan 25 00:20:20 localhost kernel: vs-13060: reiserfs_update_sd: stat data of
object [137100 137105 0x0 SD] (nlink == 1) not found (pos 2)
Jan 25 00:20:20 localhost last message repeated 5 times
Jan 25 00:20:21 localhost kernel: vs-13060: reiserfs_update_sd: stat data of
object [137100 137101 0x0 SD] (nlink == 1) not found (pos 2)
Jan 25 00:20:21 localhost last message repeated 7 times
Jan 25 00:20:21 localhost kernel: vs-13060: reiserfs_update_sd: stat data of
object [137100 137102 0x0 SD] (nlink == 1) not found (pos 2)
Jan 25 00:20:21 localhost kernel: vs-13060: reiserfs_update_sd: stat data of
object [137100 137104 0x0 SD] (nlink == 1) not found (pos 2)
Jan 25 00:20:21 localhost kernel: vs-13060: reiserfs_update_sd: stat data of
object [137100 137105 0x0 SD] (nlink == 1) not found (pos 2)
Jan 25 00:20:21 localhost last message repeated 6 times
Jan 25 00:20:21 localhost kernel: PAP-5660: reiserfs_do_truncate: wrong
result -1 of search for [137100 137101 0xfffffffffffffff DIRECT]
Jan 25 00:20:21 localhost kernel: vs-5355: reiserfs_delete_solid_item:
[137100 137101 0x0 SD] not found<4>PAP-5660: reiserfs_do_truncate: wrong
result -1 of search for [137100 137102 0xfffffffffffffff DIRECT]
Jan 25 00:20:21 localhost kernel: vs-5355: reiserfs_delete_solid_item:
[137100 137102 0x0 SD] not found<4>PAP-5660: reiserfs_do_truncate: wrong
result -1 of search for [137100 137104 0xfffffffffffffff DIRECT]
Jan 25 00:20:24 localhost kernel: vs-5355: reiserfs_delete_solid_item:
[137100 137104 0x0 SD] not found<4>vs-5355: reiserfs_delete_solid_item:
[137100 137105 0x0 SD] not found<4>vs-13060: reiserfs_update_sd: stat data
of object [137072 137074 0x0 SD] (nlink == 1) not found (pos 9)
Figure 2: Sample errors from ReiserFS file system
Jan 24 14:00:15 localhost kernel: EXT3-fs error (device sd(8,0)):
ext3_free_blocks: bit already cleared for block 260096
Jan 24 14:00:15 localhost kernel: EXT3-fs error (device sd(8,0)):
ext3_free_blocks: Freeing blocks not in datazone - block = 1043443757, count
= 1
Jan 24 14:00:15 localhost kernel: EXT3-fs error (device sd(8,0)):
ext3_free_blocks: Freeing blocks not in datazone - block = 1043443756, count
= 1
Jan 24 14:00:15 localhost kernel: EXT3-fs error (device sd(8,0)):
ext3_free_blocks: Freeing blocks not in datazone - block = 1043443756, count
= 1
Jan 24 14:00:15 localhost kernel: Assertion failure in
journal_forget_R80fce437() at transaction.c:1208: "!jh->b_committed_data"
Jan 24 14:00:15 localhost kernel: ------------[ cut here ]------------
Jan 24 14:00:15 localhost kernel: kernel BUG at transaction.c:1208!
Figure 3: Assertion errors from EXT3 file system
We have reported the file system errors to the Linux community and are
working with them to better understand the nature of the problems. In our
communications, we have received confirmations that the errors we have seen are
due to disk write back cache.
The Lesson
The lesson is that if write back cache is turned on, it is not difficult to create
metadata inconsistency or corruption in the file system upon power failure.
Using existing RAID solutions with write back cache turned on may lead to
unrecoverable corruption of the file system.
References
- Definition of Write Back Cache at SNIA Dictionary site:
  "A caching technique in which the completion of a write request is signaled
  as soon as the data is in cache, and actual writing to non-volatile media
  occurs at a later time. Write-back cache includes an inherent risk that an
  application will take some action predicated on the write completion
  signal, and a system failure before the data is written to non-volatile
  media will cause media contents to be inconsistent with that subsequent
  action. For this reason, good write-back cache implementations include
  mechanisms to preserve cache contents across system failures (including
  power failures) and to flush the cache at system restart time."
- Hdparm is a Linux utility for accessing and controlling the parameters
  of ATA disks. It can be used to turn the write caching off.
  Also, different perspectives on Write-Back Caches.
- Discussions of Write Back Cache for hardware RAID products
  - On Write Back Caches at Compaq Smart Array 5i Controller Q&A section
    "Q12. Does the Smart Array 5i support write-back cache?
    A12. No, Compaq believes that data integrity is the most important feature of
    any of our array controller products. Write-back cache is vulnerable
    to power drops. With higher-end Compaq array controllers (e.g., the
    Smart Array 5300 family), write-back cache is protected by a unique
    removable battery backed cache daughter board. Since this is a
    costly feature to implement, standard with higher end array
    controllers, the Smart Array 5i does not support battery
    backed write-back cache."
  - On disabling Write Back Cache for 3ware RAID controller (pg. 54-55).
    "The Escalade ATA RAID Controller gives you a choice of disabling the write
    cache for your disk arrays. Write cache is used to store data locally on
    the drive before it is written to the disk, allowing the computer to
    continue with its next task. Enabling the write cache results in the most
    efficient access times for your computer system. There may be instances
    when you always want the computer to wait for the drive to write all the
    data to disk before going on to its next task. For this case, you must
    disable the write cache."
- Postings on the Internet on write back caches:
  - Discussion of write back caches at netbsd.org.
    "if you want *real* protection (that is,
    metadata consistency) you must (on netbsd and linux) disable write
    cache. using writeback cache on the drive, you're only protected from
    some things (accidently hit reset, kernel panic). you're not protected
    from power failure. i have a ups, but i still disable write cache. a ups
    can fail, and a machine's psu can fail as well."
  - Article on ReiserFS tuning and how to work with disk write back
    cache.
    "If you have an UPS, enable write caching
    by default, and configure your UPS daemon to automatically disable write
    caching when a power failure occurs."
  - Article on ReiserFS by Chris Mason.
    "For performance
    benchmarks, some of the new drives have write-back caching by default.
    This means the drive reports a write is completed before it is actually
    on the media. The block is still in the drive's cache, where the writes
    can be reordered. If this happens, metadata changes might be written
    before the log commit blocks, leading to corruption if the machine loses
    power. It is very important to disable write-back caching on both IDE
    and SCSI drives.
    Some hardware RAID controllers provide a
    battery-backed write-back cache that preserves the cache contents if the
    system loses power. These should be safe to use, but the cache battery
    should be checked often. A dramatic performance increase can be seen
    with these write caches, especially for log intensive applications like
    mail servers."
  - Post from Echostar on disk write back cache for set top device.
    "... when talking to drive
    manufacturers, we are told that if the write cache is disabled, the life
    of the drive is substantially reduced... In our application, (consumer
    set top box) we cannot always cleanly shut down the system."
Acknowledgements
We would like to thank members of the Linux community, in particular Hans Reiser
and Stephen Tweedie, for helping us understand write back cache issues and file
system corruption.