
Building low-cost and reliable Linux storage servers

We have built many low-cost and reliable Linux storage server systems. They are either deployed at customer sites or used internally for data storage and backup, and they have been running for a long time without problems. In this article, we share our experience and insight on how to create reliable Linux storage systems using high-volume, standard, low-cost hardware components.

We define a Linux storage server to be a server system with storage in it. The storage system can be configured to be a file server, backup server or application server. In future articles, we will discuss how to build value-added storage applications on top of a storage system.

Introduction

Many articles have already been written that provide step-by-step instructions on building servers and RAID systems. In this article, we will focus on the cost, reliability and performance of the system and the various tradeoffs among these factors. This article will address hardware and software issues when building a RAID-based storage server.

In this article, we will cite several commercial hardware products and open source software components used in building a RAID storage server. By no means are we endorsing these products or components; we are merely using them as examples. There are many alternatives for most of these components; we recommend that readers experiment with components that suit their requirements.

Hardware selections

Although hard drive prices have dropped considerably in the past few years, the hard drives are still the single most expensive component in a RAID storage server. In general, IDE / ATA drives account for about 1/3 to 1/2 of the cost of the entire server. For SCSI drive servers, that ratio is closer to 3/4 of the total cost. Compared to IDE drives, SCSI drives have a higher MTBF (Mean Time Between Failures) and spin at a significantly higher rate. As a result, SCSI drives require better ventilation to maintain the higher MTBF. Another advantage of SCSI is its hot swap capability, which IDE drives currently do not support. Fortunately, Serial ATA (SATA) drives and controllers have just begun to hit the market, and by the end of 2003 the SATA interface should become more popular. With proper OS support, SATA drives will support hot swap. Since cost is a major consideration, we will focus on building RAID storage systems with IDE drives.

There are many major manufacturers of hard drives. In general, most drives have fairly similar performance given the same storage size, spin rate, and cache size. Reliability, however, varies with the model and manufacturer. In our lab, we have had low defect rates with both Western Digital 40G WD400 and Maxtor DiamondMax 40G, 80G and 120G drives.

For most storage requirements, a 3U rackmount RAID storage server should be more than sufficient. A 1U rackmount chassis can have up to 4 external drive bays. A 2U rackmount chassis can have 6 to 9 external drive bays. A 3U rackmount chassis can have up to 14 drive bays. With 180G drives, a 3U server can have about 2.5TB of raw storage! We prefer the 2U rackmount chassis because it holds a considerable amount of storage and is much easier to handle and mount than a 3U chassis. We have heard complaints from customers who find 3U systems too heavy.
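The raw capacity figures above are simply the number of drive bays multiplied by the drive size. As a small sketch (the function name raw_capacity_gb is ours, chosen for illustration only), the arithmetic can be checked from a shell:

```shell
# Raw (pre-RAID, unformatted) capacity in GB: drive bays * GB per drive.
# $1 = number of drive bays, $2 = drive size in GB
raw_capacity_gb() {
  echo $(( $1 * $2 ))
}

raw_capacity_gb 14 180   # 3U chassis, 14 bays: prints 2520 (about 2.5TB)
raw_capacity_gb 4 180    # 1U chassis, 4 bays: prints 720
```

Note that this is raw capacity only; any RAID level with redundancy will reduce the usable space.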

For the CPUs and motherboards, we recommend using P4 or Xeon systems over PIII for two reasons: cost and performance. The cost difference between a PIII CPU and a P4 / Xeon CPU is small, but a P4 / Xeon system delivers higher CPU and I/O performance. What is even more interesting is that a Xeon CPU is not that much more expensive than a P4 CPU (we found that many still think a Xeon CPU is many times the price of a P4 CPU): on Feb 6, 2003, according to Price Watch, a 2.4GHz Xeon CPU was $259, a 2.4GHz P4 CPU was $180 and a 1.4GHz PIII was $193. Xeon CPUs can be used in SMP configurations, while P4 CPUs cannot. For better I/O performance, we also recommend using motherboards with 64-bit PCI slots. We have also found that the 32-bit PCI slots on some PIII motherboards perform poorly.

For P4 and Xeon class servers, there are two types of memory available on the market today: DDR and RDRAM. In most benchmarks, DDR and RDRAM are very similar in performance. In general, DDR has lower latency while RDRAM has higher throughput. In a RAID storage server, these differences may prove to be insignificant. The amount of memory to install on a RAID storage server should depend on the type of application you’re planning to run. We’ve tested servers with as little as 256MB of memory, and have observed RAID performance similar to that of servers with 1GB of memory. In general, since RAM is cheap, it is good to install more RAM, as the applications and operating system can take advantage of it.

 

RAID selections

Hardware IDE RAID controllers have become quite popular with home and some enterprise users lately because they are fairly inexpensive relative to SCSI RAID controllers and offer respectable performance. However, in our experience, many IDE RAID controllers do not adequately address the disk write back issue with IDE disks (see our article on disk write back cache for more information). Data lost in the write back cache during a power failure or other system failure may cause irreparable damage or corruption even to journaling file systems. So if IDE RAID controllers are used, it is strongly recommended that the disk write back cache be disabled. However, we have observed that if the write back cache is turned off, performance will degrade, sometimes significantly.

The alternative to hardware RAID is software RAID. Given that most RAID storage servers will have at least 4 to 6 drives, the two IDE ports typically found on the motherboard will not be sufficient. In addition, it is generally not recommended to use a master-slave IDE configuration. IDE drives connected in a master-slave configuration have to share the "channel". Thus, when an I/O request is sent to one of the drives, another I/O request cannot be issued to the other drive until the first drive responds. In addition, if one drive on a channel fails, the other drive on that channel may not respond to I/O requests. In our lab, we have seen a performance loss of about 50% with drives connected in master-slave configurations. A good article on master/slave IDE configurations can be found at PCGuide. We have had great experience using the Promise Ultra100 TX2 2-channel IDE controllers to connect IDE drives. Each controller costs about $25.
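To give each drive its own channel, the number of add-in controllers follows from the drive count. The sketch below is only a back-of-the-envelope helper: the function name controllers_needed is hypothetical, and it assumes 2-channel add-in controllers and that every onboard channel passed in is free for a data drive (no boot drive or CD-ROM on it).

```shell
# Add-in 2-channel IDE controllers needed so that no two drives share a channel.
# $1 = number of data drives
# $2 = onboard IDE channels available for data drives (assumption: all are free)
controllers_needed() {
  remaining=$(( $1 - $2 ))
  if [ "$remaining" -le 0 ]; then
    echo 0
  else
    echo $(( (remaining + 1) / 2 ))   # round up: 2 channels per controller
  fi
}

controllers_needed 6 2    # prints 2
controllers_needed 4 2    # prints 1
controllers_needed 13 2   # prints 6
```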

Under Linux, there are at least two open source options for software RAID: Logical Volume Manager (LVM) and Linux Software RAID. LVM only supports RAID level 1, or mirroring, while Linux Software RAID supports RAID levels 0, 1, 4, and 5. LVM provides volume management and volume snapshots, and can have Linux Software RAID run below it. Since the various Linux file systems do not currently address the disk write back cache issue, it is safer to run Linux Software RAID or LVM with the disk write cache disabled. This combination will compromise performance, but it will ensure that file systems do not get corrupted. One shortcoming of Linux Software RAID 5 is that it regenerates parity each time the system comes back from an unclean shutdown (such as a power failure). The parity regeneration step ensures that the parity is consistent with the data (a parity write issued just before a crash may not completely make it to disk), and it can take several hours. If a disk fails before regeneration completes, data rebuilt from inconsistent parity will be wrong. See our RAID5 introduction article for more in-depth discussion.
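The parity consistency problem is easier to see with numbers. RAID5 parity is the XOR of the data blocks in a stripe, so any one missing block can be rebuilt by XORing the surviving blocks with the parity. The following is a toy sketch only: each "block" is shrunk to a single byte, the values are arbitrary, and it does not reflect how any particular RAID implementation lays out or schedules its writes.

```shell
# Three data blocks in one stripe, shrunk to one byte each for illustration.
d0=165
d1=60
d2=15

# Parity block: XOR of all data blocks in the stripe.
parity=$(( d0 ^ d1 ^ d2 ))

# Normal case: the drive holding d1 fails; survivors XOR parity rebuild it.
rebuilt=$(( d0 ^ d2 ^ parity ))
echo "parity=$parity rebuilt_d1=$rebuilt"          # prints parity=150 rebuilt_d1=60

# Crash case: a new d1 reached the disk, but the matching parity write did not.
d1=61
stale_rebuilt=$(( d0 ^ d2 ^ parity ))
echo "on_disk_d1=$d1 rebuilt_from_stale_parity=$stale_rebuilt"   # 61 versus 60

# Regenerating parity from the data restores consistency.
parity=$(( d0 ^ d1 ^ d2 ))
echo "resynced_rebuild=$(( d0 ^ d2 ^ parity ))"    # prints resynced_rebuild=61
```

In the crash case the rebuilt value silently differs from what was on disk, which is why the parity must be regenerated before it can be trusted for recovery.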

SR5 provides RAID5 and allows IDE drives to run with write cache on, while ensuring data integrity even when there are system or power failures. It offers both performance and reliability. Experiments have demonstrated that SR5 outperforms other IDE RAID solutions we tested, hardware or software. See our article on performance comparison for more details.

Software RAID solutions have to rely on the CPU for computation, so CPU utilization may be slightly higher than with hardware solutions. While hardware RAID relies on special XOR hardware for computing RAID5 parity, SR5 leverages the MMX™ Technology on Intel CPUs for SIMD XOR computation. On a P4 system, XOR computation can be performed at 2Gbytes/sec, well above the needs of I/O systems. In our experience, P4 / Xeon class servers provide better I/O performance than PIII servers. Since the costs of PIII systems and P4 / Xeon systems are not far apart, we recommend using P4 / Xeon systems. In the benchmarks we ran, we found that the net performance of the system is actually higher for software solutions (SR5 and Linux Software RAID5) than for hardware-based RAID systems.

Software considerations

On top of the RAID device, one has to decide which file system to run. Journaling file systems are the best choice for large RAID servers. If a RAID server is run with a traditional file system without journaling, it may take hours or perhaps days to perform fsck on the RAID after an unclean shutdown. There are many journaling file system options available under Linux; to name a few: EXT3, ReiserFS, JFS, and XFS. In our lab, we have experimented mainly with EXT3 and ReiserFS. See the site at Guru Labs for performance benchmarks comparing these two file systems for email server applications. There are also comprehensive performance comparisons at the ReiserFS site. With EXT3, there are 3 journaling options: writeback, ordered and journal. We recommend using at least the default option of data=ordered to ensure the file system metadata will not become corrupted after an unclean shutdown. Check out this site and its references for other file server tuning information. Another great resource on Linux file systems can be found at the IBM developerWorks site.

Finally, the IDE drives can be tuned to run at optimal performance via hdparm. A good reference on hdparm can be found on the O'Reilly Network site. We normally use the following options on our storage servers: hdparm -m16 -c3 -W1 /dev/hda
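The EXT3 journaling mode is selected with the data= mount option. As a hedged sketch, the helper below only prints the mount command rather than running it; the function name ext3_mount_cmd, the device /dev/md0 and the mount point /mnt/storage are placeholders for illustration, not names taken from our servers.

```shell
# Print (do not run) a mount command for an EXT3 file system in a given journaling mode.
# $1 = journaling mode, $2 = block device, $3 = mount point
ext3_mount_cmd() {
  case "$1" in
    writeback|ordered|journal)
      echo "mount -t ext3 -o data=$1 $2 $3" ;;
    *)
      echo "unknown EXT3 journaling mode: $1" >&2
      return 1 ;;
  esac
}

# /dev/md0 and /mnt/storage are placeholder names.
ext3_mount_cmd ordered /dev/md0 /mnt/storage
```

Printing the command first makes it easy to review before running it as root on a real device.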
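For readers unfamiliar with these flags, here is an annotated sketch. The flag descriptions follow the hdparm manual page; the wrapper function hdparm_tune_cmd is hypothetical and only prints the command, so nothing is changed on the drive until you run the printed line yourself as root.

```shell
# Build (but do not execute) the hdparm tuning command for one drive.
# $1 = device, $2 = write cache setting: 1 = on, 0 = off
hdparm_tune_cmd() {
  # -m16  sector count for multiple-sector I/O (16 sectors per transfer)
  # -c3   32-bit I/O support, using the special sync sequence
  # -W    drive write caching (1 enables, 0 disables)
  echo "hdparm -m16 -c3 -W$2 $1"
}

hdparm_tune_cmd /dev/hda 1   # the options quoted above
hdparm_tune_cmd /dev/hda 0   # same tuning with the write cache disabled
```

The second form is the one to consider when the RAID layer or file system cannot tolerate data lost from the drive's write cache.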

Performance benchmarks

Once the software is configured to your liking on the storage server, you should benchmark the various components of the server. A quick way to verify the raw throughput of the disks is to use the Linux tool dd, which is useful in determining whether the disks have been configured properly with the correct hdparm parameters. dd is also useful in determining the raw throughput of the RAID device. Typically, a properly configured IDE drive with write cache enabled will yield 20-30MB/s of throughput. To write 2 gigabytes of data to a device (DO NOT do this if you have a file system mounted on this device or have any data on it), run 'dd if=/dev/zero of=output bs=32k count=65536' with of= pointing at the device; 65536 blocks of 32KB each add up to 2 gigabytes.

To benchmark the entire storage server from the file system perspective, you can use dbench, developed by Andrew Tridgell, and PostMark, developed by Network Appliance. Together they broadly represent the typical workloads of a storage system. dbench measures a file server workload, and PostMark measures an application workload consisting of small file operations such as email, e-commerce, etc.

We have conducted performance comparisons of several RAID5-based storage servers using these two benchmarks. Please refer to the performance comparison page for more information.
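The amount of data dd writes is its block size multiplied by its block count, so it is worth computing the count before pointing dd at a device. The sketch below uses a hypothetical helper, dd_count, and exercises dd only against a small scratch file created with mktemp, never against a raw device.

```shell
# Number of blocks dd must write to reach a target size.
# $1 = target size in MB (1 MB = 1024 KB here), $2 = block size in KB
dd_count() {
  echo $(( $1 * 1024 / $2 ))
}

dd_count 2048 32   # 2GB in 32KB blocks: prints 65536

# Safe dry run: write only 4MB to a scratch file instead of a device.
scratch=$(mktemp)
dd if=/dev/zero of="$scratch" bs=32k count="$(dd_count 4 32)" 2>/dev/null
wc -c < "$scratch"   # size of the scratch file in bytes
rm -f "$scratch"
```

When run against a real device without the stderr redirect, dd reports the elapsed time itself, and dividing the bytes written by that time gives the raw throughput.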

A setup

We have built and experimented with many RAID storage servers in the past. These servers range from single P3 to dual P4 Xeon systems with 4 to 13 IDE drives. Here is the configuration of one such server using newer hardware components. This configuration offers good I/O and network performance for a storage server. It can fit into a 2U chassis.

6-Drive 720GB Dual P4 Xeon Storage Server

- Supermicro Dual Xeon X5DPE-G2 motherboard (this motherboard comes with dual Gigabit Ethernet controllers)
- Intel P4 Xeon 2.4GHz processors
- 1/2 GB ECC registered DDR266 memory
- Maxtor 120GB 7,200RPM hard drive (6 drives)
- Promise Ultra100 TX2 IDE controllers (2 controllers)
- 460W Xeon power supply

The pricing for each component is available at websites such as Price Watch.

In this configuration, we would put SR5 Software RAID on the 6 drives. The two pictures below show a 2U system in this configuration. This configuration is commonly used by our customers.
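Since RAID5 spends one drive's worth of space on parity, the usable capacity of this configuration is smaller than the 720GB raw figure. A quick sketch of the arithmetic (the function name raid5_usable_gb is ours, and file system overhead is ignored):

```shell
# Usable RAID5 capacity in GB: one drive's worth of space holds parity.
# $1 = number of drives, $2 = drive size in GB
raid5_usable_gb() {
  echo $(( ($1 - 1) * $2 ))
}

raid5_usable_gb 6 120   # prints 600 (out of 720 raw)
```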

Using this configuration, we are able to obtain the following benchmarks. See the performance comparison section for more details about these benchmarks.

- dbench benchmark

8 clients    16 clients   32 clients   64 clients
95 MB/sec    90 MB/sec    60 MB/sec    25 MB/sec

 

- postmark benchmark

Transactions/sec   Data read (KB/sec)   Data written (KB/sec)
416                1024                 1945

 

Last update: October 27, 2003. Copyright © 2003 Boon Storage Technologies, Inc. All Rights Reserved.