Fault Detection and Prevention

Many types of failure can plague a storage server. Unsurprisingly, the majority come from mechanical components such as hard drives and CPU/chassis fans. When a CPU or chassis fan fails, the system temperature can rise significantly and induce further failures in the server. A storage server is typically deployed out of sight, in a server room or even a closet, so it is very important that a proper failure detection mechanism actively monitors it.

CPU and Fan failure

Two utilities that monitor CPU temperature and/or the CPU and chassis fans are:

- Khealthcare
- ACPI Temperature Monitor
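As a rough illustration of what such a temperature monitor does, the sketch below reads one thermal zone and compares it with an alert limit. It is not taken from either utility above. The sysfs path is an assumption that holds on modern Linux kernels (zone numbering differs between machines), and the 70 degree limit and the function names are hypothetical.

```python
from pathlib import Path

# Assumption: modern Linux kernels expose thermal zones here, in millidegrees Celsius.
# The zone index (0) is machine specific.
THERMAL_ZONE = Path("/sys/class/thermal/thermal_zone0/temp")

# Hypothetical alert limit; pick one that suits the hardware.
LIMIT_C = 70.0


def read_temperature_c(path: Path = THERMAL_ZONE) -> float:
    """Read a file holding millidegrees Celsius and return degrees Celsius."""
    return int(path.read_text().strip()) / 1000.0


def is_overheating(temp_c: float, limit_c: float = LIMIT_C) -> bool:
    """Return True when the reading has reached the alert limit."""
    return temp_c >= limit_c


if __name__ == "__main__":
    temp = read_temperature_c()
    status = "ALERT" if is_overheating(temp) else "ok"
    print(f"{status}: {temp:.1f} C")
```

Run periodically (for example from cron), a check like this can raise an alert when a failed fan starts to push the temperature up.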

 

Disk failure

While RAID can protect against disk failure, it is better to monitor the health of the disks and take precautionary measures before any failure happens. S.M.A.R.T., which stands for Self-Monitoring, Analysis and Reporting Technology, is an industry standard developed by a consortium of disk drive manufacturers to increase the reliability of drives. It enables the computer to predict the future failure of hard disk drives. Most modern hard drives ship with S.M.A.R.T., and there are commercial monitoring tools based on this technology.

An open source S.M.A.R.T. tool is Smart Suite, developed at UCSC. This tool has very good support for IDE drives and can report various drive vendor specific information. For example, the following is a sample report for a Maxtor IDE drive. From this report, we can see that the drive has logged 29 ATA errors and has been power cycled 1411 times.

Vendor Specific SMART Attributes with Thresholds:
Revision Number: 11
Attribute                               Flag     Value   Worst  Threshold    Raw Value
(  1)Raw Read Error Rate                0x0029   100     253      020            0
(  3)Spin Up Time                       0x0027   078     075      020            2816
(  4)Start Stop Count                   0x0032   098     098      008            1411
(  5)Reallocated Sector Ct              0x0033   100     100      020            0
(  7)Seek Error Rate                    0x000b   100     085      023            0
(  9)Power On Hours                     0x0012   092     092      001            5634
( 10)Spin Retry Count                   0x0026   100     100      000            0
( 11)Calibration Retry Count            0x0013   100     100      020            0
( 12)Power Cycle Count                  0x0032   098     098      008            1411
( 13)Read Soft Error Rate               0x000b   100     100      023            0
(194)Temperature                        0x0022   091     084      042            25
(195)Hardware ECC Recovered             0x001a   040     002      000            83806086
(196)Reallocated Event Count            0x0010   100     100      020            0
(197)Current Pending Sector             0x0032   100     100      020            0
(198)Offline Uncorrectable              0x0010   100     100      000            0
(199)UDMA CRC Error Count               0x001a   111     111      000            89
SMART Error Log:
SMART Error Logging Version: 1
Error Log Data Structure Pointer: 04
ATA Error Count: 29
Non-Fatal Count: 0

This utility can also instruct the drive to run S.M.A.R.T. self-tests, which can be performed on a live server. The self-tests can detect possible disk failures before they occur with about 70% accuracy. The utility also comes with a daemon that actively monitors the disks and reports errors to syslog. A related tool is smartmontools, which is based on Smart Suite.
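The attribute table in the sample report above can also be checked by a script. The following is a minimal sketch, not part of Smart Suite: it assumes the report layout shown above, and it applies the usual S.M.A.R.T. convention that an attribute has failed when its normalized value is at or below its threshold. The names `SmartAttribute`, `parse_attributes` and `failing` are hypothetical.

```python
import re
from typing import List, NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    flag: int
    value: int      # current normalized value
    worst: int      # worst normalized value recorded
    threshold: int  # failure threshold
    raw: int        # vendor specific raw value


# Matches rows such as "(  5)Reallocated Sector Ct   0x0033   100   100   020   0".
ROW = re.compile(
    r"^\(\s*(\d+)\)(.+?)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)


def parse_attributes(report: str) -> List[SmartAttribute]:
    """Extract attribute rows from a report, ignoring every other line."""
    attrs = []
    for line in report.splitlines():
        m = ROW.match(line.strip())
        if m:
            attr_id, name, flag, value, worst, threshold, raw = m.groups()
            attrs.append(SmartAttribute(
                int(attr_id), name.strip(), int(flag, 16),
                int(value), int(worst), int(threshold), int(raw),
            ))
    return attrs


def failing(attrs: List[SmartAttribute]) -> List[SmartAttribute]:
    """Return attributes whose normalized value is at or below the threshold.

    A threshold of 0 is skipped: such attributes are informational only.
    """
    return [a for a in attrs if a.threshold > 0 and a.value <= a.threshold]


# Three rows copied from the sample report, plus the header line.
SAMPLE = """\
Attribute                               Flag     Value   Worst  Threshold    Raw Value
(  1)Raw Read Error Rate                0x0029   100     253      020            0
(  5)Reallocated Sector Ct              0x0033   100     100      020            0
( 12)Power Cycle Count                  0x0032   098     098      008            1411
"""

if __name__ == "__main__":
    for attr in failing(parse_attributes(SAMPLE)):
        print(f"FAILING: ({attr.attr_id}) {attr.name}")
```

None of the attributes in the sample report is at or below its threshold, so the check reports nothing for this drive. The raw values (for example the 89 UDMA CRC errors) are vendor specific and still deserve a manual look.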

               

 

Last update: October 27, 2003. Copyright © 2003 Boon Storage Technologies, Inc. All Rights Reserved.