Fault Detection and Prevention

Many types of failure can plague a storage server. Unsurprisingly, the majority of failures come from mechanical devices such as hard drives and CPU/chassis fans. When a CPU or chassis fan fails, the system temperature can rise significantly and induce other failures in the server. A storage server is typically deployed in an out-of-sight area such as a server room or even a closet, so it is very important that a proper failure detection mechanism is in place to actively monitor the server.

CPU and Fan failure

Here are two utilities that monitor CPU temperature and/or CPU/chassis fans:

Disk failure

While RAID can protect against disk failure, it is better to monitor the health of the disks and take precautionary measures before any failure happens. S.M.A.R.T., which stands for Self-Monitoring, Analysis and Reporting Technology, is an industry standard developed by a consortium of disk drive manufacturers to increase the reliability of drives. It enables the computer to predict the future failure of hard disk drives. Most modern hard drives ship with S.M.A.R.T., and there are commercial monitoring tools based on this technology. An open source S.M.A.R.T. tool is Smart Suite, developed at UCSC. This tool has very good support for IDE drives and can report various vendor-specific drive information. For example, for a Maxtor IDE drive, the following is a sample report. From this report, we can see that this drive has logged 29 ATA errors and has been power cycled 1411 times.
This utility can also signal S.M.A.R.T. to perform self-tests, which can run on a live server. The S.M.A.R.T. self-test can detect possible disk failures before the actual event with about 70% accuracy. The utility also comes with a daemon, which actively monitors the disks and reports errors to syslog. A related tool is Smartmontools, which is based on Smart Suite.
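
As a rough sketch of the kind of check a monitoring daemon performs, the Python below scans a S.M.A.R.T. attribute table and flags any attribute whose normalized value has fallen to or below its vendor threshold. It assumes the column layout printed by `smartctl -A` in Smartmontools (ID, name, flag, value, worst, threshold, ...); the output of Smart Suite may differ. The function name `failing_attributes` and the sample rows are hypothetical and are not taken from a real drive.

```python
import re

# Matches one attribute row: ID, name, flag, value, worst, threshold.
# Assumes the column order of `smartctl -A` (an assumption, see lead-in).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\b"
)


def failing_attributes(table_text):
    """Return (id, name, value, threshold) for rows at or below threshold."""
    failing = []
    for line in table_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header line or unrelated text
        attr_id, name = int(m.group(1)), m.group(2)
        value, thresh = int(m.group(3)), int(m.group(5))
        # A threshold of 0 means the attribute can never trip.
        if thresh > 0 and value <= thresh:
            failing.append((attr_id, name, value, thresh))
    return failing


# Hypothetical sample rows for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   030   030   036    Pre-fail  Always   FAILING_NOW 812
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       42
"""

if __name__ == "__main__":
    for attr in failing_attributes(SAMPLE):
        print(attr)
```

In practice the table text would come from the tool's output on a live system rather than a string literal, and a flagged attribute would be written to syslog or sent to an administrator.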
Last update: October 27, 2003. Copyright © 2003 Boon Storage Technologies, Inc. All Rights Reserved.