Thumpers and SMART: When You Suspect A Failed Disk
Posted on November 28, 2008
While not an uncommon problem for storage arrays in general, Thumpers (Solaris/ZFS) are particularly susceptible to "mostly dead" disk issues: situations in which a disk has not outright failed, but IO performance or log messages give you that gut feeling that a drive needs to be swapped out. One would think that Solaris FMA (Fault Management Architecture) would detect and handle these cases, but until the Fishworks team made a series of putbacks into the Nevada 90s builds, it almost never did. So when our SA gut says "swap it" but Solaris doesn't seem to agree, what do we do?
Your drives aren't as stupid as they look, thanks to SMART (Self-Monitoring, Analysis, and Reporting Technology). The state of SMART support for SATA drives on Solaris is pretty crappy (improved by the Fishworks work, but that's a different entry). Thankfully the "Sun Fire X4500 Software" CD includes an amazing utility named "hd", provided by the SUNWhd package. This utility can do a wide variety of things, but most importantly it can a) output a logical-to-physical drive map (so you know which disk is which), and b) query the SMART data of the drives.
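Getting it onto the box is a normal pkgadd from that CD. A quick sketch, with the caveat that the mount point and package directory here are my assumptions; point -d at wherever you actually have the media:

$ pkginfo SUNWhd || echo "SUNWhd not installed"
$ pkgadd -d /cdrom/cdrom0/Product SUNWhd       # CD layout is a guess; adjust the path
$ ls -l /opt/SUNWhd/hd/bin/hd                  # confirm the binary is where the examples below expect it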
If you have a Thumper and have not installed SUNWhd, here is the example that will make you download it now:
---------------------SunFireX4500------Rear----------------------------
 36:   37:   38:   39:   40:   41:   42:   43:   44:   45:   46:   47:
c5t3  c5t7  c4t3  c4t7  c7t3  c7t7  c6t3  c6t7  c1t3  c1t7  c0t3  c0t7
^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++
 24:   25:   26:   27:   28:   29:   30:   31:   32:   33:   34:   35:
c5t2  c5t6  c4t2  c4t6  c7t2  c7t6  c6t2  c6t6  c1t2  c1t6  c0t2  c0t6
^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++
 12:   13:   14:   15:   16:   17:   18:   19:   20:   21:   22:   23:
c5t1  c5t5  c4t1  c4t5  c7t1  c7t5  c6t1  c6t5  c1t1  c1t5  c0t1  c0t5
^++   ^++   ^++   ^++   ^++   ^++   ^--   ^--   ^++   ^--   ^++   ^++
  0:    1:    2:    3:    4:    5:    6:    7:    8:    9:   10:   11:
c5t0  c5t4  c4t0  c4t4  c7t0  c7t4  c6t0  c6t4  c1t0  c1t4  c0t0  c0t4
^b+   ^b+   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++
-------*-----------*-SunFireX4500--*---Front-----*-----------*----------
For that alone, it's worth it. But wait… there's more!
With the same hd utility, using the -r or -R flags, you can pull all the SMART data off all the drives. The -R output gives you a single line per disk for easy browsing:
$ /opt/SUNWhd/hd/bin/hd -R
         1     2     3            4   5  7   8    9     10  12            [ temp 194 ]  196...   <--- Key
 0 c5t0  0   500  55877894808  30   0  0  31  20135  0  30  673  673  26 20 35   0   0  0  0
 1 c5t4  0   655  55877304979  30   0  0  33  20134  0  30   65   65  26 21 36   0   0  0  0
 2 c4t0  0   824  55878746782  29   0  0  32  20134  0  29   70   70  25 21 35   0   0  0  0
 3 c4t4  0   662  55877763735  29   1  0  29  20134  0  29   68   68  26 21 36   1   0  0  0
 4 c7t0  0   260  55876977290  30   0  0  32  20135  0  30   71   71  26 21 36   0   0  0  0
 5 c7t4  0  1201  55877436058  30   0  0  32  20135  0  30   71   71  26 21 36   0   1  0  0
 6 c6t0  0   758  55878484644  30   0  0  32  20135  0  30   55   55  27 22 36   0   0  0  0
 7 c6t4  0   950  55877239437  30  23  0  31  20134  0  30   72   72  26 21 36  24   0  0  0
 8 c1t0  0  1442  55876780678  29   5  0  33  20134  0  29   68   68  27 21 36   5   1  0  0
 9 c1t4  0  1616  55877763727  29  27  0  33  20134  0  29   67   67  26 20 36  29  18  4  0
10 c0t0  0   955  55876911756  29   0  0  32  20134  0  29   68   68  27 20 36   0   0  0  0
11 c0t4  0  1428  55877567125  29   6  0  31  20134  0  29   63   63  28 21 37   6   0  0  0
Please note that the second line (the "key") is my addition; we'll get back to that.
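Because it is one line per disk, the -R output is also handy to snapshot and diff over time. Nothing fancy, just something along these lines (file names are arbitrary):

$ /opt/SUNWhd/hd/bin/hd -R > /var/tmp/hd-R.`date +%Y%m%d`
$ diff /var/tmp/hd-R.20081127 /var/tmp/hd-R.20081128     # see which counters moved overnight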
To better understand this output, let's look at the more verbose -r output for just a single disk. Let's first look at a healthy disk:
15 c4t5
======
Revision: 16
Offline status                             130
Selftest status                            0
Seconds to collect                         10419
Time in minutes to run short selftest      1
Time in minutes to run extended selftest   174
Offline capability                         91
SMART capability                           3
Error logging capability                   1
Checksum                                   0x8b
    Identification                 Status  Current  Worst  Raw data
  1 Raw read error rate            0xb         100    100  0
  2 Throughput performance         0x5         110    110  789
  3 Spin up time                   0x7         104    104  55878484641
  4 Start/Stop count               0x12        100    100  29
  5 Reallocated sector count       0x33        100    100  0
  7 Seek error rate                0xb         100    100  0
  8 Seek time performance          0x5         136    136  31
  9 Power on hours count           0x12         98     98  20134
 10 Spin retry count               0x13        100    100  0
 12 Device power cycle count       0x32        100    100  29
192 Power off retract count        0x32        100    100  71
193 Load cycle count               0x12        100    100  71
194 Temperature                    0x2         189    189  29/ 23/ 38 (degrees C cur/min/max)
196 Reallocation event count       0x32        100    100  0
197 Current pending sector count   0x22        100    100  0
198 Scan uncorrected sector count  0x8         100    100  0
199 Ultra DMA CRC error count      0xa         200    253  0
You can find explanations of these attributes here and there, and even in the official T13 SMART Attributes Annex (PDF)... but here is my short reference for the most important values to watch:
- 1 Raw read error rate: Count of non-corrected read errors, i.e. how often errors occur while reading raw data from the disk. More errors (which shows up as a lower normalized attribute value) means a worse disk surface.
- 2 Throughput performance: Overall (general) throughput performance of HDD
- 5 Reallocated sector count: Quantity of remapped sectors
- 192 Power off retract count: Number of emergency head-retract cycles caused by the drive losing power (Fujitsu calls this the Emergency Retract Cycle Count)
- 193 Load cycle count: Number of head load/unload cycles into the landing zone position
- 196 Reallocation event count: Quantity of remapping operations
- 197 Current pending sector count: Current quantity of unstable sectors (waiting for remapping)
- 198 Scan uncorrected sector count: Quantity of uncorrected errors (This is perhaps the single best value to watch.)
In my experience thus far, #1 and #5 are important to watch and a good indication that things are heading south, but they are not unusual at reasonable levels. The values to really watch are 196, 197, and 198. If any of these is non-zero, things are bad. Chief of all, 198: if there were any single value that would cause me to "swap to be on the safe side", it would be 198.
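If you want to eyeball those three counters across the whole box at once, a quick awk pass over the -R output works. This is only a sketch, and it assumes the column layout shown above (attributes 196, 197, 198, and 199 as the last four fields); sanity-check it against your own hd output first:

$ /opt/SUNWhd/hd/bin/hd -R | awk '
    # data rows have a cXtY device name in the second field
    $2 ~ /^c[0-9]+t[0-9]+$/ {
        realloc = $(NF-3)      # 196 Reallocation event count
        pending = $(NF-2)      # 197 Current pending sector count
        uncorr  = $(NF-1)      # 198 Scan uncorrected sector count
        if (realloc + pending + uncorr > 0)
            printf "%s: 196=%s 197=%s 198=%s  <-- take a closer look\n", $2, realloc, pending, uncorr
    }'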
Here is an example (-r) of a really jacked up drive:
22 c0t1
======
Revision: 16
Offline status                             132
Selftest status                            0
Seconds to collect                         10419
Time in minutes to run short selftest      1
Time in minutes to run extended selftest   174
Offline capability                         91
SMART capability                           3
Error logging capability                   1
Checksum                                   0xf1
    Identification                 Status  Current  Worst  Raw data
  1 Raw read error rate            0xb          53     53  5133961
  2 Throughput performance         0x5         109    109  829
  3 Spin up time                   0x7         104    104  55878353565
  4 Start/Stop count               0x12        100    100  29
  5 Reallocated sector count       0x33          1      1  8
  7 Seek error rate                0xb         100    100  0
  8 Seek time performance          0x5         136    136  31
  9 Power on hours count           0x12         98     98  20134
 10 Spin retry count               0x13        100    100  0
 12 Device power cycle count       0x32        100    100  29
192 Power off retract count        0x32        100    100  65
193 Load cycle count               0x12        100    100  65
194 Temperature                    0x2         183    183  30/ 22/ 38 (degrees C cur/min/max)
196 Reallocation event count       0x32        100    100  8
197 Current pending sector count   0x22          1      1  1891
198 Scan uncorrected sector count  0x8           1      1  56254
199 Ultra DMA CRC error count      0xa         200    253  0
This drive might as well have been run over by a Mack truck. 56,254 scanned uncorrected sectors? Eject... immediately.
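Once you've made that call, the ZFS side of the swap is the easy part. A rough sketch only; the pool name ("tank") and the SATA attachment point below are made up, so get the real ones from zpool status and cfgadm -al:

$ zpool status tank                  # confirm which pool/vdev c0t1d0 belongs to
$ zpool offline tank c0t1d0          # take the suspect disk out of service
$ cfgadm -al | grep sata             # find the attachment point for that slot
$ cfgadm -c unconfigure sata0/1      # spin it down so it can be pulled safely
  # ...physically swap the drive, then...
$ cfgadm -c configure sata0/1
$ zpool replace tank c0t1d0          # resilver onto the replacement disk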
If you're a savvy storage admin, your keen mind is probably telling you to go and review Google's FAST paper: Failure Trends in a Large Disk Drive Population. This paper used Google's massive deployment to examine correlations between disk failures and, in particular, the SMART data that might have predicted them. It's important to note that Google, wisely, counts as a "failure" any event in which an admin swaps the drive (errors, dead, whatever).
If you haven't read the paper, do it... now. But here are a couple of choice quotes relating to SMART:
- Scan Errors: "After the first scan error, drives are 39 times more likely to fail within 60 days"
- Reallocation Counts: "After the first reallocation, drives are over 14 times more likely to fail within 60 days"
- Offline Reallocations: "After the first offline reallocation, drives have over 21 times higher chances of failure within 60 days"
- Probational Counts: "after the first event, drives are 16 times more likely to fail within 60 days"
- Conclusions: "Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever."
While many people have read this paper and simply walked away saying "Yup, SMART is useless, yet again," I want to disagree. When you combine Google's research with your SA instinct, you arrive at a good balance. To put it another way: I don't think you should poll SMART every 5 minutes and swap a drive the moment you get a non-zero value, but when you feel like there is something wrong with the disks in your system and you just don't have proof, SMART is the answer.