Thumpers and SMART: When You Suspect A Failed Disk

Posted on November 28, 2008

Although “mostly dead” disks are not an uncommon problem for storage arrays in general, Thumpers (Solaris/ZFS) are particularly susceptible to them. This is a situation in which a disk has not failed, but IO performance or log messages give you that gut feeling that a drive needs to be swapped out. One would think that Solaris FMA (Fault Management Architecture) would detect and handle these, but until the Fishworks team made a series of putbacks to the Nevada 90’s builds it almost never did. So when our SA gut says “swap it” but Solaris doesn’t seem to agree, what do we do?

Your drives aren’t as stupid as they look, thanks to SMART (Self-Monitoring, Analysis, and Reporting Technology). The state of SMART for SATA drives on Solaris is pretty crappy (improved via Fishworks work, but that’s a different entry). Thankfully the “Sun Fire X4500 Software” CD includes an amazing utility named “hd”, provided by the SUNWhd package. This utility can do a wide variety of things, but most importantly it a) can output a logical-to-physical drive map (helps you know which disk is which), and b) can query the SMART data of the drives.

If you have a Thumper and have not installed SUNWhd, here is the example that will make you download it now:

---------------------SunFireX4500------Rear----------------------------

36:   37:   38:   39:   40:   41:   42:   43:   44:   45:   46:   47:
c5t3  c5t7  c4t3  c4t7  c7t3  c7t7  c6t3  c6t7  c1t3  c1t7  c0t3  c0t7
^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++
24:   25:   26:   27:   28:   29:   30:   31:   32:   33:   34:   35:
c5t2  c5t6  c4t2  c4t6  c7t2  c7t6  c6t2  c6t6  c1t2  c1t6  c0t2  c0t6
^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++
12:   13:   14:   15:   16:   17:   18:   19:   20:   21:   22:   23:
c5t1  c5t5  c4t1  c4t5  c7t1  c7t5  c6t1  c6t5  c1t1  c1t5  c0t1  c0t5
^++   ^++   ^++   ^++   ^++   ^++   ^--   ^--   ^++   ^--   ^++   ^++
 0:    1:    2:    3:    4:    5:    6:    7:    8:    9:   10:   11:
c5t0  c5t4  c4t0  c4t4  c7t0  c7t4  c6t0  c6t4  c1t0  c1t4  c0t0  c0t4
^b+   ^b+   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++
-------*-----------*-SunFireX4500--*---Front-----*-----------*----------

For that alone, it’s worth it. But wait… there’s more!

With the same hd utility and the -r or -R flag, you can pull the SMART data off all the drives. The -R flag prints a single line per disk for easy browsing:

$ /opt/SUNWhd/hd/bin/hd -R
                1  2    3           4 5 7 8  9    10 12         [ temp 194 ] 196...   <--- Key
 0 c5t0         0 500  55877894808 30 0 0 31 20135 0 30 673 673  26  20  35 0 0 0 0
 1 c5t4         0 655  55877304979 30 0 0 33 20134 0 30 65 65  26  21  36 0 0 0 0
 2 c4t0         0 824  55878746782 29 0 0 32 20134 0 29 70 70  25  21  35 0 0 0 0
 3 c4t4         0 662  55877763735 29 1 0 29 20134 0 29 68 68  26  21  36 1 0 0 0
 4 c7t0         0 260  55876977290 30 0 0 32 20135 0 30 71 71  26  21  36 0 0 0 0
 5 c7t4         0 1201  55877436058 30 0 0 32 20135 0 30 71 71  26  21  36 0 1 0 0
 6 c6t0         0 758  55878484644 30 0 0 32 20135 0 30 55 55  27  22  36 0 0 0 0
 7 c6t4         0 950  55877239437 30 23 0 31 20134 0 30 72 72  26  21  36 24 0 0 0
 8 c1t0         0 1442  55876780678 29 5 0 33 20134 0 29 68 68  27  21  36 5 1 0 0
 9 c1t4         0 1616  55877763727 29 27 0 33 20134 0 29 67 67  26  20  36 29 18 4 0
10 c0t0         0 955  55876911756 29 0 0 32 20134 0 29 68 68  27  20  36 0 0 0 0
11 c0t4         0 1428  55877567125 29 6 0 31 20134 0 29 63 63  28  21  37 6 0 0 0

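Please note that the "Key" line (the second line above) is my addition, not part of the hd output; it labels the columns with their SMART attribute IDs. We'll get back to that.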
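If you want to feed the -R output into a script, here is a minimal Python sketch of splitting one row into named fields. The column order is my assumption, inferred from the sample rows and the key line rather than from any hd documentation: attributes 1, 2, 3, 4, 5, 7, 8, 9, 10, 12, 192, 193, then three temperature readings (assumed to be current/min/max), then 196 through 199. The names `COLUMNS` and `parse_hd_line` are made up for this example.

```python
# Assumed column order for "hd -R" rows, inferred from the sample output.
# This is NOT taken from hd documentation; verify it against "hd -r".
COLUMNS = [
    "1", "2", "3", "4", "5", "7", "8", "9", "10", "12",
    "192", "193", "temp_cur", "temp_min", "temp_max",
    "196", "197", "198", "199",
]


def parse_hd_line(line):
    """Parse one 'hd -R' row into (slot, device, {column: raw value}).

    Returns None for anything that does not look like a disk row
    (for example the hand-added key line).
    """
    fields = line.split()
    if len(fields) != 2 + len(COLUMNS) or not fields[0].isdigit():
        return None
    try:
        values = [int(v) for v in fields[2:]]
    except ValueError:
        return None
    return int(fields[0]), fields[1], dict(zip(COLUMNS, values))


# Row for slot 9, copied from the output above.
sample = " 9 c1t4         0 1616  55877763727 29 27 0 33 20134 0 29 67 67  26  20  36 29 18 4 0"
slot, device, attrs = parse_hd_line(sample)
print(slot, device, attrs["5"], attrs["196"], attrs["197"], attrs["198"])
# prints: 9 c1t4 27 29 18 4
```

Rows with a different number of fields are skipped rather than guessed at, so a change in the hd output format shows up as missing disks instead of silently wrong numbers.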

To better understand this output, let's look at the more verbose -r output for just a single disk. Let's first look at a healthy disk:

15 c4t5
======
Revision: 16
Offline status 130
Selftest status 0
Seconds to collect 10419
Time in minutes to run short selftest 1
Time in minutes to run extended selftest 174
Offline capability 91
SMART capability 3
Error logging capability 1
Checksum 0x8b
Identification                     Status Current Worst         Raw data
  1 Raw read error rate            0xb        100   100                0
  2 Throughput performance         0x5        110   110              789
  3 Spin up time                   0x7        104   104      55878484641
  4 Start/Stop count               0x12       100   100               29
  5 Reallocated sector count       0x33       100   100                0
  7 Seek error rate                0xb        100   100                0
  8 Seek time performance          0x5        136   136               31
  9 Power on hours count           0x12        98    98            20134
 10 Spin retry count               0x13       100   100                0
 12 Device power cycle count       0x32       100   100               29
192 Power off retract count        0x32       100   100               71
193 Load cycle count               0x12       100   100               71
194 Temperature                    0x2        189   189  29/ 23/ 38 (degrees C cur/min/max)
196 Reallocation event count       0x32       100   100                0
197 Current pending sector count   0x22       100   100                0
198 Scan uncorrected sector count  0x8        100   100                0
199 Ultra DMA CRC error count      0xa        200   253                0

You can find explanations of these here and there, and even the Official T13 SMART Attributes Annex (PDF)... but here is my short reference for the most important values to watch:

  • 1 Raw read error rate: Count of non-corrected read errors, i.e. how often errors appear while reading raw data from the disk. More errors (i.e. a lower attribute value) mean a worse disk surface
  • 2 Throughput performance: Overall (general) throughput performance of HDD
  • 5 Reallocated sector count: Quantity of remapped sectors
  • 192 Power off retract count: Number of the fixed 'turning off drive' cycles (Fujitsu: Emergency Retract Cycle Count)
  • 193 Load cycle count: Number of cycles into Landing Zone position
  • 196 Reallocation event count: Quantity of remapping operations
  • 197 Current pending sector count: Current quantity of unstable sectors (waiting for remapping)
  • 198 Scan uncorrected sector count: Quantity of uncorrected errors (This is perhaps the single best value to watch.)

In my experience thus far, #1 and #5 are important to watch and a good indication that things are heading south, but non-zero values are not to be considered unusual at reasonable levels. The values to really watch are 196, 197 and 198. If any of these values are non-zero, things are bad. Chief of all, 198. If there was any single value that would cause me to "swap to be on the safe side", it would be 198.
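That rule of thumb can be written down as a tiny triage function. This is only a sketch of my heuristic, not anything FMA or hd does. It assumes you already have the raw values in a dict keyed by attribute ID, the function name `triage` and the result labels are invented for the example, and since I never put a number on "reasonable levels" for #1 and #5, it simply flags any non-zero value as "watch".

```python
def triage(raw):
    """Classify a disk from its raw SMART values.

    raw maps attribute ID strings ("5", "198", ...) to raw integer values.
    Missing attributes are treated as zero.
    """
    # 198 (scan uncorrected sector count) is the one that says "swap it".
    if raw.get("198", 0) > 0:
        return "swap"
    # Non-zero 196 or 197 means things are bad.
    if raw.get("196", 0) > 0 or raw.get("197", 0) > 0:
        return "bad"
    # 1 and 5 are early indicators; any non-zero value is flagged here
    # (a simplification, no threshold for "reasonable levels" is assumed).
    if raw.get("1", 0) > 0 or raw.get("5", 0) > 0:
        return "watch"
    return "ok"


# Illustrative raw values, not real measurements.
disks = {
    "disk-a": {"1": 0, "5": 0, "196": 0, "197": 0, "198": 0},
    "disk-b": {"1": 0, "5": 3, "196": 0, "197": 0, "198": 0},
    "disk-c": {"1": 0, "5": 23, "196": 24, "197": 0, "198": 0},
    "disk-d": {"1": 0, "5": 27, "196": 29, "197": 18, "198": 4},
}
for name, raw in disks.items():
    print(name, triage(raw))
# prints:
# disk-a ok
# disk-b watch
# disk-c bad
# disk-d swap
```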

Here is an example (-r) of a really jacked up drive:

22 c0t1
======
Revision: 16
Offline status 132
Selftest status 0
Seconds to collect 10419
Time in minutes to run short selftest 1
Time in minutes to run extended selftest 174
Offline capability 91
SMART capability 3
Error logging capability 1
Checksum 0xf1
Identification                     Status Current Worst         Raw data
  1 Raw read error rate            0xb         53    53          5133961
  2 Throughput performance         0x5        109   109              829
  3 Spin up time                   0x7        104   104      55878353565
  4 Start/Stop count               0x12       100   100               29
  5 Reallocated sector count       0x33         1     1                8
  7 Seek error rate                0xb        100   100                0
  8 Seek time performance          0x5        136   136               31
  9 Power on hours count           0x12        98    98            20134
 10 Spin retry count               0x13       100   100                0
 12 Device power cycle count       0x32       100   100               29
192 Power off retract count        0x32       100   100               65
193 Load cycle count               0x12       100   100               65
194 Temperature                    0x2        183   183  30/ 22/ 38 (degrees C cur/min/max)
196 Reallocation event count       0x32       100   100                8
197 Current pending sector count   0x22         1     1             1891
198 Scan uncorrected sector count  0x8          1     1            56254
199 Ultra DMA CRC error count      0xa        200   253                0

This drive might as well have been run over by a Mack truck. 56,254 scanned uncorrected sectors? Eject... immediately.
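If you would rather not eyeball the verbose output for every disk, the attribute table is regular enough to pick apart with a regular expression. Here is a rough Python sketch, assuming the -r output always looks like the listings in this post (ID, name, hex status, current, worst, raw data). The names `ATTR_LINE` and `parse_verbose` are invented for the example.

```python
import re

# One attribute row: ID, name, hex status, current, worst, raw data.
# Assumes the layout shown in the "hd -r" listings in this post.
ATTR_LINE = re.compile(
    r"^\s*(\d+)\s+(.+?)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(.*\S)\s*$"
)


def parse_verbose(text):
    """Return {attribute_id: (name, current, worst, raw)}.

    raw is an int when it is purely numeric, otherwise the original
    string (the temperature row carries "cur/min/max" text).
    """
    attrs = {}
    for line in text.splitlines():
        match = ATTR_LINE.match(line)
        if not match:
            continue  # header lines, separators, device name, etc.
        attr_id, name, _status, current, worst, raw = match.groups()
        raw_value = int(raw) if raw.isdigit() else raw
        attrs[int(attr_id)] = (name, int(current), int(worst), raw_value)
    return attrs


# A few lines copied from the c0t1 listing above.
sample = """\
22 c0t1
======
Identification                     Status Current Worst         Raw data
  5 Reallocated sector count       0x33         1     1                8
194 Temperature                    0x2        183   183  30/ 22/ 38 (degrees C cur/min/max)
198 Scan uncorrected sector count  0x8          1     1            56254
"""
attrs = parse_verbose(sample)
print(attrs[198])
# prints: ('Scan uncorrected sector count', 1, 1, 56254)
```

Lines that do not carry a hex status field, such as the device name and the capability lines at the top, are ignored.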

If you're a savvy storage admin, your keen mind is probably telling you to go and review Google's FAST paper: Failure Trends in a Large Disk Drive Population. This paper used Google's massive deployment to examine correlations between disk failures and, in particular, SMART data that might have predicted the failure. It's important to note that Google, wisely, considers a "failure" to be any event in which an admin swaps the drive (errors, dead, whatever).

If you haven't read the paper, do it... now. But here are a couple of choice quotes relating to SMART:

  • Scan Errors: "After the first scan error, drives are 39 times more likely to fail within 60 days"
  • Reallocation Counts: "After the first reallocation, drives are over 14 times more likely to fail within 60 days"
  • Offline Reallocations: "After the first offline reallocation, drives have over 21 times higher chances of failure within 60 days"
  • Probational Counts: "after the first event, drives are 16 times more likely to fail within 60 days"
  • Conclusions: "Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever."

While many people have read this paper and simply walked away saying "Yup, SMART is useless, yet again", I want to disagree. When you combine Google's research with your SA instinct, you arrive at a good balance. To put it another way, I don't think you should poll SMART every 5 minutes and swap a drive because you get a non-zero value, but when you feel like there is something wrong with the disks in your system and just don't have proof, SMART is the answer.