Archive for November, 2008

Thumpers and SMART: When You Suspect A Failed Disk

Friday, November 28th, 2008

While not an uncommon problem for storage arrays, Thumpers (Solaris/ZFS) in particular are susceptible to “mostly dead” disk issues. This is a situation in which a disk has not failed but IO performance or log messages give you that gut feeling that a drive needs to be swapped out. One would think that Solaris FMA (Fault Management Architecture) should detect these and handle them, but until the Fishworks team made a series of putbacks to the Nevada 90s builds it almost never did. So when our SA gut says “swap it” but Solaris doesn’t seem to agree, what do we do?

Your drives aren’t as stupid as they look, thanks to SMART (Self-Monitoring, Analysis, and Reporting Technology). The state of SMART for SATA drives on Solaris is pretty crappy (improved via Fishworks work, but that’s a different entry). Thankfully the “Sun Fire X4500 Software” CD includes an amazing utility named “hd”, provided by the SUNWhd package. This utility can do a wide variety of things, but most importantly it can a) output a logical to physical drive map (helps you know which disk is which), and b) query the SMART data of the drives.

If you have a Thumper and have not installed SUNWhd, here is the example that will make you download it now:

---------------------SunFireX4500------Rear----------------------------

36:   37:   38:   39:   40:   41:   42:   43:   44:   45:   46:   47:
c5t3  c5t7  c4t3  c4t7  c7t3  c7t7  c6t3  c6t7  c1t3  c1t7  c0t3  c0t7
^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++
24:   25:   26:   27:   28:   29:   30:   31:   32:   33:   34:   35:
c5t2  c5t6  c4t2  c4t6  c7t2  c7t6  c6t2  c6t6  c1t2  c1t6  c0t2  c0t6
^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++
12:   13:   14:   15:   16:   17:   18:   19:   20:   21:   22:   23:
c5t1  c5t5  c4t1  c4t5  c7t1  c7t5  c6t1  c6t5  c1t1  c1t5  c0t1  c0t5
^++   ^++   ^++   ^++   ^++   ^++   ^--   ^--   ^++   ^--   ^++   ^++
 0:    1:    2:    3:    4:    5:    6:    7:    8:    9:   10:   11:
c5t0  c5t4  c4t0  c4t4  c7t0  c7t4  c6t0  c6t4  c1t0  c1t4  c0t0  c0t4
^b+   ^b+   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++
-------*-----------*-SunFireX4500--*---Front-----*-----------*----------

For that alone, it’s worth it. But wait… there’s more!

Using the same hd utility with the -r or -R flags, you can pull all the SMART data off all the drives. The -R output gives you one line per disk for easy browsing:

$ /opt/SUNWhd/hd/bin/hd -R
                1  2    3           4 5 7 8  9    10 12         [ temp 194 ] 196...   <--- Key
 0 c5t0         0 500  55877894808 30 0 0 31 20135 0 30 673 673  26  20  35 0 0 0 0
 1 c5t4         0 655  55877304979 30 0 0 33 20134 0 30 65 65  26  21  36 0 0 0 0
 2 c4t0         0 824  55878746782 29 0 0 32 20134 0 29 70 70  25  21  35 0 0 0 0
 3 c4t4         0 662  55877763735 29 1 0 29 20134 0 29 68 68  26  21  36 1 0 0 0
 4 c7t0         0 260  55876977290 30 0 0 32 20135 0 30 71 71  26  21  36 0 0 0 0
 5 c7t4         0 1201  55877436058 30 0 0 32 20135 0 30 71 71  26  21  36 0 1 0 0
 6 c6t0         0 758  55878484644 30 0 0 32 20135 0 30 55 55  27  22  36 0 0 0 0
 7 c6t4         0 950  55877239437 30 23 0 31 20134 0 30 72 72  26  21  36 24 0 0 0
 8 c1t0         0 1442  55876780678 29 5 0 33 20134 0 29 68 68  27  21  36 5 1 0 0
 9 c1t4         0 1616  55877763727 29 27 0 33 20134 0 29 67 67  26  20  36 29 18 4 0
10 c0t0         0 955  55876911756 29 0 0 32 20134 0 29 68 68  27  20  36 0 0 0 0
11 c0t4         0 1428  55877567125 29 6 0 31 20134 0 29 63 63  28  21  37 6 0 0 0

Please note the second "key" line is my addition. We'll get back to that.

To better understand this output, let's look at the more verbose -r output for just a single disk. Let's first look at a healthy disk:

15 c4t5
======
Revision: 16
Offline status 130
Selftest status 0
Seconds to collect 10419
Time in minutes to run short selftest 1
Time in minutes to run extended selftest 174
Offline capability 91
SMART capability 3
Error logging capability 1
Checksum 0x8b
Identification                     Status Current Worst         Raw data
  1 Raw read error rate            0xb        100   100                0
  2 Throughput performance         0x5        110   110              789
  3 Spin up time                   0x7        104   104      55878484641
  4 Start/Stop count               0x12       100   100               29
  5 Reallocated sector count       0x33       100   100                0
  7 Seek error rate                0xb        100   100                0
  8 Seek time performance          0x5        136   136               31
  9 Power on hours count           0x12        98    98            20134
 10 Spin retry count               0x13       100   100                0
 12 Device power cycle count       0x32       100   100               29
192 Power off retract count        0x32       100   100               71
193 Load cycle count               0x12       100   100               71
194 Temperature                    0x2        189   189  29/ 23/ 38 (degrees C cur/min/max)
196 Reallocation event count       0x32       100   100                0
197 Current pending sector count   0x22       100   100                0
198 Scan uncorrected sector count  0x8        100   100                0
199 Ultra DMA CRC error count      0xa        200   253                0

You can find explanations of these here and there, and even the Official T13 SMART Attributes Annex (PDF)... but here is my short reference for the most important values to watch:

  • 1 Raw read error rate: Count of non-corrected read errors, i.e. how often errors appear while reading raw data from the disk. More errors (i.e. a lower attribute value) mean the disk surface is in worse condition.
  • 2 Throughput performance: Overall (general) throughput performance of HDD
  • 5 Reallocated sector count: Quantity of remapped sectors
  • 192 Power off retract count: Number of the fixed 'turning off drive' cycles (Fujitsu: Emergency Retract Cycle Count)
  • 193 Load cycle count: Number of cycles into Landing Zone position
  • 196 Reallocation event count: Quantity of remapping operations
  • 197 Current pending sector count: Current quantity of unstable sectors (waiting for remapping)
  • 198 Scan uncorrected sector count: Quantity of uncorrected errors (This is perhaps the single best value to watch.)

In my experience thus far, #1 and #5 are important to watch and a good indication that things are heading south, but non-zero values are not to be considered unusual at reasonable levels. The values to really watch are 196, 197 and 198. If any of these values are non-zero, things are bad. Chief of all is 198: if there was any single value that would cause me to "swap to be on the safe side", it would be 198.
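If you'd rather not eyeball the -R output for those three values, here is a rough sketch of a filter. flag_suspect_disks is a hypothetical helper of mine, not part of SUNWhd, and the column positions (196, 197 and 198 as fields 18, 19 and 20 of a 21-field line) are an assumption inferred from the sample -R output above, so check them against your own hd output before trusting it.

```shell
#!/bin/bash
# Hypothetical helper: list disks whose SMART 196/197/198 raw values are non-zero.
# ASSUMPTION: field positions are inferred from the sample "hd -R" output in this
# post (slot, device, then 19 values); they are not taken from hd documentation.
flag_suspect_disks() {
    awk 'NF == 21 && $1 ~ /^[0-9]+$/ {
        realloc_events = $18; pending = $19; scan_uncorrected = $20
        if (realloc_events + pending + scan_uncorrected > 0)
            printf "%s slot=%s 196=%s 197=%s 198=%s\n", $2, $1, realloc_events, pending, scan_uncorrected
    }'
}

# Intended use (hd path as shown earlier in this post):
#   /opt/SUNWhd/hd/bin/hd -R | flag_suspect_disks
```

This only narrows down which disks deserve a closer look with -r; it is not a replacement for your own judgment.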

Here is an example (-r) of a really jacked up drive:

22 c0t1
======
Revision: 16
Offline status 132
Selftest status 0
Seconds to collect 10419
Time in minutes to run short selftest 1
Time in minutes to run extended selftest 174
Offline capability 91
SMART capability 3
Error logging capability 1
Checksum 0xf1
Identification                     Status Current Worst         Raw data
  1 Raw read error rate            0xb         53    53          5133961
  2 Throughput performance         0x5        109   109              829
  3 Spin up time                   0x7        104   104      55878353565
  4 Start/Stop count               0x12       100   100               29
  5 Reallocated sector count       0x33         1     1                8
  7 Seek error rate                0xb        100   100                0
  8 Seek time performance          0x5        136   136               31
  9 Power on hours count           0x12        98    98            20134
 10 Spin retry count               0x13       100   100                0
 12 Device power cycle count       0x32       100   100               29
192 Power off retract count        0x32       100   100               65
193 Load cycle count               0x12       100   100               65
194 Temperature                    0x2        183   183  30/ 22/ 38 (degrees C cur/min/max)
196 Reallocation event count       0x32       100   100                8
197 Current pending sector count   0x22         1     1             1891
198 Scan uncorrected sector count  0x8          1     1            56254
199 Ultra DMA CRC error count      0xa        200   253                0

This drive might as well have been run over by a Mack truck. 56,254 scanned uncorrected sectors? Eject... immediately.

If you're a savvy storage admin, your keen mind is probably telling you to go and review Google's FAST paper: Failure Trends in a Large Disk Drive Population. This paper used Google's massive deployment to examine correlations between disk failures and, in particular, SMART data that might have predicted the failure. It's important to note that Google, wisely, considers a "failure" as any event in which an admin swaps the drive (errors, dead, whatever).

If you haven't read the paper, do it... now. But here are a couple of choice quotes relating to SMART:

  • Scan Errors: "After the first scan error, drives are 39 times more likely to fail within 60 days"
  • Reallocation Counts: "After the first reallocation, drives are over 14 times more likely to fail within 60 days"
  • Offline Reallocations: "After the first offline reallocation, drives have over 21 times higher chances of failure within 60 days"
  • Probational Counts: "after the first event, drives are 16 times more likely to fail within 60 days"
  • Conclusions: "Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever."

While many people have read this paper and simply walked away saying "Yup, SMART is useless, yet again" I want to disagree. When you combine Google's research with your SA instinct, we arrive at a good balance. To put it another way, I don't think you should poll SMART every 5 minutes and swap a drive because you get a non-zero value, but when you feel like there is something wrong with disks in your system and just don't have proof, SMART is the answer.

Next Article at SearchDataCenter.com

Friday, November 28th, 2008

My second article for SearchDataCenter.com is up here: Examining MySQL in real time using DTrace.

The interesting back story is that this article was requested after I spoke on DTrace at the MySQL Users Conference, but I’ve stalled on it… to the point that 3 weeks later I felt embarrassed to even submit it. As much as I love writing, it’s very different and difficult when it’s expected. My favorite mantra “You get what you pay for” works both ways. :)

So to all you authors out there that actually can manage to pop out columns and papers on fixed time lines, my hat is off to you. I’m hoping to submit a couple more articles to SearchDataCenter before the end of the year… hoping to pay off Christmas early. :)

Giving Thanks

Thursday, November 27th, 2008

Happy Thanksgiving one and all. Already the year is coming to a close… and what a year. Many unfortunate things have happened this year; if you’ve checked up on your 401K or IRAs you’re feeling it. Uncertainty still hangs in the balance and many of us are fearful of the future, and thus it is all the more important that we stop, reflect on what we’ve accomplished, what we have, and most importantly who we have around us.

For myself, I have a lot to be thankful for this year. It’s been a tough one; I’ve slipped badly on a variety of projects and goals and have been massively bummed out about it. And yet, I’ve learned a lot. I’ve grown in both experience and wisdom I think. I had the opportunity to start writing a column this year. I got to meet a lot of great people this year. Gave the first keynote of the first OpenSolaris Storage Summit, which was awesome. Got to be on stage at the Fishworks launch, and even better got to participate in the Fishworks development over the last couple years.

Most importantly, I got to continue working at Joyent around a lot of really good and smart people. Tamarah and I have grown together by yet another year, conceived Conrad, our third child (due to GA in Feb 09), and both Nova and Glenn have grown significantly… I am an immensely proud husband and father. We were blessed with a house that we can afford. And I still get to do what I love, work from home on Solaris.

To all my readers and especially those of you who forgive my poor spelling and grammar, I thank you for your support.

Help with SAPro: Request for Participants

Wednesday, November 26th, 2008

No, the SA Pro podcast isn’t dead… but it needs help. I’ve been surprised how many great interviewees don’t want to do a podcast interview. Between work and shy guests it’s stalled.

I want to do 2 separate podcasts next week. I’d like 3-4 SAs per round-table. Here are the topics:

  1. Time-Management and Project Management for SysAdmins on the move. Managing a lot of spinning plates is tough, especially when you have to manage multiple multi-week projects with minimal staff, assistance, etc. Adapting Software Development practices is a mixed bag of pain.
  2. Tough Economic Times. The economy is in the crapper, checking your 401K can cause health complications, and what is next is a big question mark. Got tips to share, ideas of how our industry will be impacted?

If you’re interested in participating in either of these please email me (benr – cuddletech.com). We will probably record one on Monday at 6PM Pacific and the other on Wed at 6PM Pacific. Participation is via Skype con-call. I can dial out to non-skype users, so don’t let that keep you from participating.

Remember the qualifications for participation are that you must have 10 years of industry experience, agree with the System Administrators’ Code of Ethics, and a sense of humor is considered a bonus.

Help a brother out… should be interesting discussions.

ZYNK: The Zuper Zimple ZFS Sync (Replication) Tool

Monday, November 17th, 2008

I’ve been working on building better and better ZFS replication tools for use at Joyent, but it often gets complex and frustrating because, although replication in ZFS is very simplistic, managing all the snapshots, retentions, and mountains of error checking and handling, on top of reporting and stats collection, is a nightmare.

So, just to relax I wrote a fun simple replication tool I call “Zynk”. It’s pathetically simple (read: elegant) and fun. As the comment says, if something breaks, it’s a pita to clean up, but otherwise should work well when set in motion. The intention is to run from cron every 30-600 seconds or so, but be aware that you should do the first run manually, because that’s gonna take some time… the incrementals afterwards should be able to run in less than whatever frequency via cron you set.

#!/bin/bash
## ZYNK: The Zuper Zimple ZFS Sync (Replication) Tool
## Form: zynk local/dataset root@remote.host destination/dataset

# Please note: The reason this is so simple is because there is no error checking, reporting, or cleanup.
#               In the event that something goes wonky, you'll manually need to fix the snapshots and
#               modify or remove the /var/run/zynk datafile which contains the most recent snapshot name.
# Furthermore, this absolutely relies on the GNU version of 'date' in order to get epoch time
# Before using, make sure you've distributed your SSH key to the remote host and can ssh without password.

# Quote the argument so the test stays well-formed, and exit non-zero on a usage error.
if [ -z "$3" ]
then
        echo "Usage: zynk local/dataset root@remote.host destination/dataset"
        echo "WARNING: The destination is the full path for the remote dataset, not the prefix dataset stub."
        exit 1
fi

# A non-GNU date may echo the unknown "%s" format back literally instead of epoch seconds.
DATE=`date +%s`
if [ "$DATE" = "%s" ]
then
        echo "Must use GNU Date, please install and modify script."
        exit 1
fi

if [ -e /var/run/zynk ]
then
        # Datafile is found, creating incr.
        echo "Incremental started at `date`"
        zfs snapshot ${1}@${DATE}
        zfs send  -i  ${1}@`cat /var/run/zynk` ${1}@${DATE} | ssh ${2} zfs recv -F ${3}
        zfs destroy ${1}@`cat /var/run/zynk`
        ssh ${2} zfs destroy ${3}@`cat /var/run/zynk`
        echo ${DATE} > /var/run/zynk
        echo "Incremental complete at `date`"
else
        # Datafile not found, creating full.
        echo "Full started at `date`"
        zfs snapshot ${1}@${DATE}
        zfs send     ${1}@${DATE} | ssh ${2} zfs recv ${3}
        echo ${DATE} > /var/run/zynk
        echo "Full completed at `date`"
fi

Here it is in action:

root@quadra ~$ rm /var/run/zynk
root@quadra ~$ ./zynk data/home/tamr root@localhost backup/zynk/tamr
Full started at Mon Nov 17 13:44:28 PST 2008
Full completed at Mon Nov 17 13:44:28 PST 2008

root@quadra ~$ ./zynk data/home/tamr root@localhost backup/zynk/tamr
Incremental started at Mon Nov 17 13:44:58 PST 2008
Incremental complete at Mon Nov 17 13:44:58 PST 2008

root@quadra ~$ ./zynk data/home/tamr root@localhost backup/zynk/tamr
Incremental started at Mon Nov 17 13:45:01 PST 2008
Incremental complete at Mon Nov 17 13:45:02 PST 2008

root@quadra ~$ ./zynk data/home/tamr root@localhost backup/zynk/tamr
Incremental started at Mon Nov 17 13:45:19 PST 2008
Incremental complete at Mon Nov 17 13:45:20 PST 2008

root@quadra ~$ zfs list -r data/home/tamr backup/zynk/tamr
NAME                          USED  AVAIL  REFER  MOUNTPOINT
backup/zynk/tamr             2.45M   296G  2.45M  /backup/zynk/tamr
backup/zynk/tamr@1226958319      0      -  2.45M  -
data/home/tamr               2.47M   196M  2.45M  /data/home/tamr
data/home/tamr@1226958319        0      -  2.45M  -

What’s important to note is that it only maintains a single snapshot on either source or destination, so you don’t consume a bunch of additional space or have to worry about screwing up quotas.

This isn’t intended so much as a “real” tool, but something you can play around with and hopefully excite the mind about some new fun applications. Add error checking, add retention, add reporting, re-implement in a new language. Have fun…. drink Zima. :)
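As a starting point for the “add error checking” part, here is a minimal sketch of the incremental step that only rotates snapshots and advances the state file when the send/recv pipeline actually succeeds. zynk_incr and its argument order are hypothetical names of mine, not part of Zynk as posted, and the optional fifth argument (the new snapshot name) exists only so the function can be exercised without waiting on the clock.

```shell
#!/bin/bash
# Sketch only: an error-checked variant of Zynk's incremental branch.
set -o pipefail   # make "zfs send | ssh ..." fail if either side fails

# zynk_incr SRC_DATASET REMOTE DEST_DATASET STATEFILE [NEW_SNAPSHOT_NAME]
zynk_incr() {
    local src=$1 remote=$2 dest=$3 state=$4
    local now="${5:-$(date +%s)}"     # GNU date assumed, as in the original script
    local last
    last=$(cat "$state") || return 1  # name of the last snapshot that was replicated

    zfs snapshot "${src}@${now}" || return 1
    if zfs send -i "${src}@${last}" "${src}@${now}" | ssh "$remote" zfs recv -F "$dest"
    then
        # Transfer succeeded: drop the old snapshot on both sides, remember the new one.
        zfs destroy "${src}@${last}"
        ssh "$remote" zfs destroy "${dest}@${last}"
        echo "$now" > "$state"
    else
        # Transfer failed: discard the new local snapshot so the next run retries from $last.
        zfs destroy "${src}@${now}"
        return 1
    fi
}

# Intended use, mirroring the example run above:
#   zynk_incr data/home/tamr root@localhost backup/zynk/tamr /var/run/zynk
```

The point of the design is that the state file always names a snapshot that exists on both ends, so a failed run costs you one interval rather than a manual cleanup.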

For a thorough discussion of ZFS Replication, see my post from a couple weeks ago: Understanding ZFS: Replication, Archive and Backup.

Sun’s Tough Choices: Rich Green Books, 6,000 to Lose Jobs

Friday, November 14th, 2008

Today we got a press release no one wanted to see: Sun Microsystems Aligns Business with Global Economic Climate and Amplifies Growth Opportunities Across Open Source Platforms… how much more rosy you can make it sound, I don’t know. As a result, the stock traded higher today, by 1% at close, but the stock took a beating in after-hours trading and struggled to stay above $4 in early trading.

So let’s break it down, for better or worse…

Many of us have known that Sun is still too large, in terms of employee count. Even with repetitive RIFs (reductions in force), the company keeps making acquisitions and the cuts, as deep and painful as they are, haven’t been deep enough. What a horrible thing to say…. but there it is.

Rich Green is gone. Whether he resigned for personal reasons or not, I don’t know and don’t care. There are a great many of us glad to see him gone, but as for particulars, he hasn’t really been an outward facing fellow, so most news about him comes by way of rumor which I don’t like to subscribe to.

Within Systems Software there are now 3 groups:

  1. Application Platform Software, run by EVP Anil Gadre, formerly Chief Marketing Officer. So these guys handle everything on top of the systems software, including the whole Java stack, databases, software integration, and even Sun Ed.
  2. Systems Platforms, run by the EVP John Fowler, undoubtedly the most public facing of all Sun’s executives, will handle all the software that makes the hardware go, including Solaris, Virtualization including xVM and VirtualBox, management software, etc.
  3. Cloud Computing & Developer Platforms, run by SVP Dave Douglas, will handle NetBeans, OpenOffice, and Network.com, trying to put Sun in a position to leverage the Cloud and build new avenues of business for the company.

So we see a stack here. I’m curious as to how much of systems, from a hardware production perspective, will stay with John Fowler and who will be stepping in.

As usual, two big questions are floating in the air: a) when will Sun be acquired, and b) when will Jonathan step down. As to the first, I don’t think it will happen. The company is a tough one to deal with, and would invariably involve a lot of slicing and dicing. In this tough economic climate I don’t think anyone has the time or money to take on the problems of Sun. As to the latter, I like Jonathan and wish him only the greatest success. I do question many decisions he’s made, the MySQL acquisition first and foremost, and I’m not happy with how distracted Sun is by itself. Systems, Systems, Systems… make great systems, sell great systems, provide software that makes them better… systems systems systems! The bottom line is, the Sun Board of Directors will make that call, and clearly they haven’t felt it was the right decision.

I want to highlight that point. There are a lot of people after Jonathan’s head, but the blame for any missteps is on the Board. They are the final authority and they are signing off on it.

The future is unclear, but it’s time to streamline. I recommend reading my (Joyent) CEO David Young’s perspective: In the Business Cycle of Create, Improve, Destroy: What Sun Needs to Do Now, forward looking given that it was written on Nov 6th.

Fishworks Launches in Las Vegas

Thursday, November 13th, 2008

I was honored to be part of the historic launch of FISHworks, now combined with the “Amber Road” hardware and officially known as the Sun Storage 7000 Unified Storage Systems. At the last minute on Friday, Sun PR apparently needed an extra customer walk-on and my friend Jason Williams from Digitar put in a good word for me… Monday morning I was in Vegas.

If you didn’t see the launch, have a look here. My walk on is 6 minutes into Chapter 4.

Having been running FISHworks for almost 2 years, I’m absolutely ecstatic about this release. Please download the Sun Unified Storage Simulator, which is FISHworks in a can… a VMware image for PC and Mac (apparently Mr. Marsland didn’t convince them to use VirtualBox).

I’m going to write about FW a lot more in the next couple days, but I want to make it very clear that while Analytics is sexy and gets a lot of attention, this is a world class storage system that will rival your existing storage solutions both in terms of performance and ease-of-use, not to mention functionality. Watch the launch videos for an overview and then start playing with the Sim.

Sun Storage Web Event Launch Nov 10th

Friday, November 7th, 2008

I always hate it when events sneak up on me, especially interesting storage events… so mark your calendar for a live webcast and chat on Monday, November 10 at 3:30 p.m. PT. You’ll recall that Jonathan ominously mentioned in his blog recently: “You’ll be hearing more about Open Storage at a launch event we’re holding on November 10th.” So if you read that but weren’t sure what it was, this is the event. I’m sure it’ll be plastered all over Sun.com on Monday, just make sure you put aside time to watch the proceedings.

Additionally, Sun FORUM 2.0 “Sun’s premier annual customer storage event” (that I’d never heard of prior to this year, great PR!) is happening in the Customer Briefing Center of the Menlo Park campus next week. Use the link above to register for the event (free) if you’re in the Bay Area.

Sun.com Gets a Major Facelift

Friday, November 7th, 2008

If you don’t make it a habit to hit the Sun.com front page frequently, have a look. I’ve noticed some incremental change over the last couple days and tonight it’s gotten a major facelift. It now pushes offers in your face more prominently (downloads and try-n-buys), integrates a lot more video (most is rehashed stuff that you may or may not have seen), and, most interestingly, breaks down business sectors.

One interesting thing is banners like Transform Your Email with Microsoft Exchange Server 2007. This suggests a major shift in marketing focus; rather than pushing potential Java Enterprise solution stacks, Sun is just giving in to the fact that the business world is standardized on Exchange: sell sell sell.

I’m not suggesting that Sun hasn’t appealed to the Microsoft base in the past; we’ve gradually seen Microsoft logos slip onto more and more presentation slide decks, but this feels like Sun’s finally put its arms around the beast in light of these tough economic times.

Frankly, I’d go even further to suggest that this is just the first step toward Sun really shifting into rescue mode. While I currently have no visibility into what’s happening in the halls of Sun, the external vibe I’m getting is that there is extreme pressure on the executives to turn this thing around… or else. That might be a “duh” statement, but most of us will agree that Sun’s been in a tough position for a while now… something has changed drastically, something very recent, and I’m not sure it’s the economic downturn/crash, so much as it’s putting pressure on an already disastrous situation.

I could speculate further, but for now let’s just watch and wait.

Understanding ZFS: Replication, Archive and Backup

Thursday, November 6th, 2008

As with other features of ZFS, the traditionally complex is made simple and straightforward. This simplification can coax administrators into a false complacency.

In ZFS, backup, archive, migration… any activity that fundamentally involves the movement of data from one system to another, is a replication activity. I propose that the traditional idea of weekly backups is, in fact, just really slow crappy replication. An HA Cluster replicates every 5 seconds, but your website replicates once a week…. it’s really the same thing, just via a different interval and possibly different tools. So understand that when I say “replication” I refer to all forms of data movement, both intra- and inter-system.

ZFS replication is performed through the use of two simplistic subcommands: zfs send and zfs recv. These are commands that utilize STDIN and STDOUT…. and why? Pipes my friend, pipes. Rather than bake piles of functionality into these commands, Matt Ahrens and the ZFS team opted to instead make them very simple and utilize the traditional UNIX ideology of connecting things together for something even better.

Let’s look at a simple intra-system example of replication. Let’s say that I have a workstation with a couple internal disks, perhaps a RAIDZ, who knows, and I then attach a USB or Firewire external drive on which I create a pool called “backups”. Let’s now migrate a simple dataset from my local “data” pool to my external drive’s “backups” pool:

root@quadra ~$ zfs list -r data
NAME                USED  AVAIL  REFER  MOUNTPOINT
data                222K   218M    19K  /data
data/home           114K   200M    24K  /data/home
data/home/benr       18K   200M    18K  /data/home/benr
data/home/conradr    18K   200M    18K  /data/home/conradr
data/home/glennr     18K   200M    18K  /data/home/glennr
data/home/novar      18K   200M    18K  /data/home/novar
data/home/tamr       18K   200M    18K  /data/home/tamr
root@quadra ~$ zfs list -r backups
NAME      USED  AVAIL  REFER  MOUNTPOINT
backups  67.5K   218M    18K  /backups

root@quadra ~$ zfs snapshot data/home/benr@001
root@quadra ~$ zfs send data/home/benr@001 | zfs recv -d backups

root@quadra ~$ zfs list -r backups
NAME                    USED  AVAIL  REFER  MOUNTPOINT
backups                 191K   218M    19K  /backups
backups/home            106K   218M    18K  /backups/home
backups/home/benr        88K   218M    88K  /backups/home/benr
backups/home/benr@001      0      -    88K  -

Let’s step through this together.

Replication is always based on a static point in time, meaning a snapshot. We create a snapshot of the dataset(s) we want to replicate, in this case the snapshot “001” of benr’s home directory. Using the zfs send command we can send that snapshot to STDOUT. Using a UNIX pipe, that STDOUT gets sent to the STDIN of the zfs recv command, which has been told via the -d backups argument that I want to preserve the dataset name and hierarchy under the “backups” dataset. This could just as easily be a “backups/data-pool” dataset under which things are created, like so:

root@quadra ~$ zfs destroy -r backups/home
root@quadra ~$ zfs create backups/data-pool
root@quadra ~$ zfs send data/home/benr@001 | zfs recv -d backups/data-pool
root@quadra ~$ zfs list -r backups
NAME                              USED  AVAIL  REFER  MOUNTPOINT
backups                           217K   218M    20K  /backups
backups/data-pool                 125K   218M    19K  /backups/data-pool
backups/data-pool/home            106K   218M    18K  /backups/data-pool/home
backups/data-pool/home/benr        88K   218M    88K  /backups/data-pool/home/benr
backups/data-pool/home/benr@001      0      -    88K  -
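To make the -d naming rule concrete: judging by the two listings above, zfs recv -d drops the source pool name and hangs the rest of the sent snapshot's path under the target you give it. A minimal sketch of that mapping follows; recv_d_name is a hypothetical helper that only does string handling and never calls zfs.

```shell
#!/bin/bash
# recv_d_name SENT_SNAPSHOT TARGET
# Hypothetical helper: prints the name "zfs recv -d TARGET" would give the received
# snapshot, ASSUMING -d strips only the leading pool name (as the listings suggest).
recv_d_name() {
    local sent=$1 target=$2
    echo "${target}/${sent#*/}"    # "${sent#*/}" removes everything up to the first "/"
}

recv_d_name data/home/benr@001 backups             # prints backups/home/benr@001
recv_d_name data/home/benr@001 backups/data-pool   # prints backups/data-pool/home/benr@001
```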

What about incremental? I mean, I’ll want to freshen the copy right? This is done by creating another snapshot, and then telling zfs send to only actually send the difference between the two:

root@quadra ~$ cp -r /etc/security/* /data/home/benr
root@quadra ~$ zfs snapshot data/home/benr@002
root@quadra ~$ zfs list -r data/home/benr
NAME                 USED  AVAIL  REFER  MOUNTPOINT
data/home/benr       379K   199M   355K  /data/home/benr
data/home/benr@001    24K      -    88K  -
data/home/benr@002      0      -   355K  -

root@quadra ~$ zfs send -i data/home/benr@001 data/home/benr@002 | zfs recv -d backups/data-pool
root@quadra ~$ zfs list -r backups/data-pool
NAME                              USED  AVAIL  REFER  MOUNTPOINT
backups/data-pool                 417K   217M    19K  /backups/data-pool
backups/data-pool/home            398K   217M    19K  /backups/data-pool/home
backups/data-pool/home/benr       379K   217M   355K  /backups/data-pool/home/benr
backups/data-pool/home/benr@001    24K      -    88K  -
backups/data-pool/home/benr@002      0      -   355K  -

So here I used ZFS send/recv almost exactly as before, but this time I tell zfs send about another snapshot from which to create an incremental. Notice that the zfs recv command didn’t change at all.

But what if I want to send it to another system? Easy, pipe the data through ssh (or rsh, or whatever) like so:

root@quadra ~$ zfs send data/home/benr@002 | ssh root@thumper.cuddletech.com zfs recv -d backups/data-pool

So that’s the basics… but what does this mean? Let’s get creative!

Firstly, we can write a script that every 30 seconds creates a new snapshot, and then thanks to pre-shared SSH keys can use SSH like above to recv the data elsewhere. Add a little error checking and presto! A really nice, simplistic replication scheme. Even if you have a lot of data change, if you copy it every 30 seconds it’s unlikely to build up into huge chunks that will take very long to move. When it comes to data that changes frequently, the key is to move early and often!

Now, say we don’t need that, simple backups are fine. We can create a script that creates a new snapshot each day at midnight, named for the day of the week. When Wed comes around the old “wed” snapshot is removed and a new one created, and then we may create a simple script that zfs send/recv’s the Friday snapshot every weekend. Simple to do, plus we have those daily snapshots to fall back on in a pinch, hopefully keeping us from going out to a remote copy.
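A rough sketch of that day-of-the-week rotation might look like the following. rotate_daily_snapshot, the ZFS_CMD override and the optional day argument are all hypothetical conveniences of mine so the function can be dry-run; the dataset name is just the one used in this post.

```shell
#!/bin/bash
# rotate_daily_snapshot DATASET [DAY]
# Hypothetical helper: replaces the snapshot named after the weekday (mon, tue, ...).
# ZFS_CMD is an assumed override so the function can be dry-run with "echo".
rotate_daily_snapshot() {
    local dataset=$1
    local day="${2:-$(LC_ALL=C date +%a | tr 'A-Z' 'a-z')}"
    # Last week's snapshot of the same name may not exist on the first run; ignore that error.
    ${ZFS_CMD:-zfs} destroy "${dataset}@${day}" 2>/dev/null
    ${ZFS_CMD:-zfs} snapshot "${dataset}@${day}"
}

# Dry run: prints the two zfs commands instead of executing them.
ZFS_CMD=echo rotate_daily_snapshot data/home/benr wed
```

Run it from cron at midnight, and let a second weekend job zfs send the “fri” snapshot through ssh exactly as in the earlier example.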

So we’ve used pipes in a simple way, to securely transport our datastream from one system to another. Consider other unique possibilities, such as piping zfs send… into gzip before sending across the network!
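Here is a small sketch of the gzip idea. send_gz and recv_gz are hypothetical wrapper names; they simply put gzip on the sending side of a pipe and gunzip on the receiving side, and the host and dataset names in the comment are the ones already used in this post.

```shell
#!/bin/bash
# Sketch: compress a replication stream before it crosses the network.
# send_gz runs the given producer command and gzips its STDOUT;
# recv_gz gunzips STDIN and feeds it to the given consumer command.
send_gz() { "$@" | gzip -c; }
recv_gz() { gunzip -c | "$@"; }

# Intended use (the remote side has to undo the compression before zfs recv):
#   send_gz zfs send data/home/benr@002 | \
#       ssh root@thumper.cuddletech.com 'gunzip -c | zfs recv -d backups/data-pool'
```

Whether this beats simply turning on compression in ssh itself (the -C flag) depends on your data and CPUs, so measure before committing to either.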

Or… say what you really want is a portable dump of your ZFS dataset(s). Remember that we’re outputting a datastream from zfs send… just redirect STDOUT to a file!

root@quadra ~$ zfs send data/home/benr@002 > /tmp/home-benr.zdump
root@quadra ~$ ls -lh /tmp/home-benr.zdump
-rw-r--r-- 1 root root 421K Nov  6 15:14 /tmp/home-benr.zdump

Now let’s test a restore from this “zdump”:

root@quadra ~$ zfs create backups/dump-restore
root@quadra ~$ cat /tmp/home-benr.zdump | zfs recv -d backups/dump-restore
root@quadra ~$ zfs list -r backups/dump-restore
NAME                                 USED  AVAIL  REFER  MOUNTPOINT
backups/dump-restore                 392K   217M    19K  /backups/dump-restore
backups/dump-restore/home            373K   217M    18K  /backups/dump-restore/home
backups/dump-restore/home/benr       355K   217M   355K  /backups/dump-restore/home/benr
backups/dump-restore/home/benr@002      0      -   355K  -

Works like a charm! Again, we can use pipes for fun here too. Let’s say that we really want a dump that is encrypted and compressed!

root@quadra ~$ pktool genkey keystore=file outkey=zdump.key keytype=aes keylen=128
root@quadra ~$ zfs send data/home/benr@002 | gzip | encrypt -a aes -k zdump.key > /tmp/home_benr-AES128GZ.zdump

So we’ve output a datastream based on a snapshot (002), compressed it, encrypted it with 128-bit AES, and then dumped it to a file. We could just as easily dump it to a tape (/dev/rmt/0cbn or something) for archiving purposes.
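
Getting the data back out should just be the same pipeline in reverse: decrypt, gunzip, then zfs recv. The sketch below assumes the Solaris decrypt command (the companion to encrypt) and the same key file and algorithm used to create the dump. The function name and the DECRYPT_CMD/ZFS_CMD variables are hypothetical.

```shell
#!/bin/sh
# Sketch: restore a gzip'd, AES-encrypted zfs send dump.
# DECRYPT_CMD and ZFS_CMD exist only so the commands can be swapped for stubs.
DECRYPT_CMD=${DECRYPT_CMD:-decrypt}
ZFS_CMD=${ZFS_CMD:-zfs}

# restore_encrypted_dump KEYFILE DUMPFILE TARGET
# Undoes the steps in the opposite order they were applied:
# decrypt first, then uncompress, then hand the stream to zfs recv.
restore_encrypted_dump() {
    "$DECRYPT_CMD" -a aes -k "$1" -i "$2" | gunzip -c | "$ZFS_CMD" recv -d "$3"
}

# Example (not run automatically; the dump path is a placeholder):
# restore_encrypted_dump zdump.key /path/to/dump backups/dump-restore
```

Keep the key file somewhere safe and separate from the dump; without it the archive is useless.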

Finally, what if we want to work on more than just a single snapshot? What if we want to send all the “home” datasets? For some time now (although just now arriving in Solaris 10) we’ve had recursive flags for both zfs snapshot and zfs send. Let’s give it a try:

root@quadra ~$ zfs snapshot -r data/home@nov6
root@quadra ~$ zfs list -r data/home
NAME                     USED  AVAIL  REFER  MOUNTPOINT
data/home                755K   199M    24K  /data/home
data/home@nov6              0      -    24K  -
data/home/benr           379K   199M   355K  /data/home/benr
data/home/benr@001        24K      -    88K  -
data/home/benr@002          0      -   355K  -
data/home/benr@nov6         0      -   355K  -
data/home/conradr         88K   199M    88K  /data/home/conradr
data/home/conradr@nov6      0      -    88K  -
data/home/glennr          88K   199M    88K  /data/home/glennr
data/home/glennr@nov6       0      -    88K  -
data/home/novar           88K   199M    88K  /data/home/novar
data/home/novar@nov6        0      -    88K  -
data/home/tamr            88K   199M    88K  /data/home/tamr
data/home/tamr@nov6         0      -    88K  -

root@quadra ~$ zfs destroy -r backups/home
root@quadra ~$ zfs list -r backups
NAME      USED  AVAIL  REFER  MOUNTPOINT
backups    86K   218M    20K  /backups

root@quadra ~$ zfs send -R data/home@nov6 | zfs recv -d backups

root@quadra ~$ zfs list -r backups
NAME                        USED  AVAIL  REFER  MOUNTPOINT
backups                     902K   217M    18K  /backups
backups/home                755K   199M    24K  /backups/home
backups/home@nov6              0      -    24K  -
backups/home/benr           379K   199M   355K  /backups/home/benr
backups/home/benr@001        24K      -    88K  -
backups/home/benr@002          0      -   355K  -
backups/home/benr@nov6         0      -   355K  -
backups/home/conradr         88K   199M    88K  /backups/home/conradr
backups/home/conradr@nov6      0      -    88K  -
backups/home/glennr          88K   199M    88K  /backups/home/glennr
backups/home/glennr@nov6       0      -    88K  -
backups/home/novar           88K   199M    88K  /backups/home/novar
backups/home/novar@nov6        0      -    88K  -
backups/home/tamr            88K   199M    88K  /backups/home/tamr
backups/home/tamr@nov6         0      -    88K  -

Simple: just snapshot the parent dataset with the -r flag, then send the parent dataset snapshot with the -R flag. Otherwise, it’s all the same! And, of course, you can combine this with all our other pipe tricks just the same!

And so we see that using a single set of commands, we have simple and powerful replication, backup, and archive capabilities. A lot of power unleashed with just a little imagination; that’s the ZFS way.