Understanding ZFS: Replication, Archive and Backup

06 Nov '08 - 21:12 by benr

As with other features of ZFS, the traditionally complex is made simple and straightforward. This simplification can coax administrators into a false complacency.

In ZFS, backup, archive, migration... any activity that fundamentally involves moving data from one system to another is a replication activity. I propose that the traditional idea of weekly backups is, in fact, just really slow, crappy replication. An HA cluster replicates every 5 seconds, but your website replicates once a week... it's really the same thing, just at a different interval and possibly with different tools. So understand that when I say "replication" I refer to all forms of data movement, both intra- and inter-system.

ZFS replication is performed through two simple subcommands: zfs send and zfs recv. zfs send writes to STDOUT and zfs recv reads from STDIN... and why? Pipes, my friend, pipes. Rather than bake piles of functionality into these commands, Matt Ahrens and the ZFS team opted to keep them very simple and follow the traditional UNIX ideology of connecting things together to build something even better.

Let's look at a simple intra-system example of replication. Let's say that I have a workstation with a couple of internal disks, perhaps a RAIDZ, who knows, and I then attach a USB or FireWire external drive on which I create a pool called "backups". Let's now migrate a simple dataset from my local "data" pool to my external drive's "backups" pool:

root@quadra ~$ zfs list -r data
data                222K   218M    19K  /data
data/home           114K   200M    24K  /data/home
data/home/benr       18K   200M    18K  /data/home/benr
data/home/conradr    18K   200M    18K  /data/home/conradr
data/home/glennr     18K   200M    18K  /data/home/glennr
data/home/novar      18K   200M    18K  /data/home/novar
data/home/tamr       18K   200M    18K  /data/home/tamr
root@quadra ~$ zfs list -r backups
backups  67.5K   218M    18K  /backups

root@quadra ~$ zfs snapshot data/home/benr@001
root@quadra ~$ zfs send data/home/benr@001 | zfs recv -d backups

root@quadra ~$ zfs list -r backups
backups                 191K   218M    19K  /backups
backups/home            106K   218M    18K  /backups/home
backups/home/benr        88K   218M    88K  /backups/home/benr
backups/home/benr@001      0      -    88K  -

Let's step through this together.

Replication is always based on a static point in time, meaning a snapshot. We create a snapshot of the dataset(s) we want to replicate, in this case the snapshot "001" of benr's home directory. Using the zfs send command we can send that snapshot to STDOUT. Using a UNIX pipe, that STDOUT gets sent to the STDIN of the zfs recv command, which has been told via the -d backups argument that I want to preserve the dataset name and hierarchy under the "backups" dataset. This could just as easily be a "backups/data-pool" dataset under which things are created, like so:

root@quadra ~$ zfs destroy -r backups/home
root@quadra ~$ zfs create backups/data-pool
root@quadra ~$ zfs send data/home/benr@001 | zfs recv -d backups/data-pool
root@quadra ~$ zfs list -r backups
NAME                              USED  AVAIL  REFER  MOUNTPOINT
backups                           217K   218M    20K  /backups
backups/data-pool                 125K   218M    19K  /backups/data-pool
backups/data-pool/home            106K   218M    18K  /backups/data-pool/home
backups/data-pool/home/benr        88K   218M    88K  /backups/data-pool/home/benr
backups/data-pool/home/benr@001      0      -    88K  -
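
The naming rule that -d applies in the two listings above can be sketched as a tiny shell function. This is only an illustration of the pattern seen in the output (drop the source pool name, graft the rest onto the target); recv_d_name is a made-up helper, not part of ZFS, and it assumes the sent dataset sits below the pool root.

```shell
# Sketch: predict the name that `zfs recv -d <target>` will create.
# recv_d_name is a hypothetical helper; it never calls zfs.
recv_d_name() {
    sent=$1      # snapshot being sent, e.g. data/home/benr@001
    target=$2    # dataset given to recv -d, e.g. backups/data-pool
    # Strip the leading pool name ("data/") and append the rest to the target.
    printf '%s/%s\n' "$target" "${sent#*/}"
}

recv_d_name data/home/benr@001 backups
recv_d_name data/home/benr@001 backups/data-pool
```

The two calls print backups/home/benr@001 and backups/data-pool/home/benr@001, matching the zfs list output above.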

What about incrementals? I mean, I'll want to freshen the copy, right? This is done by creating another snapshot and then telling zfs send to send only the difference between the two:

root@quadra ~$ cp -r /etc/security/* /data/home/benr
root@quadra ~$ zfs snapshot data/home/benr@002
root@quadra ~$ zfs list -r data/home/benr
data/home/benr       379K   199M   355K  /data/home/benr
data/home/benr@001    24K      -    88K  -
data/home/benr@002      0      -   355K  -

root@quadra ~$ zfs send -i data/home/benr@001 data/home/benr@002 | zfs recv -d backups/data-pool
root@quadra ~$ zfs list -r backups/data-pool
NAME                              USED  AVAIL  REFER  MOUNTPOINT
backups/data-pool                 417K   217M    19K  /backups/data-pool
backups/data-pool/home            398K   217M    19K  /backups/data-pool/home
backups/data-pool/home/benr       379K   217M   355K  /backups/data-pool/home/benr
backups/data-pool/home/benr@001    24K      -    88K  -
backups/data-pool/home/benr@002      0      -   355K  -

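So here I used zfs send/recv almost exactly as before, but this time I told zfs send (via -i) about an earlier snapshot from which to create an incremental. This only works because the destination already holds snapshot 001 from the earlier full send. Notice that the zfs recv command didn't change at all.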

But what if I want to send it to another system? Easy, pipe the data through ssh (or rsh, or whatever) like so:

root@quadra ~$ zfs send data/home/benr@002 | ssh root@thumper.cuddletech.com zfs recv -d backups/data-pool

So that's the basics... but what does this mean? Let's get creative!

Firstly, we can write a script that creates a new snapshot every 30 seconds and then, thanks to pre-shared SSH keys, uses SSH as above to recv the data elsewhere. Add a little error checking and presto! A really nice, simple replication scheme. Even if a lot of your data changes, if you copy it every 30 seconds it's unlikely to build up into huge chunks that take very long to move. When it comes to data that changes frequently, the key is to move early and often!
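
A minimal sketch of such a loop might look like the following. Treat it as a starting point under stated assumptions rather than a finished tool: the function names, the DRY_RUN switch and the timestamp snapshot names are all my own inventions, the dataset and host are simply the ones from the examples above, and nothing here prunes old snapshots.

```shell
#!/bin/sh
# Sketch of a periodic replication loop (hypothetical helper names).
# DRY_RUN=1, the default here, prints the commands instead of running them.
SRC=${SRC:-data/home/benr}
DEST_HOST=${DEST_HOST:-root@thumper.cuddletech.com}
DEST=${DEST:-backups/data-pool}
DRY_RUN=${DRY_RUN:-1}

# Run a pipeline given as one string, or just print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$1"
    else
        sh -c "$1"
    fi
}

# Take snapshot $2, then send the delta since snapshot $1.
# An empty $1 means "no common snapshot yet", so a full stream is sent.
replicate_once() {
    base=$1
    snap=$2
    run "zfs snapshot $SRC@$snap" || return 1
    if [ -n "$base" ]; then
        run "zfs send -i $SRC@$base $SRC@$snap | ssh $DEST_HOST zfs recv -d $DEST"
    else
        run "zfs send $SRC@$snap | ssh $DEST_HOST zfs recv -d $DEST"
    fi
}

# Loop forever; only advance the base snapshot after a successful transfer.
replicate_loop() {
    prev=""
    while :; do
        curr=$(date +%Y%m%d-%H%M%S)
        if replicate_once "$prev" "$curr"; then
            prev=$curr
        else
            echo "replication of $SRC@$curr failed; keeping base '$prev'" >&2
        fi
        sleep "${INTERVAL:-30}"
    done
}
```

Calling replicate_loop with DRY_RUN=0 would start the real thing. Note that the exit status of the pipeline is that of the last command (the remote zfs recv), so a failure in zfs send alone is only caught indirectly; that is part of the "little error checking" still left to add.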

Now, say we don't need that; simple backups are fine. We can create a script that takes a new snapshot each day at midnight, named for the day of the week. When Wednesday comes around, the old "wed" snapshot is removed and a new one created, and then we may create a simple script that zfs send/recv's the Friday snapshot every weekend. Simple to do, plus we have those daily snapshots to fall back on in a pinch, hopefully keeping us from going out to a remote copy.
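
The nightly half of that idea could be sketched roughly like this, meant to be kicked off from cron at midnight. The helper names and the DRY_RUN switch are assumptions of mine, and data/home is just the example dataset from above.

```shell
#!/bin/sh
# Sketch of a day-of-week snapshot rotation (hypothetical helper names).
# DRY_RUN=1, the default here, prints the commands instead of running them.
DATASET=${DATASET:-data/home}
DRY_RUN=${DRY_RUN:-1}

# Run a command, or just print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Lower-case three-letter day name in the C locale, e.g. "wed".
day_tag() {
    LC_ALL=C date +%a | tr '[:upper:]' '[:lower:]'
}

# Replace last week's snapshot of the same name with a fresh one.
rotate_daily() {
    tag=$1
    # The destroy fails harmlessly the first week, when no old snapshot exists.
    run zfs destroy "$DATASET@$tag" 2>/dev/null
    run zfs snapshot "$DATASET@$tag"
}
```

A cron job would then call rotate_daily "$(day_tag)" with DRY_RUN=0, and a second weekend job would send the "fri" snapshot using the same zfs send | ssh ... zfs recv pipe shown earlier.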

So we've used pipes in a simple way, to securely transport our datastream from one system to another. Consider other unique possibilities, such as piping zfs send... into gzip before sending across the network!
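
As a rough sketch of the gzip idea: compress on the sending side and expand on the receiving side, just before zfs recv. The helper below only prints the pipeline so you can eyeball it first; compressed_send_cmd is a made-up name, and the host and datasets are the ones from the earlier examples.

```shell
# Sketch: print a pipeline that gzips the stream before it crosses the
# network and gunzips it on the far side. Nothing is executed here.
compressed_send_cmd() {
    snap=$1      # snapshot to send
    host=$2      # ssh destination
    target=$3    # dataset handed to recv -d on the remote side
    echo "zfs send $snap | gzip -c | ssh $host 'gunzip -c | zfs recv -d $target'"
}

compressed_send_cmd data/home/benr@002 root@thumper.cuddletech.com backups/data-pool
```

The quotes around the remote command matter: they make gunzip run on the receiving host rather than locally. ssh's own -C option is another way to get compression on the wire.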

Or.... say what you really want is a portable dump of your ZFS dataset(s). Remember that we're outputting a datastream from zfs send... just re-direct STDOUT to a file!

root@quadra ~$ zfs send data/home/benr@002 > /tmp/home-benr.zdump  
root@quadra ~$ ls -lh /tmp/home-benr.zdump
-rw-r--r-- 1 root root 421K Nov  6 15:14 /tmp/home-benr.zdump

Now let's test a restore from this "zdump":

root@quadra ~$ zfs create backups/dump-restore               
root@quadra ~$ cat /tmp/home-benr.zdump | zfs recv -d backups/dump-restore
root@quadra ~$ zfs list -r backups/dump-restore
NAME                                 USED  AVAIL  REFER  MOUNTPOINT
backups/dump-restore                 392K   217M    19K  /backups/dump-restore
backups/dump-restore/home            373K   217M    18K  /backups/dump-restore/home
backups/dump-restore/home/benr       355K   217M   355K  /backups/dump-restore/home/benr
backups/dump-restore/home/benr@002      0      -   355K  -

Works like a charm! Again, we can use pipes for fun here too. Let's say that we really want a dump that is encrypted and compressed!

root@quadra ~$ pktool genkey keystore=file outkey=zdump.key keytype=aes keylen=128
root@quadra ~$ zfs send data/home/benr@002 | gzip | encrypt -a aes -k zdump.key > /tmp/home_benr-AES128GZ.zdump

So we've output a datastream based on a snapshot (002), compressed it, encrypted it with 128-bit AES and then dumped it to a file. We could just as easily dump it to a tape (/dev/rmt/0cbn or something) for archiving purposes.
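
Getting the data back means undoing those steps in reverse order: decrypt, then decompress, then zfs recv. Here is a hedged sketch that prints such a pipeline. restore_dump_cmd is a made-up helper, /tmp/encrypted.zdump is only a placeholder for whatever file the encrypt step wrote, and I'm assuming decrypt(1) is given the same algorithm and key file that encrypt used.

```shell
# Sketch: print the pipeline that reverses "zfs send | gzip | encrypt".
# restore_dump_cmd is a hypothetical helper; it runs nothing itself.
restore_dump_cmd() {
    dump=$1      # encrypted, compressed dump file
    key=$2       # key file used when the dump was written
    target=$3    # dataset handed to recv -d
    # Encryption was the last step when writing, so it is undone first.
    echo "decrypt -a aes -k $key -i $dump | gunzip -c | zfs recv -d $target"
}

# /tmp/encrypted.zdump is a placeholder path, not a file from this post.
restore_dump_cmd /tmp/encrypted.zdump zdump.key backups/dump-restore
```

Lose the key file and the dump is useless, so archive the key somewhere other than the tape it protects.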

Finally, what if we want to work on more than just a single snapshot? What if we want to send all the "home" datasets? For some time now (although just now arriving in Solaris 10) we've had recursive flags for both zfs snapshot and zfs send. Let's give it a try:

root@quadra ~$ zfs snapshot -r data/home@nov6
root@quadra ~$ zfs list -r data/home
data/home                755K   199M    24K  /data/home
data/home@nov6              0      -    24K  -
data/home/benr           379K   199M   355K  /data/home/benr
data/home/benr@001        24K      -    88K  -
data/home/benr@002          0      -   355K  -
data/home/benr@nov6         0      -   355K  -
data/home/conradr         88K   199M    88K  /data/home/conradr
data/home/conradr@nov6      0      -    88K  -
data/home/glennr          88K   199M    88K  /data/home/glennr
data/home/glennr@nov6       0      -    88K  -
data/home/novar           88K   199M    88K  /data/home/novar
data/home/novar@nov6        0      -    88K  -
data/home/tamr            88K   199M    88K  /data/home/tamr
data/home/tamr@nov6         0      -    88K  -

root@quadra ~$ zfs destroy -r backups/home
root@quadra ~$ zfs list -r backups
backups    86K   218M    20K  /backups

root@quadra ~$ zfs send -R data/home@nov6 | zfs recv -d backups

root@quadra ~$ zfs list -r backups
NAME                        USED  AVAIL  REFER  MOUNTPOINT
backups                     902K   217M    18K  /backups
backups/home                755K   199M    24K  /backups/home
backups/home@nov6              0      -    24K  -
backups/home/benr           379K   199M   355K  /backups/home/benr
backups/home/benr@001        24K      -    88K  -
backups/home/benr@002          0      -   355K  -
backups/home/benr@nov6         0      -   355K  -
backups/home/conradr         88K   199M    88K  /backups/home/conradr
backups/home/conradr@nov6      0      -    88K  -
backups/home/glennr          88K   199M    88K  /backups/home/glennr
backups/home/glennr@nov6       0      -    88K  -
backups/home/novar           88K   199M    88K  /backups/home/novar
backups/home/novar@nov6        0      -    88K  -
backups/home/tamr            88K   199M    88K  /backups/home/tamr
backups/home/tamr@nov6         0      -    88K  -

Simple, just snapshot the parent dataset with the -r flag, then send the parent dataset snapshot with the -R flag. Otherwise, it's all the same! And, of course, you can combine this with all our other pipe tricks just the same!
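
One such combination is a recursive incremental: snapshot the whole tree again, then send only what changed since the last recursive snapshot by pairing -R with -i. The sketch below just prints the two commands; recursive_incremental_cmd is a made-up helper and "nov7" is a hypothetical follow-up snapshot name.

```shell
# Sketch: print the commands for a recursive incremental refresh.
# recursive_incremental_cmd is a hypothetical helper; it runs nothing itself.
recursive_incremental_cmd() {
    parent=$1    # parent dataset, e.g. data/home
    old=$2       # snapshot already present on both sides, e.g. nov6
    new=$3       # new snapshot name, e.g. nov7 (hypothetical)
    target=$4    # dataset handed to recv -d
    echo "zfs snapshot -r $parent@$new"
    echo "zfs send -R -i $parent@$old $parent@$new | zfs recv -d $target"
}

recursive_incremental_cmd data/home nov6 nov7 backups
```

If the receive complains that the destination has been modified since the last snapshot, the -F option of zfs recv is the thing to read up on in zfs(1M).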

And so we see that using a single set of commands, we have simple and powerful replication, backup, and archive capabilities. A lot of power unleashed with just a little imagination; that's the ZFS way.

- - C O M M E N T S - -

I’m not sure, but I think I want ZFS to have my baby

Matt Simmons - 07 November '08 - 00:36

These posts are just fantastic, definitely great for Linux people like me who are OpenSolaris-curious and have 2008.5 installed on a spare partition… keep them coming!

Pedro - 07 November '08 - 06:19

Some kind of synchronous data replication based on zfs would be nice. This is often a requirement at places where fault tolerance is more important than high availability (or performance).

Robert - 07 November '08 - 13:20

Robert: Look at AVS: [[http://www.opensolaris.org/os/project/..]]

benr - 07 November '08 - 17:21

ZFS may be the coolest thing
SINCE sliced bread

JV - 09 November '08 - 05:55

The method you describe would be good for replacing licensed backup software. However, I was wondering if it is possible to do this from within the Fishworks GUI? From what I can gather, Fishworks doesn’t allow access to a Solaris shell that can operate on the underlying filesystems.

Jeroen - 17 November '08 - 00:05

Yeah, that’s a cool part of ZFS, but what about scalability? For example, I have a thousand child filesystems and want to sync them all to a backup server with -R. Have you tried something like that? I’m still trying but… it works terribly for me. Any suggestions? Thanks.

Mike - 20 November '08 - 22:26

Question regarding ZFS backups. I have a ZFS pool (2.3x compression); when I back it up, a 100MB file becomes a 300MB file on tape. Is there a better way to back up this data? The whole ZFS filesystem is quite large, about 30TB used, and I’m afraid that if the backup goes this way it will end up being 80TB on tape.

Any ideas/suggestions are really appreciated.


Nav - 18 April '09 - 00:44


At the risk of stating the obvious, which you likely know, but for the benefit of the uninitiated: given that ZFS does its own compression, when you transport/pipe data outside of the ZFS filesystem to a tape device, for example, you have two immediate options: 1) pipe the raw data to the tape device with compression toggled on, or 2) pipe it through a compression utility like gzip beforehand and then to the tape device (with compression off, of course).

Great post Ben.

Daniel Gomez - 05 June '09 - 15:29

It was a very nice idea! Just wanna say thank you for the information you have shared. Just continue writing this kind of post. I will be your loyal reader. Thanks again.

Great post on using ZFS send/recv for replication of ZFS datasets. When I first started my quest for replication of ZFS, I ran into AVS, which is quite impressive, however, definitely not easy to configure. In addition, I think hardware similarity is required between the two units. I wanted to replicate ZFS datasets to another storage device that had differing hard drive sizes in the ZFS pool.

In your article you mentioned being creative and creating a script to enable frequent replications between remote systems. This was the idea I was thinking of when I found your blog post. The difficulty I am having is the process/logic to use for replicating that often, say every 30 seconds. To send/recv incrementally, you have to specify two snapshots. From what I can tell, those two snapshots must be the currently replicated snapshot on the target system and the most recent (or a more recent) snapshot on the source system. Unless I am misunderstanding, the kicker is that the “currently replicated” snapshot must reside on both systems.

How do you manage data requirements for that many snapshots (2880/day at 30 seconds) on a device that has 2TB of always-changing data, such as in my situation, which is an iSCSI target for VMware hosts? And if you do decide to manage the snapshots, by removing them I guess, how would the script know which snapshot to use? If the dataset on the target contains different snapshots than the one on the source, you may have to delete the ZFS dataset and start over.

Just some thoughts—- I will probably have a ZFS replication script on my blog at some point in the near future, with, hopefully, these issues resolved.


Aaron Gilbert - 22 January '10 - 23:03

I’ve read that using “zfs send” to tape is quite dangerous: small corruption renders the file unreadable by zfs receive.( [[http://www.solarisinternals.com/wiki/i..]] )

To circumvent this, I’m creating a replication pool in a file vdev:
mkfile -n 500g /backups/filepool.vdev
zpool create filepool /backups/filepool.vdev
zfs set compression=gzip-9 filepool
zfs send data/myfs@today|zfs receive filepool/myfs
zpool export filepool

Now the file /backups/filepool.vdev can be stored to tape as a monthly backup file.
This way, I hope that in case of (small) corruption ZFS will be able to keep parts with good checksums.
An added benefit is that the tape can be used at its full speed.
restore is very fast too.
ZFS gzip compression is a lot faster than the gzip command itself on T5220.
To spare tape space, I’m doing incremental zfs send for the dailies.

My only question is whether in case of emergency, I can start a production quality service from a file vdev and replicate/mirror it to the real storage afterwards.
I’ve read somewhere that you shouldn’t use a file vdev for any purpose other than testing (although iSCSI volumes are generally also implemented as big files…).


Michel Jansens - 31 August '10 - 07:00

