ZFS Backup & Recovery using Hadoop HDFS
Posted on April 18, 2011
Hadoop HDFS has essentially become the de facto standard in cluster file systems. In theory I'm a big fan of Lustre; I say "theory" because it never got ported to Solaris, despite the fact that Sun bought Lustre. But that's a different story. HDFS is extremely portable and well supported by a thriving community that is doing just about anything you can imagine with it.
Consider a large cluster of production nodes. They almost certainly have unused disk space, and it's probably pretty fast disk. Wouldn't it be nice if we could aggregate all that disk together for backups? With HDFS we can. Setting up HDFS is pretty well documented (I can write it up for Solaris users if there is demand, but it's pretty straightforward), so let's see how easy it is to get ZFS backups into and out of HDFS.
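For reference, here's roughly what a single-node bring-up looks like on a stock Hadoop tarball. Treat it as a sketch: the install path and the hdfs://localhost:9000 URI are just placeholders, so adjust for your environment.

# Sketch of a single-node HDFS bring-up (Hadoop 0.20-era tarball layout assumed).
# conf/core-site.xml points fs.default.name at something like hdfs://localhost:9000,
# and conf/hdfs-site.xml sets dfs.replication to 1 since there's only one datanode.
cd /opt/hadoop                   # hypothetical install path
bin/hadoop namenode -format      # one time only: initialize the namenode metadata
bin/start-dfs.sh                 # start the namenode and datanode daemons
bin/hadoop dfsadmin -report      # sanity check: datanode registered, capacity visible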
Here is a DFS report from my HDFS test setup (single node):
root@newton hadoop$ hadoop dfsadmin -report
Safe mode is ON
Configured Capacity: 2942589468672 (2.68 TB)
Present Capacity: 1019145412608 (949.15 GB)
DFS Remaining: 1017809015808 (947.91 GB)
DFS Used: 1336396800 (1.24 GB)
DFS Used%: 0.13%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)

Name: 127.0.0.1:50010
Decommission Status : Normal
Configured Capacity: 2942589468672 (2.68 TB)
DFS Used: 1336396800 (1.24 GB)
Non DFS Used: 1923444056064 (1.75 TB)
DFS Remaining: 1017809015808 (947.91 GB)
DFS Used%: 0.05%
DFS Remaining%: 34.59%
Last contact: Mon Apr 18 14:18:30 PDT 2011
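One thing to note in that report: safe mode is ON. The namenode won't accept writes while it's in safe mode; it normally drops out on its own once the datanodes have reported in, but you can check and, if need be, kick it out manually:

hadoop dfsadmin -safemode get      # is the namenode still in safe mode?
hadoop dfsadmin -safemode leave    # force it out so our backup writes will succeed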
So HDFS is ready to rock. Now let's create a ZFS dataset and populate it with some data:
root@newton ~$ zfs create -o mountpoint=/backup_test quadra/backup_test
root@newton ~$ cp *.pdf /backup_test/
Now, let's create a directory within HDFS to put our backups in:
root@newton ~$ hadoop fs -mkdir /zfs_backups
root@newton ~$ hadoop fs -ls /
Found 11 items
...
drwxrwxrwx   - root supergroup          0 2011-03-05 00:07 /hypertable
drwxr-xr-x   - root supergroup          0 2011-03-02 12:27 /system
drwxr-xr-x   - root supergroup          0 2011-04-18 14:22 /zfs_backups
Ready to rock. So let's actually do the backup. We're going to create a snapshot and then zfs send it to the stdin of "hadoop fs -put". Once we've done that, we'll destroy our original ZFS dataset:
root@newton ~$ zfs snapshot quadra/backup_test@051811
root@newton ~$ zfs send quadra/backup_test@051811 | hadoop \
> fs -put - /zfs_backups/backup_test.051811.zdump
root@newton ~$
root@newton ~$ zfs destroy -r quadra/backup_test
root@newton ~$ zfs list quadra/backup_test
cannot open 'quadra/backup_test': dataset does not exist
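As an aside, if you wanted to run this on a schedule it's easy to wrap. Here's a rough sketch; the dataset name, HDFS path, and date stamp are just my own picks for illustration:

#!/bin/sh
# Hypothetical wrapper: snapshot a dataset and stream it into HDFS.
DATASET="quadra/backup_test"                       # dataset to protect
STAMP=`date +%m%d%y`                               # MMDDYY-style stamp like the one above
SNAP="${DATASET}@${STAMP}"
DEST="/zfs_backups/`basename ${DATASET}`.${STAMP}.zdump"

zfs snapshot "${SNAP}" || exit 1
zfs send "${SNAP}" | hadoop fs -put - "${DEST}" || exit 1
echo "stored ${SNAP} in HDFS as ${DEST}"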
OK, so the ZFS snapshot has been stored as a file within HDFS and we've destroyed our dataset. Now, let's recover it using the reverse procedure:
root@newton ~$ hadoop fs -get /zfs_backups/backup_test.051811.zdump \
> - | zfs recv -d quadra
root@newton ~$ zfs list -r quadra/backup_test
NAME                         USED  AVAIL  REFER  MOUNTPOINT
quadra/backup_test          32.3M   316G  32.3M  /quadra/backup_test
quadra/backup_test@051811       0      -  32.3M  -
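The same plumbing handles incrementals: once the initial full stream is sitting in HDFS, "zfs send -i" can push just the delta between two snapshots into its own file. A quick sketch with made-up snapshot names:

# Hypothetical follow-up run: store only the changes since the last snapshot.
zfs snapshot quadra/backup_test@052511
zfs send -i quadra/backup_test@051811 quadra/backup_test@052511 | \
    hadoop fs -put - /zfs_backups/backup_test.051811-052511.zdump

# On restore, receive the full stream first, then each incremental in order.
hadoop fs -get /zfs_backups/backup_test.051811-052511.zdump - | zfs recv -d quadra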
Notice that in the full restore above we lost our dataset properties during the receive. Let's fix that and check that our files are back:
root@newton ~$ zfs set mountpoint=/backup_test quadra/backup_test
root@newton ~$ ls -l /backup_test/
total 33076
-rw-r--r-- 1 root root   330028 2011-04-18 14:20 Deployment_Guide_for_HP_ProLiant_Servers.pdf
-rw-r--r-- 1 root root    88378 2011-04-18 14:20 GeekBench-Receipt.pdf
-rw-r--r-- 1 root root   101243 2011-04-18 14:20 HP_ProLiant_Health_Monitor_User_Guide.pdf
-rw-r--r-- 1 root root    90844 2011-04-18 14:20 HP_ProLiant_Support_Pack_User_Guide_861.pdf
-rw-r--r-- 1 root root   123419 2011-04-18 14:20 inthebeginning.pdf
-rw-r--r-- 1 root root 21337122 2011-04-18 14:20 Jurans Quality Handbook.pdf
-rw-r--r-- 1 root root  1338119 2011-04-18 14:20 perc-technical-guidebook.pdf
-rw-r--r-- 1 root root  2401352 2011-04-18 14:20 PowerEdgeR510_Technical_Guidebook[1].pdf
-rw-r--r-- 1 root root  5923274 2011-04-18 14:20 R710-HOM.pdf
-rw-r--r-- 1 root root  1103929 2011-04-18 14:20 server-poweredge-r710-tech-guidebook.pdf
-rw-r--r-- 1 root root   504059 2011-04-18 14:20 TheQualityTrilogy.pdf
Please note that the reason the dataset properties were not retained is that I'm using an old zpool (version 18). If you're running a newer pool version (check with "zpool get all pool_name"), the properties will travel with the backup stream.
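If your ZFS release supports it, you can also carry the properties explicitly by sending a replication stream. Roughly, using the same dataset and path as above (the usual version caveats apply):

# Hypothetical: -R builds a replication stream that includes dataset properties,
# snapshots, and any descendant datasets along with the data.
zfs send -R quadra/backup_test@051811 | hadoop fs -put - /zfs_backups/backup_test.051811.zdump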
So there you go. Backup and recovery using ZFS send/recv to and from HDFS. Straightforward and easy to implement.