First Look at ZFS Deduplication

10 Nov '09 - 09:57 by benr

ZFS Deduplication was recently putback (Sun terminology for "commit") to ON (Solaris's primary codebase). That means it should go out at snv_128 (Build 128) due later this week.

Unable to wait for the BFU archives, I resorted to actually building the code myself to play with it, something I've not felt the burning need to do for at least 2 years (I'll blog about that shortly). Here's the initial review...

In typical ZFS fashion, putting dedup to work is a trivial task. Zpools are created in the normal way. Dedup is enabled on a per-dataset basis, so it is simply a matter of turning it on:

root@quadra ~$ zpool create stick c4t0d0
root@quadra ~$ zpool get all stick
NAME   PROPERTY       VALUE       SOURCE
stick  size           3.75G       -
stick  capacity       0%          -
stick  altroot        -           default
stick  health         ONLINE      -
stick  guid           12142487970365036186  default
stick  version        21          default
stick  bootfs         -           default
stick  delegation     on          default
stick  autoreplace    off         default
stick  cachefile      -           default
stick  failmode       wait        default
stick  listsnapshots  off         default
stick  autoexpand     off         default
stick  dedupratio     1.00x       -
stick  free           3.75G       -
stick  allocated      76.5K       -

Notice that there is no option to enable dedup for the pool; however, there is a read-only "dedupratio" property. Because ZFS properties are inherited by child datasets, we'll enable dedup on the root dataset, in this case "stick":

root@quadra ~$ zfs set dedup=on stick

Done! That's it. Really, you're done! Stop reading this now. :)

... ok, maybe I'll go into it a bit more.

As with many ZFS dataset properties, there is more than one possible setting. The default value of the "dedup" property is "off". It can also be set to "on", "sha256", "verify", "sha256,verify", or "fletcher4,verify". "on" is simply an alias for "sha256". "verify" is an alias for "sha256,verify": when two blocks have the same checksum, ZFS compares them byte for byte before treating them as duplicates, so a hash collision cannot cause the wrong data to be shared. The extra comparison is very system intensive and is not recommended for casual use. If you require absolute integrity at all costs, go for it, but test your workload first. Phrases like "hash collision" can cause a panic, but remember that the odds of one are astronomically small. For details on this, see Jeff Bonwick's post on ZFS Dedup.
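
To make the difference between plain "sha256" and "sha256,verify" concrete, here is a toy sketch in Python. It is not how ZFS implements its dedup table; the class name ToyDedupStore, the fixed 128K block size, the in-memory dictionaries, and the "probe the next slot" handling of a collision are all assumptions made for illustration. It only shows the idea: checksum each block, store each unique block once, and, with verify on, compare the bytes before trusting a checksum match.

```python
import hashlib

BLOCK_SIZE = 128 * 1024  # assumed block size for this toy only


class ToyDedupStore:
    """Hypothetical in-memory model of block-level dedup (not the ZFS DDT)."""

    def __init__(self, verify=False):
        self.verify = verify
        self.blocks = {}        # (checksum, slot) -> stored block bytes
        self.refcount = {}      # (checksum, slot) -> number of references
        self.logical_bytes = 0  # bytes the callers asked us to write

    def write(self, data):
        """Split data into blocks and store each unique block once."""
        for off in range(0, len(data), BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            self.logical_bytes += len(block)
            key = (hashlib.sha256(block).digest(), 0)
            if self.verify:
                # Same checksum but different bytes would be a collision:
                # keep both blocks by moving on to the next slot.
                while key in self.blocks and self.blocks[key] != block:
                    key = (key[0], key[1] + 1)
            if key in self.blocks:
                self.refcount[key] += 1   # duplicate: just add a reference
            else:
                self.blocks[key] = block  # first sighting: store the block
                self.refcount[key] = 1

    def allocated_bytes(self):
        return sum(len(b) for b in self.blocks.values())

    def dedup_ratio(self):
        return self.logical_bytes / self.allocated_bytes()


# Four distinct blocks, written three times (like copying one set of
# files into three directories).
payload = b"".join(bytes([i]) * BLOCK_SIZE for i in range(4))
store = ToyDedupStore(verify=True)
for _ in range(3):
    store.write(payload)

ratio = store.dedup_ratio()
print(ratio)  # 3.0
```

Without verify, the toy trusts the checksum alone, which is the trade-off described above: less work per write, at the cost of relying on the collision odds.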

So, now for some testing. I've created my "stick" pool on a new 4GB micro-USB stick and enabled dedup. Let's copy a bunch of JPEGs into several directories and see what happens:

root@quadra ~$ zfs list stick
NAME    USED  AVAIL  REFER  MOUNTPOINT
stick  73.5K  3.69G    21K  /stick
root@quadra ~$ mkdir /stick/userA
root@quadra ~$ mkdir /stick/userB
root@quadra ~$ mkdir /stick/userC
root@quadra ~$ cd img
root@quadra img$ time cp * /stick/userA
real    0m15.395s
user    0m0.005s
sys     0m0.174s
root@quadra img$ time cp * /stick/userB
real    0m15.952s
user    0m0.004s
sys     0m0.112s
root@quadra img$ time cp * /stick/userC
real    0m2.347s
user    0m0.004s
sys     0m0.125s

root@quadra img$ zfs list stick
NAME    USED  AVAIL  REFER  MOUNTPOINT
stick   203M  3.62G   203M  /stick

root@quadra img$ cd /stick/userA/
root@quadra userA$ du -sh .
74M     .

OK, notice that I'm copying in 74MB of data three times, each time to a different directory. (It's slow because it's a crappy USB stick.) If we run du on one directory it registers the proper size of 74MB, and if we look at zfs list it shows the full, un-deduplicated size of 203MB. In fact, if I look at the dataset properties I have no indication at all of the on-disk size:

root@quadra userA$ zfs get all stick
NAME   PROPERTY              VALUE                  SOURCE
stick  type                  filesystem             -
stick  creation              Tue Nov 10  0:07 2009  -
stick  used                  220M                   -
stick  available             3.62G                  -
stick  referenced            220M                   -
stick  compressratio         1.00x                  -
stick  mounted               yes                    -
stick  quota                 none                   default
stick  reservation           none                   default
...

So here's the magic... look at the pool size:

root@quadra ~$ zpool list stick
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
stick  3.75G  72.5M  3.68G     1%  3.06x  ONLINE  -

Beautiful. Only 72.5MB is allocated, and we correctly see a dedup ratio of about 3. (The allocation is slightly less than the 74MB of files, leading me to believe there are some duplicate images in the set, which I don't doubt.)
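
The reasoning about duplicate images can be checked with a small worked example. The sketch below is a toy in Python, not ZFS: the function name dedup_ratio and the tiny byte strings standing in for image files are made up, and whole files are hashed instead of blocks to keep it short. It shows that three copies of a set with no internal duplicates give a ratio of exactly 3, while a set that already contains a duplicate pushes the ratio above 3.

```python
import hashlib


def dedup_ratio(files, copies):
    """Referenced bytes divided by unique bytes when `files` is copied
    `copies` times. Whole files stand in for blocks in this toy."""
    referenced = 0
    unique = {}  # checksum -> size of the one stored copy
    for _ in range(copies):
        for data in files:
            referenced += len(data)
            unique.setdefault(hashlib.sha256(data).digest(), len(data))
    return referenced / sum(unique.values())


distinct = [b"a" * 100, b"b" * 100, b"c" * 100]
with_dupe = distinct + [b"a" * 100]  # one "image" appears twice in the set

ratio_distinct = dedup_ratio(distinct, copies=3)    # 900 / 300
ratio_with_dupe = dedup_ratio(with_dupe, copies=3)  # 1200 / 300

print(ratio_distinct, ratio_with_dupe)  # 3.0 4.0
```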

Yet again, ZFS makes it "just work". And you don't need a big, huge, expensive piece of gear; I'm deduping on a 4GB USB stick.

Suck it Data Domain. :)

For the elite ZFS Internals hackers out there, you can get a closer look at dedup using zdb -S (thanks to Jeff Victor for the tip):

root@quadra ~$ zdb -S stick
Simulated DDT histogram:

bucket              allocated                       referenced          
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     2      623   69.9M   69.9M   69.9M    1.83K    210M    210M    210M
     4       14   1.63M   1.63M   1.63M       84    9.8M    9.8M    9.8M
 Total      637   71.5M   71.5M   71.5M    1.91K    219M    219M    219M

dedup = 3.07, compress = 1.00, copies = 1.00, dedup * compress / copies = 3.07
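
The ratio on that last line can be roughly reproduced by hand from the histogram. The Python sketch below uses only the rounded figures displayed in the output above; the variable names are made up. It divides referenced size by allocated size for the Total row and for each bucket. The result differs from zdb's 3.07 in the last digit, presumably because zdb works from exact byte counts rather than the rounded values it prints.

```python
# DSIZE values in MB, copied from the zdb -S output above (rounded as displayed).
allocated_total_mb = 71.5
referenced_total_mb = 219.0

dedup = referenced_total_mb / allocated_total_mb
print(f"dedup ~= {dedup:.2f}")

# Per bucket: refcnt bucket -> (allocated MB, referenced MB).
buckets = {2: (69.9, 210.0), 4: (1.63, 9.8)}

# Average number of references per stored byte in each bucket.
avg_refs = {refcnt: ref / alloc for refcnt, (alloc, ref) in buckets.items()}
for refcnt, refs in avg_refs.items():
    print(f"bucket {refcnt}: about {refs:.1f} references per block")
```

Read this way, and assuming the refcnt buckets cover power-of-two ranges, most blocks are referenced about three times (the three copies), and a small number about six times, which fits the guess that a few images were already duplicated in the source directory.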

So now it's time to really beat on this thing and see if and where it breaks. Dedup for the masses is coming in the mail!!!


- - C O M M E N T S - -

Nice!
Now, what happens with zfs send | zfs receive? Does it send only deduped blocks, or undeduped blocks as usual? Would be very nice to know.

Thanks for a great post,
Dmitry

Dmitry Sorokin (Email) - 10 November '09 - 14:27

Ben, nice post.

Dmitry, zfs send/receive dedup was recently integrated. See [[http://www.c0t0d0s0.org/archives/6096-..]]

Derek Morr (Email) (URL) - 10 November '09 - 14:40

Do you have the ZFS version # for this?

Alan Pae (Email) (URL) - 10 November '09 - 20:52

Derek, thanks for the reference.
Interesting way of implementing ZFS dedupe in the send/receive streams. I’m not too sure how useful that is however, if I understand the concept correctly.
Let’s take Ben’s example:
1. copy set of files to ZFS data set with dedupe enabled, then take a snapshot A.
2. zfs send -D | zfs receive the set based on snapshot A
3. copy the same set of files to different subdirectory and take snapshot B
4. zfs send -I -D | zfs receive the A to B delta
5. copy the same set of files to yet another different subdirectory and take snapshot C
6. zfs send -I -D | zfs receive the B to C delta

If I understand the current implementation of ZFS dedupe for send/receive streams, every time the snapshot delta is sent, the full size will be sent, as opposed to sending almost nothing, since the data was already seeded in step 2. That doesn’t really help with replication over a WAN. Why not keep metadata for every checksummed block and send over only the blocks that don’t exist at the remote end?
That approach would be much better and more scalable than deduplicating only the blocks within one stream.
Again, I might have misinterpreted the info, but that’s the way I understood it.
Can anyone confirm what’s really happening?

Thanks,
Dmitry

Dmitry Sorokin (Email) - 11 November '09 - 18:09

I’m curious as to why the cp in step 2 took as long as the cp in step 1? The data was already there, so I would have expected step 2 and 3 to take a similar amount of time.

zfscurious - 14 December '09 - 23:34
