First Look at ZFS Deduplication
10 Nov '09 - 09:57 by benr
ZFS Deduplication was recently putback (Sun terminology for "commit") to ON (Solaris's primary codebase). That means it should go out at snv_128 (Build 128) due later this week.
Unable to wait for the BFU archives, I resorted to actually building the code myself to play with it, something I've not felt the burning need to do for at least 2 years (I'll blog about that shortly). Here's the initial review...
In typical ZFS fashion, putting dedup to work is a trivial task. Zpools are created in the normal way; the dedup feature is enabled on a per-dataset basis, so it is a simple matter of turning it on:
root@quadra ~$ zpool create stick c4t0d0
root@quadra ~$ zpool get all stick
NAME   PROPERTY       VALUE                 SOURCE
stick  size           3.75G                 -
stick  capacity       0%                    -
stick  altroot        -                     default
stick  health         ONLINE                -
stick  guid           12142487970365036186  default
stick  version        21                    default
stick  bootfs         -                     default
stick  delegation     on                    default
stick  autoreplace    off                   default
stick  cachefile      -                     default
stick  failmode       wait                  default
stick  listsnapshots  off                   default
stick  autoexpand     off                   default
stick  dedupratio     1.00x                 -
stick  free           3.75G                 -
stick  allocated      76.5K                 -
Notice that there is no option to enable dedup for the pool; however, there is a read-only "dedupratio" property. Because ZFS properties are inherited by child datasets, we'll enable dedup on the root dataset, in this case "stick":
root@quadra ~$ zfs set dedup=on stick
Done! That's it. Really, you're done! Stop reading this now. :)
... ok, maybe I'll go into it a bit more.
As with many ZFS dataset properties, "dedup" accepts more than one setting. The default value is "off". It can also be set to "on", "sha256", "verify", or "fletcher4,verify". "on" is simply a synonym for "sha256", and "verify" is a synonym for "sha256,verify". Verification compares the actual block contents whenever two checksums match, so a hash collision is detected instead of being silently treated as a duplicate. However, this is very system intensive and is not recommended for casual use. If you require absolute integrity at all costs, go for it, but test your workload first. Phrases like "hash collision" can cause a panic, but remember that the odds against one are astronomical. For details on this see Jeff Bonwick's post on ZFS Dedup.
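To make the "sha256" and "verify" settings more concrete, here is a toy sketch in Python of a checksum-keyed block table. It is not ZFS code and makes no claim about how the real dedup table is laid out; ToyDedupTable, sha256_key and length_key are hypothetical names invented for this illustration. The deliberately terrible length_key checksum exists only to force a collision, which SHA-256 will not give you in practice.

```python
import hashlib


def sha256_key(block: bytes) -> str:
    """Strong checksum: the stand-in for dedup=sha256 in this toy."""
    return hashlib.sha256(block).hexdigest()


def length_key(block: bytes) -> str:
    """Deliberately weak checksum (hypothetical), used only to force collisions."""
    return str(len(block))


class ToyDedupTable:
    """Toy model: one stored copy per checksum, plus a reference count."""

    def __init__(self, checksum=sha256_key, verify=False):
        self.checksum = checksum
        self.verify = verify
        self.entries = {}   # checksum -> [stored block, reference count]
        self.unshared = []  # blocks kept separately after a failed verify

    def write(self, block: bytes) -> str:
        key = self.checksum(block)
        entry = self.entries.get(key)
        if entry is None:
            # First time this checksum is seen: store the block.
            self.entries[key] = [block, 1]
            return "allocated"
        if self.verify and entry[0] != block:
            # Same checksum, different bytes: refuse to share the block.
            self.unshared.append(block)
            return "collision"
        # Checksum matched (and bytes matched, if verify is on): add a reference.
        entry[1] += 1
        return "deduped"


if __name__ == "__main__":
    table = ToyDedupTable()
    print(table.write(b"x" * 512))  # allocated
    print(table.write(b"x" * 512))  # deduped

    trusting = ToyDedupTable(checksum=length_key, verify=False)
    trusting.write(b"aaaa")
    print(trusting.write(b"bbbb"))  # deduped (wrongly: the bytes differ)

    careful = ToyDedupTable(checksum=length_key, verify=True)
    careful.write(b"aaaa")
    print(careful.write(b"bbbb"))   # collision
```

The extra byte comparison on every checksum match is the cost that "verify" adds in this model. It also suggests why the weaker fletcher4 checksum is only offered together with verify.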
So, now for some testing. I've created my "stick" pool on a new 4GB micro-USB stick and enabled dedup. Let's copy a bunch of JPEGs into several directories and see what happens:
root@quadra ~$ zfs list stick
NAME    USED  AVAIL  REFER  MOUNTPOINT
stick  73.5K  3.69G    21K  /stick
root@quadra ~$ mkdir /stick/userA
root@quadra ~$ mkdir /stick/userB
root@quadra ~$ mkdir /stick/userC
root@quadra ~$ cd img
root@quadra img$ time cp * /stick/userA

real    0m15.395s
user    0m0.005s
sys     0m0.174s
root@quadra img$ time cp * /stick/userB

real    0m15.952s
user    0m0.004s
sys     0m0.112s
root@quadra img$ time cp * /stick/userC

real    0m2.347s
user    0m0.004s
sys     0m0.125s
root@quadra img$ zfs list stick
NAME    USED  AVAIL  REFER  MOUNTPOINT
stick   203M  3.62G   203M  /stick
root@quadra img$ cd /stick/userA/
root@quadra userA$ du -sh .
 74M   .
OK, notice that I'm copying in 74MB of data 3 times, each to a different directory. (It's slow because it's a crappy USB stick.) If we run du it registers the proper size, and zfs list shows the full, un-deduplicated size of 203MB. In fact, if I look at the dataset properties I have no indication at all of its on-disk size:
root@quadra userA$ zfs get all stick
NAME   PROPERTY       VALUE                 SOURCE
stick  type           filesystem            -
stick  creation       Tue Nov 10 0:07 2009  -
stick  used           220M                  -
stick  available      3.62G                 -
stick  referenced     220M                  -
stick  compressratio  1.00x                 -
stick  mounted        yes                   -
stick  quota          none                  default
stick  reservation    none                  default
...
So here's the magic... look at the pool size:
root@quadra ~$ zpool list stick
NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
stick  3.75G  72.5M  3.68G   1%  3.06x  ONLINE  -
Beautiful. 72.5MB allocated, and we correctly see a dedup ratio of about 3. (The allocation is slightly less than the 74MB of files, leading me to believe there are some duplicate images, which I don't doubt.)
Yet again, ZFS makes it "just work". And you don't need a big huge expensive piece of gear, I'm deduping on this:
Suck it Data Domain. :)
For the elite ZFS Internals hackers out there, you can get a closer look at dedup using zdb -S (thanks to Jeff Victor for the tip):
root@quadra ~$ zdb -S stick
Simulated DDT histogram:

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     2      623   69.9M   69.9M   69.9M    1.83K    210M    210M    210M
     4       14   1.63M   1.63M   1.63M       84    9.8M    9.8M    9.8M
 Total      637   71.5M   71.5M   71.5M    1.91K    219M    219M    219M
dedup = 3.07, compress = 1.00, copies = 1.00, dedup * compress / copies = 3.07
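The ratio on the last line can be roughly reproduced from the histogram. The Python sketch below divides referenced size by allocated size for each row, using the rounded DSIZE figures printed by zdb -S, then applies the dedup * compress / copies formula from the final line. The variable and function names are mine, not anything zdb exposes, and since the inputs are rounded the results are approximate.

```python
# Rounded DSIZE values (in MB) copied from the zdb -S histogram above.
rows = {
    # refcnt bucket: (allocated, referenced)
    "2": (69.9, 210.0),
    "4": (1.63, 9.8),
    "Total": (71.5, 219.0),
}


def dedup_ratio(allocated_mb: float, referenced_mb: float) -> float:
    """Data the datasets reference, divided by what is actually stored."""
    return referenced_mb / allocated_mb


ratios = {name: dedup_ratio(a, r) for name, (a, r) in rows.items()}

compress = 1.00  # as reported by zdb for this pool
copies = 1.00    # as reported by zdb for this pool
overall = ratios["Total"] * compress / copies

for name, ratio in ratios.items():
    print(f"{name:>5}: {ratio:.2f}")
print(f"dedup * compress / copies = {overall:.2f}")
```

This gives roughly 3.00 for the first bucket, 6.01 for the second and 3.06 for the total. The total matches the 3.06x shown by zpool list; the 3.07 printed by zdb presumably comes from exact byte counts, not the rounded sizes used here. The second bucket is interesting: those 14 blocks are referenced about six times each, which is what you would expect if a couple of images appear twice in the original set and the set was then copied three times.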
So now it's time to really beat on this thing and see if and where it breaks. Dedup for the masses is coming in the mail!!!
Nice!
Now, what happens with zfs send | zfs receive? Does it send only deduped blocks, or undeduped as usual? Would be very nice to know.
Thanks for a great post,
Dmitry
Dmitry Sorokin (Email) - 10 November '09 - 14:27
Ben, nice post. Dmitry, zfs send/receive dedup was recently integrated. See [[http://www.c0t0d0s0.org/archives/6096-..]]
Derek Morr (Email) (URL) - 10 November '09 - 14:40
Do you have the ZFS version # for this?
Alan Pae (Email) (URL) - 10 November '09 - 20:52
Derek, thanks for the reference. Interesting way of implementing ZFS dedupe in the send/receive streams. I’m not too sure how useful that is, however, if I understand the concept correctly.
Let’s take Ben’s example:
1. copy set of files to ZFS data set with dedupe enabled, then take a snapshot A.
2. zfs send -D | zfs receive the set based on snapshot A
3. copy the same set of files to different subdirectory and take snapshot B
4. zfs send -I -D | zfs receive the A to B delta
5. copy the same set of files to yet another different subdirectory and take snapshot C
6. zfs send -I -D | zfs receive the B to C delta
If I understand the current implementation of ZFS dedupe for send/receive streams, every time a snapshot delta is sent, the full size will be sent, as opposed to sending almost nothing, since the data was already seeded in step 2. That doesn’t really help with replication over a WAN. Why not keep metadata for every checksummed block and send over only the blocks that don’t exist at the remote end?
This approach is much better and more scalable, as opposed to deduplicating only blocks within one stream.
Again, I might have misinterpreted the info, but that’s the way I understood it.
Can anyone confirm what’s really happening?
Thanks,
Dmitry
Dmitry Sorokin (Email) - 11 November '09 - 18:09
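The two designs Dmitry contrasts in steps 1 to 6 can be compared with a small Python simulation. This is only a model of the question as he poses it, not a description of how zfs send -D actually behaves. Both function names are hypothetical, and the ten 4 KiB blocks stand in for the set of files that gets copied before each snapshot.

```python
import hashlib


def send_per_stream(stream_blocks):
    """Dedup within a single stream: the set of seen checksums starts empty every send."""
    seen = set()
    sent = []
    for block in stream_blocks:
        key = hashlib.sha256(block).digest()
        if key not in seen:
            seen.add(key)
            sent.append(block)
    return sent


def send_receiver_aware(stream_blocks, receiver_keys):
    """Hypothetical alternative: skip any block whose checksum the receiver already holds."""
    sent = []
    for block in stream_blocks:
        key = hashlib.sha256(block).digest()
        if key not in receiver_keys:
            receiver_keys.add(key)
            sent.append(block)
    return sent


# Ten distinct 4 KiB blocks standing in for the copied files.
files = [bytes([i]) * 4096 for i in range(10)]

# Snapshot A, the A-to-B delta and the B-to-C delta each carry the same ten blocks.
per_stream = [len(send_per_stream(files)) for _ in range(3)]

receiver_keys = set()  # checksums remembered across sends
receiver_aware = [len(send_receiver_aware(files, receiver_keys)) for _ in range(3)]

print(per_stream)      # [10, 10, 10]
print(receiver_aware)  # [10, 0, 0]
```

In this model the per-stream approach resends all ten blocks on every delta, while the receiver-aware approach sends them once. The price is that the sender has to know, or ask, which checksums the remote side already holds.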
I’m curious as to why the cp in step 2 took as long as the cp in step 1. The data was already there, so I would have expected steps 2 and 3 to take a similar amount of time.
zfscurious - 14 December '09 - 23:34