Cuddletech | zdb: Examining ZFS At Point-Blank Range

Posted on November 1, 2008

ZFS is amazing in its simplicity and beauty, but it is also deceptively complex. Outside of the storage enthusiast ranks you’re unlikely ever to be forced to peer behind the veil, but as ZFS proliferates, more questions will come up regarding its internals. We have been given a tool to help us investigate the inner workings, zdb, but it is, somewhat intentionally I think, undocumented. Only two others that I know of have had the courage to talk about it publicly: Max Bruning, who is perhaps the single most authoritative voice regarding ZFS outside of Sun, and Marcelo Leal.

In this post, we’ll look only at the basics of ZDB to establish a baseline for its use. Running “zdb -h” will produce a summary of its syntax.

In its most basic form, zdb poolname, it outputs several pieces of information about the pool, including:

Cached pool configuration (-C)
Uberblock (-u)
Datasets (-d)
Report stats on zdb’s I/O (-s), this is similar to the first interval of zpool iostat

Thus, zdb testpool is the same as zdb -Cuds testpool. Let’s look at the output. The pool we’ll be using is actually a 256MB pre-allocated file with a single dataset… as simple as it can come.

root@quadra /$ zdb testpool
    version=12
    name='testpool'
    state=0
    txg=182
    pool_guid=1019414024587234776
    hostid=446817667
    hostname='quadra'
    vdev_tree
        type='root'
        id=0
        guid=1019414024587234776
        children[0]
                type='file'
                id=0
                guid=6723707841658505514
                path='/zdev/disk002'
                metaslab_array=23
                metaslab_shift=21
                ashift=9
                asize=263716864
                is_log=0
Uberblock

        magic = 0000000000bab10c
        version = 12
        txg = 184
        guid_sum = 7743121866245740290
        timestamp = 1225486684 UTC = Fri Oct 31 13:58:04 2008

Dataset mos [META], ID 0, cr_txg 4, 87.0K, 49 objects
Dataset testpool/dataset01 [ZPL], ID 30, cr_txg 6, 19.5K, 5 objects
Dataset testpool [ZPL], ID 16, cr_txg 1, 19.0K, 5 objects
                      capacity   operations   bandwidth  ---- errors ----
description          used avail  read write  read write  read write cksum
testpool             139K  250M   638     0  736K     0     0     0     0
  /zdev/disk002      139K  250M   638     0  736K     0     0     0     0

And so we see a variety of useful information, including:

Zpool (On Disk Format) Version Number
State
Host ID & Hostname
GUID (This is the numeric value you use when zpool import doesn’t like the name)
Child VDEVs that make up the pool
Uberblock magic number (read that hex value as “uba-bloc”, get it, 0bab10c, it’s funny!)
Timestamp
List of datasets
Summary of IO stats

So this information is interesting, but frankly not terribly useful if you already have the pool imported. It would likely be of more value if you couldn’t, or wouldn’t, import the pool, but those cases are rare, and 99% of the time zpool import will tell you what you want to know even if you don’t actually import.

There are 3 arguments that are really the core ones of interest, but before we get to them, you absolutely must understand something unique about zdb. ZDB is like a magnifying glass: at default magnification you can see that it’s tissue; turn up the magnification and you see that it has veins; turn it up again and you see how intricate the system is; crank it up one more time and you can see the blood cells themselves. With zdb, each time we repeat an argument we increase the verbosity and thus dig deeper. For instance, zdb -d will list the datasets of a pool, but zdb -dd will output the list of objects within them. Thus, when you really zoom in you’ll see commands that look really odd, like zdb -ddddddddd. This takes a little practice, so please toy around on a small test pool to get the hang of it.

Now, here are summaries of the 3 primary arguments you’ll use and how things change as you crank up the verbosity:

- -bb: Outputs a breakdown of space (block) usage for various ZFS object types.
- -bbb: Same as above, but includes breakdown by DMU/SPA level (L0-L6).
- -bbbb: Same as above, but includes one line per object with details about it, including compression, checksum, DVA, object ID, etc.
- -bbbbb…: Same as above.
- -d: Output list of datasets, including ID, cr_txg, size, and number of objects.
- -dd: Output concise list of objects within the dataset, with object id, lsize, asize, type, etc.
- -ddd: Same as -dd.
- -dddd: Outputs list of datasets and objects in detail, including objects path (filename), a/c/r/mtime, mode, etc.
- -ddddd: Same as previous, but includes indirect block addresses (DVAs) as well.
- -dddddd….: Same as above.
- -R pool:vdev_specifier:offset:size[:flags]: Given a DVA, outputs object contents in hex display format. If given the :r flag it will output in raw binary format. This can be used for manual recovery of files.

So let’s play with the first form above, block traversal. This will sweep the blocks of your pool or dataset, adding up what it finds, and then produce a report of any leakage and a breakdown of how the space is used. This is extremely useful information, but given that it traverses all blocks, it’s going to take a long time depending on how much data you have. On a home box this might take minutes or a couple of hours; on a large storage subsystem it could take hours or days. Let’s look at both -b and -bb for my simple test pool:

root@quadra ~$ zdb -b testpool

Traversing all blocks to verify nothing leaked ...

        No leaks (block sum matches space maps exactly)

        bp count:              50
        bp logical:        464896        avg:   9297
        bp physical:        40960        avg:    819    compression:  11.35
        bp allocated:      102912        avg:   2058    compression:   4.52
        SPA allocated:     102912       used:  0.04%

root@quadra ~$ zdb -bb testpool

Traversing all blocks to verify nothing leaked ...

        No leaks (block sum matches space maps exactly)

        bp count:              50
        bp logical:        464896        avg:   9297
        bp physical:        40960        avg:    819    compression:  11.35
        bp allocated:      102912        avg:   2058    compression:   4.52
        SPA allocated:     102912       used:  0.04%

Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     3  12.0K   1.50K   4.50K   1.50K    8.00     4.48  deferred free
     1    512     512   1.50K   1.50K    1.00     1.49  object directory
     1    512     512   1.50K   1.50K    1.00     1.49  object array
     1    16K      1K   3.00K   3.00K   16.00     2.99  packed nvlist
     -      -       -       -       -       -        -  packed nvlist size
     1    16K      1K   3.00K   3.00K   16.00     2.99  bplist
     -      -       -       -       -       -        -  bplist header
     -      -       -       -       -       -        -  SPA space map header
     3  12.0K   1.50K   4.50K   1.50K    8.00     4.48  SPA space map
     -      -       -       -       -       -        -  ZIL intent log
    16   256K   18.0K   40.0K   2.50K   14.22    39.80  DMU dnode
     3  3.00K   1.50K   3.50K   1.17K    2.00     3.48  DMU objset
     -      -       -       -       -       -        -  DSL directory
     4     2K      2K   6.00K   1.50K    1.00     5.97  DSL directory child map
     3  1.50K   1.50K   4.50K   1.50K    1.00     4.48  DSL dataset snap map
     4     2K      2K   6.00K   1.50K    1.00     5.97  DSL props
     -      -       -       -       -       -        -  DSL dataset
     -      -       -       -       -       -        -  ZFS znode
     -      -       -       -       -       -        -  ZFS V0 ACL
     1    512     512     512     512    1.00     0.50  ZFS plain file
     3  1.50K   1.50K   3.00K      1K    1.00     2.99  ZFS directory
     2     1K      1K      2K      1K    1.00     1.99  ZFS master node
     2     1K      1K      2K      1K    1.00     1.99  ZFS delete queue
     -      -       -       -       -       -        -  zvol object
     -      -       -       -       -       -        -  zvol prop
     -      -       -       -       -       -        -  other uint8[]
     -      -       -       -       -       -        -  other uint64[]
     -      -       -       -       -       -        -  other ZAP
     -      -       -       -       -       -        -  persistent error log
     1   128K   4.50K   13.5K   13.5K   28.44    13.43  SPA history
     -      -       -       -       -       -        -  SPA history offsets
     -      -       -       -       -       -        -  Pool properties
     -      -       -       -       -       -        -  DSL permissions
     -      -       -       -       -       -        -  ZFS ACL
     -      -       -       -       -       -        -  ZFS SYSACL
     -      -       -       -       -       -        -  FUID table
     -      -       -       -       -       -        -  FUID table size
     1    512     512   1.50K   1.50K    1.00     1.49  DSL dataset next clones
     -      -       -       -       -       -        -  scrub work queue
    50   454K   40.0K    101K   2.01K   11.35   100.00  Total

Here we can see the “zooming in” effect I described earlier. Here “BP” stands for “Block Pointer”. The most common “Type” you’ll see is “ZFS plain file”, that is, a normal data file like an image or textfile or something… the data you care about.
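The summary figures in the -b output appear to be simple quotients of the byte counts, which is handy to know when reading the report. The sketch below only redoes that arithmetic in Python with the numbers copied from the output above; the assumption that the second “compression” figure is logical size divided by allocated size is mine, though it does match the printed values.

```python
# Figures copied from the "zdb -b testpool" output above.
bp_count = 50
bp_logical = 464896    # bytes before compression
bp_physical = 40960    # bytes after compression
bp_allocated = 102912  # bytes actually allocated on the vdev


def ratio(numerator: int, denominator: int) -> float:
    """Ratio rounded to two decimals, the precision zdb prints."""
    return round(numerator / denominator, 2)


def average(total: int, count: int) -> int:
    """Per-block average, truncated the way the report shows it."""
    return total // count


print("logical/physical :", ratio(bp_logical, bp_physical))
print("logical/allocated:", ratio(bp_logical, bp_allocated))
print("avg logical      :", average(bp_logical, bp_count))
print("avg physical     :", average(bp_physical, bp_count))
print("avg allocated    :", average(bp_allocated, bp_count))
```

Allocated is larger than physical here, most likely because metadata blocks are stored in more than one copy, but I have not verified that against the source.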

Moving on to the second form, -d outputs datasets and their objects. This is where introspection really occurs. With a simple -d we can see a recursive list of datasets, but as we turn up the verbosity (-dd) we zoom into the objects within the dataset, and then just get more and more detail about those objects.

root@quadra ~$ zdb -d testpool/dataset01
Dataset testpool/dataset01 [ZPL], ID 30, cr_txg 6, 18.5K, 5 objects

root@quadra ~$ zdb -dd testpool/dataset01
Dataset testpool/dataset01 [ZPL], ID 30, cr_txg 6, 18.5K, 5 objects

    Object  lvl   iblk   dblk  lsize  asize  type
         0    7    16K    16K    16K  14.0K  DMU dnode
         1    1    16K    512    512     1K  ZFS master node
         2    1    16K    512    512     1K  ZFS delete queue
         3    1    16K    512    512     1K  ZFS directory
         4    1    16K    512    512    512  ZFS plain file

So let’s pause here. We can see the list of objects in my testpool/dataset01 by object id. This is important because we can use those ids to dig deeper on an individual object later. But for now, let’s zoom in a little bit more (-dddd) on this dataset.

root@quadra ~$ zdb -dddd testpool/dataset01
Dataset testpool/dataset01 [ZPL], ID 30, cr_txg 6, 18.5K, 5 objects, rootbp [L0 DMU objset] 400L/200P DVA[0]= DVA[1]= fletcher4 lzjb LE contiguous birth=8 fill=5 cksum=a525c6edf:45d1513a8c8:ef844ac0e80e:22b9de6164dd69

    Object  lvl   iblk   dblk  lsize  asize  type
         0    7    16K    16K    16K  14.0K  DMU dnode

    Object  lvl   iblk   dblk  lsize  asize  type
         1    1    16K    512    512     1K  ZFS master node
        microzap: 512 bytes, 6 entries

                casesensitivity = 0 
                normalization = 0 
                DELETE_QUEUE = 2 
                ROOT = 3 
                VERSION = 3 
                utf8only = 0 

    Object  lvl   iblk   dblk  lsize  asize  type
         2    1    16K    512    512     1K  ZFS delete queue
        microzap: 512 bytes, 0 entries


    Object  lvl   iblk   dblk  lsize  asize  type
         3    1    16K    512    512     1K  ZFS directory
                                 264  bonus  ZFS znode
        path    /
        uid     0
        gid     0
        atime   Fri Oct 31 12:35:30 2008
        mtime   Fri Oct 31 12:35:51 2008
        ctime   Fri Oct 31 12:35:51 2008
        crtime  Fri Oct 31 12:35:30 2008
        gen     6
        mode    40755
        size    3
        parent  3
        links   2
        xattr   0
        rdev    0x0000000000000000
        microzap: 512 bytes, 1 entries

                testfile01 = 4 (type: Regular File)

    Object  lvl   iblk   dblk  lsize  asize  type
         4    1    16K    512    512    512  ZFS plain file
                                 264  bonus  ZFS znode
        path    /testfile01
        uid     0
        gid     0
        atime   Fri Oct 31 12:35:51 2008
        mtime   Fri Oct 31 12:35:51 2008
        ctime   Fri Oct 31 12:35:51 2008
        crtime  Fri Oct 31 12:35:51 2008
        gen     8
        mode    100644
        size    21
        parent  3
        links   1
        xattr   0
        rdev    0x0000000000000000

Now, this output is short because the dataset includes only a single file. In the real world this output will be gigantic and should be redirected to a file. When I did this on the dataset containing my home directory the output file was 750MB… it’s a lot of data.

Look specifically at Object 4, a “ZFS plain file”. Notice that I can see that file’s pathname, uid, gid, a/m/c/crtime, mode, size, etc. This is where things can get really interesting!

In zdb’s 3rd form above (-R) we can actually display the contents of a file; however, we need its Data Virtual Address (DVA) and size to do so. In order to get that information, we can zoom in using -d a little further, but this time just on Object 4:

root@quadra /$ zdb -ddddd testpool/dataset01 4
Dataset testpool/dataset01 [ZPL], ID 30, cr_txg 6, 19.5K, 5 objects, rootbp [L0 DMU objset] 400L/200P DVA[0]= DVA[1]= fletcher4 lzjb LE contiguous birth=168 fill=5 cksum=a280728d9:448b88156d8:eaa0ad340c25:21f1a0a7d45740

    Object  lvl   iblk   dblk  lsize  asize  type
         4    1    16K    512    512    512  ZFS plain file
                                 264  bonus  ZFS znode
        path    /testfile01
        uid     0
        gid     0
        atime   Fri Oct 31 12:35:51 2008
        mtime   Fri Oct 31 12:35:51 2008
        ctime   Fri Oct 31 12:35:51 2008
        crtime  Fri Oct 31 12:35:51 2008
        gen     8
        mode    100644
        size    21
        parent  3
        links   1
        xattr   0
        rdev    0x0000000000000000
Indirect blocks:
               0 L0 0:11600:200 200L/200P F=1 B=8

                segment [0000000000000000, 0000000000000200) size   512

Now, see that “Indirect block” 0? Following L0 (Level 0) is a triple: “0:11600:200”. This is the DVA and size, or more specifically vdev:offset:size, with the offset and size in hexadecimal (200 hex is the 512 bytes shown in the segment line). We can use this information to request its contents directly.
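If you end up doing this for more than one block, pulling the triple out of the “Indirect blocks” line by hand gets old fast. Here is a minimal Python sketch that parses such a line and rebuilds the argument for zdb -R. The names BlockRef, parse_block_line and zdb_r_argument are made up for illustration, and I’m assuming the vdev id is printed in decimal while offset and size are hex.

```python
import re
from typing import NamedTuple


class BlockRef(NamedTuple):
    """Hypothetical container for one vdev:offset:size triple."""
    vdev: int
    offset: int  # bytes
    size: int    # bytes


def parse_block_line(line: str) -> BlockRef:
    """Extract vdev:offset:size from a line such as
    '0 L0 0:11600:200 200L/200P F=1 B=8'."""
    match = re.search(r"\bL\d+\s+(\d+):([0-9a-f]+):([0-9a-f]+)\b", line)
    if match is None:
        raise ValueError(f"no vdev:offset:size triple found in {line!r}")
    vdev, offset, size = match.groups()
    # Assumption: vdev is decimal, offset and size are hexadecimal.
    return BlockRef(int(vdev), int(offset, 16), int(size, 16))


def zdb_r_argument(pool: str, ref: BlockRef, flags: str = "") -> str:
    """Build the pool:vdev:offset:size[:flags] string that zdb -R takes."""
    arg = f"{pool}:{ref.vdev}:{ref.offset:x}:{ref.size:x}"
    return f"{arg}:{flags}" if flags else arg


ref = parse_block_line("               0 L0 0:11600:200 200L/200P F=1 B=8")
print(ref)
print(zdb_r_argument("testpool", ref))
print(zdb_r_argument("testpool", ref, flags="r"))
```

The sketch only builds the string; it never runs zdb itself, so it is safe to play with anywhere.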

And so, the -R form can display individual blocks from a device. To do so, we need to know the pool name, the vdev and offset (the DVA), and the size. Given what we did above, we now know all of that, so let’s try it:

root@quadra /$ zdb -R testpool:0:11600:200
Found vdev: /zdev/disk002

testpool:0:11600:200
          0 1 2 3 4 5 6 7   8 9 a b c d e f  0123456789abcdef
000000:  2073692073696854  6620747365742061  This is a test f
000010:  0000000a2e656c69  0000000000000000  ile.............
000020:  0000000000000000  0000000000000000  ................
000030:  0000000000000000  0000000000000000  ................
000040:  0000000000000000  0000000000000000  ................
000050:  0000000000000000  0000000000000000  ................
...

w00t! We can read the file contents!
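If the hex columns look scrambled compared to the ASCII column, that’s because the dump shows 64-bit words, and on a little-endian host (the block pointer above does say LE) the bytes within each word print in reverse. A small Python sketch, assuming that layout, turns the first two dump lines back into bytes; decode_dump_line is a name I made up for illustration.

```python
def decode_dump_line(line: str) -> bytes:
    """Convert one 'offset:  word  word  ascii' dump line back to 16 bytes.

    Assumes each hex column is a 64-bit word printed from a
    little-endian host, so the bytes inside each word are reversed.
    """
    _, rest = line.split(":", 1)
    words = rest.split()[:2]  # the two hex words; the ASCII column is ignored
    return b"".join(int(word, 16).to_bytes(8, "little") for word in words)


dump = [
    "000000:  2073692073696854  6620747365742061  This is a test f",
    "000010:  0000000a2e656c69  0000000000000000  ile.............",
]

data = b"".join(decode_dump_line(line) for line in dump)
print(data[:21])
```

The first 21 bytes are the file contents, including the trailing newline; everything after that is zero padding.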

You’ll notice in the zdb syntax (“zdb -h”) that this syntax above accepts flags as well. We can find these in the ZDB source. The most interesting is the “r” flag which rather than display the data as above, actually dumps the data in raw form to STDERR.

So why is this useful? Try this on for size:

root@quadra /$ rm /testpool/dataset01/testfile01
root@quadra /$ sync;sync
root@quadra /$ zdb -dd testpool/dataset01
Dataset testpool/dataset01 [ZPL], ID 30, cr_txg 6, 18.0K, 4 objects

    Object  lvl   iblk   dblk  lsize  asize  type
         0    7    16K    16K    16K  14.0K  DMU dnode
         1    1    16K    512    512     1K  ZFS master node
         2    1    16K    512    512     1K  ZFS delete queue
         3    1    16K    512    512     1K  ZFS directory

 ....... THE FILE IS REALLY GONE! ..........

root@quadra /$ zdb -R testpool:0:11600:200:r  2> /tmp/output
Found vdev: /zdev/disk002
root@quadra /$ ls -lh /tmp/output 
-rw-r--r-- 1 root root 512 Nov  1 01:54 /tmp/output
root@quadra /$ cat /tmp/output 
This is a test file.

How sweet is that! We delete a file, verify with zdb -dd that it really and truly is gone, and then bring it back out based on its DVA. Super sweet!

Now, before you get overly excited, some things to note… Firstly, if you delete a file in the real world you probably don’t have its DVA and size already recorded, so you’re screwed. Also, notice that the original file was 21 bytes, but the “recovered” file is 512… it’s been padded out to the block size, so if you recovered a file and tried using an MD5 hash or something to verify the content it wouldn’t match, even though the data was valid. In other words, the best “undelete” option is snapshots: they are quick and easy, so use them. Using zdb for file recovery isn’t practical.
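If you did happen to note the “size” field from the znode output (21 in this case), the padding is easy enough to strip afterwards. This is only a sketch for the simplest case, a file that fits in a single uncompressed block; trim_recovered is a hypothetical helper, and the padded block is simulated here rather than read from a real zdb dump.

```python
import hashlib


def trim_recovered(block: bytes, size: int) -> bytes:
    """Cut a dumped block back to the file size recorded in the znode."""
    if size > len(block):
        raise ValueError("znode size is larger than the dumped block")
    return block[:size]


# Simulate the situation above: a 21-byte file dumped as a 512-byte block.
original = b"This is a test file.\n"
dumped = original.ljust(512, b"\x00")

recovered = trim_recovered(dumped, 21)

print(hashlib.md5(dumped).hexdigest() == hashlib.md5(original).hexdigest())
print(hashlib.md5(recovered).hexdigest() == hashlib.md5(original).hexdigest())
```

The first comparison fails and the second succeeds, which is exactly the hash mismatch described above and its fix.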

I recently discovered and used this method to deal with a server that suffered extensive corruption as a result of a shitty (Sun Adaptec rebranded STK) RAID controller gone berserk following a routine disk replacement. I had several “corrupt” files that I could not read or reach; if I tried to do so I’d get a long pause, lots of errors to syslog, and then an “I/O Error” return. Hopeless: this is a “restore from backups” situation. Regardless, I wanted to learn from the experience. Here is an example of the result:

[root@server ~]$ ls -l /xxxxxxxxxxxxxx/images/logo.gif
/xxxxxxxxxxxxxx/images/logo.gif: I/O error

[root@server ~]$  zdb -ddddd pool/xxxxx 181359
Dataset pool/xxx [ZPL], ID 221, cr_txg 1281077, 3.76G, 187142 objects, rootbp [L0 DMU objset] 400L/200P DVA[0]= DVA[1]= fletcher4 lzjb LE contiguous birth=4543000 fill=187142 cksum=8cc6b0fec:3a1b508e8c0:c36726aec831:1be1f0eee0e22c

    Object  lvl   iblk   dblk  lsize  asize  type
    181359    1    16K     1K     1K     1K  ZFS plain file
                                 264  bonus  ZFS znode
        path    /xxxxxxxxxxxxxx/images/logo.gif
        atime   Wed Aug 27 07:42:17 2008
        mtime   Wed Apr 16 01:19:06 2008
        ctime   Thu May  1 00:18:34 2008
        crtime  Thu May  1 00:18:34 2008
        gen     1461218       
        mode    100644
        size    691  
        parent  181080
        links   1
        xattr   0
        rdev    0x0000000000000000
Indirect blocks:
               0 L0 0:b043f0c00:400 400L/400P F=1 B=1461218

                segment [0000000000000000, 0000000000000400) size    1K

[root@server ~]$ zdb -R pool:0:b043f0c00:400:r 2> out
Found vdev: /dev/dsk/c0t1d0s0
[root@server ~]$ file out   
out: GIF file, v89

Because real data is involved I had to cover up most of the above, but you can see how the methods we learned above were used to gain a positive result. Normal means of accessing the file failed miserably, but using zdb -R I dumped the file out. As a verification I opened the GIF in an image viewer and sure enough it looks perfect!

This is a lot to digest, but this is about as simple a primer to zdb as you’re going to find. Hopefully I’ve given you a solid grasp of the fundamentals so that you can experiment on your own.

Where do you go from here? As noted before, I recommend you now check out the following:

Max Bruning’s ZFS On-Disk Format Using mdb and zdb: Video presentation from the OpenSolaris Developer Conference in Prague on June 28, 2008. An absolute must watch for the hardcore ZFS enthusiast. Warning, may cause your head to explode!
Marcelo Leal’s 5 Part ZFS Internals Series. Leal has tremendous courage to post these, he’s doing tremendous work! Read it!

Good luck and happy zdb’ing…. don’t tell Sun. 🙂