Understanding ZFS: Disk Space Discrepancies

Posted on January 21, 2009

Here’s a good ZFS trivia question to bewilder your friends, and a common question in its own right: why do the two following outputs for the same ZFS pool disagree?

# zfs list zones
NAME    USED  AVAIL  REFER  MOUNTPOINT
zones   634G   185G    69K  /zones

# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
zones                   832G    634G    198G    76%  ONLINE     -

If you add up USED and AVAIL from the zfs list output you get 819GB. So why does zfs list say we have 819GB while zpool list says we have an 832GB pool?

This is a question I have tried to answer in the past using zdb, quite unsuccessfully. I finally found the answer while digging through the ZFS code (dsl_pool.c):

    uint64_t
    dsl_pool_adjustedsize(dsl_pool_t *dp, boolean_t netfree)
    {
            uint64_t space, resv;

            /*
             * Reserve about 1.6% (1/64), or at least 32MB, for allocation
             * efficiency.
             * XXX The intent log is not accounted for, so it must fit
             * within this slop.
             *
             * If we're trying to assess whether it's OK to do a free,
             * cut the reservation in half to allow forward progress
             * (e.g. make it possible to rm(1) files from a full pool).
             */
            space = spa_get_dspace(dp->dp_spa);
            resv = MAX(space >> 6, SPA_MINDEVSIZE >> 1);
            if (netfree)
                    resv >>= 1;

            return (space - resv);
    }

So 1/64th of the pool is reserved? That makes sense: a copy-on-write filesystem is in trouble if it truly hits 100% used. But do the numbers fit?

1/64th of ZPool:     832G / 64 = 13GB
Output Discrepancy:  832G - 819G = 13GB

Right on the money.
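
If you want to sanity-check the arithmetic yourself, here is a minimal userspace sketch of the same computation. The 832G pool size is simply taken from the zpool list output above, and SPA_MINDEVSIZE is assumed to be 64MB (so half of it is the 32MB floor mentioned in the comment):

    /*
     * Standalone sketch of dsl_pool_adjustedsize()'s reservation math.
     * Assumes SPA_MINDEVSIZE == 64MB; the 832G pool size comes from the
     * example zpool list output, not from querying a live pool.
     */
    #include <stdio.h>
    #include <stdint.h>

    #define SPA_MINDEVSIZE  (64ULL << 20)           /* assumed: 64MB */
    #define MAX(a, b)       ((a) > (b) ? (a) : (b))

    int
    main(void)
    {
            uint64_t space = 832ULL << 30;          /* pool size per zpool list */
            uint64_t resv = MAX(space >> 6, SPA_MINDEVSIZE >> 1);

            printf("reserved: %llu GB\n", (unsigned long long)(resv >> 30));
            printf("usable:   %llu GB\n", (unsigned long long)((space - resv) >> 30));
            return (0);
    }

Compiled and run, it prints 13GB reserved and 819GB usable, matching the discrepancy above.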

Therefore, let it be known! On top of the capacity lost to whatever RAID scheme you choose, you will lose 1/64th of the pool to this reservation. Based on prior experiments, I can verify that when you hit 100% used according to zfs list, you cannot write any more data. Therefore, never report or monitor ZFS capacity based on zpool list! As a point of interest, df also ends up going through this adjusted-size function and therefore reports correct numbers, so if your monitoring system tracks disk capacity based on df, no change is needed.
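
If you're curious where df gets those numbers, it calls statvfs(2), and on a ZFS filesystem the values returned there already reflect the adjusted size. Here's a minimal sketch of a capacity check built on the same call; the /zones mountpoint is just the one from the example above:

    /*
     * Minimal sketch of a capacity check built on statvfs(2), the same
     * call df uses. On ZFS these figures already have the 1/64 slop
     * subtracted, so they line up with zfs list rather than zpool list.
     * The "/zones" path is the example mountpoint from above.
     */
    #include <stdio.h>
    #include <sys/statvfs.h>

    int
    main(void)
    {
            struct statvfs vfs;
            unsigned long long size, avail;

            if (statvfs("/zones", &vfs) != 0) {
                    perror("statvfs");
                    return (1);
            }

            size = (unsigned long long)vfs.f_blocks * vfs.f_frsize;
            avail = (unsigned long long)vfs.f_bavail * vfs.f_frsize;

            printf("size:  %llu GB\n", size >> 30);
            printf("avail: %llu GB\n", avail >> 30);
            return (0);
    }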