Understanding ZFS: Transaction Groups & Disk Performance

Posted on January 23, 2009

I’ve been deeply concerned about the number of people who continue to use iostat as the universal means of judging IO as “good” or “bad”. Before I explain why, let’s review iostat.

# iostat -xnM c0t1d0 1  
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   19.9  240.9    2.1   18.4  0.0 13.7    0.2   52.4   4  56 c0t1d0
  127.2    0.0   14.7    0.0  0.0  1.0    0.1    7.7   2  78 c0t1d0
  116.0  375.0   13.6   21.6  0.1  3.2    0.1    6.5   7  82 c0t1d0
   27.0  407.0    2.6   30.8  0.1  9.3    0.1   21.5   6  99 c0t1d0
   95.0    6.0   11.7    0.2  0.0  0.9    0.1    8.6   1  78 c0t1d0

The first 4 columns we can agree on: reads and writes per second, and megabytes read and written per second. The following 6 columns are where the concern lies. They are 3 different views of essentially the same data: the active queue (IOs sent to the device) and the wait queue (IOs waiting to be sent). Note that all of these numbers come from kstats and can easily be re-formulated into custom tools.

Universal “rules of thumb” regarding queues are very dangerous. I’ve heard such ridiculous suggestions as “anything over 5% busy is a problem”. The busy time simply denotes the quantity of time that IO was active… thus 100% busy means you’re doing a lot of IO, not that it’s slow. Over-simplifying the interpretation is as old skool as suggesting that a CPU more than 75% busy needs to be upgraded, which is moronic (but sells a lot of servers).

When it comes to iostat you must carefully weigh the number of IOs and the size of IOs together with the service times, and come to a conclusion based on all of them. If you’re doing a large streaming write you would expect to see 100% busy but very high throughput. Naturally, the great enemy of storage performance is a random workload that requires a lot of head movement; in that case the seek times will kill you and things start to back up. In those cases, tune your app… application caching is always a bigger win than adding spindles.
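To make that balancing concrete: average IO size falls straight out of the first four columns by dividing throughput by operations. A quick sketch using the second sample line above (127.2 r/s at 14.7 MB/s):

```shell
# avg read size in KB = (Mr/s * 1024) / (r/s), numbers taken from the
# iostat sample above; ~118 KB per read hints at streaming rather than
# small random IO, which squares with the modest 7.7 ms service time
awk 'BEGIN { printf "%.0f KB avg read\n", (14.7 * 1024) / 127.2 }'
```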

In interactive shared-server environments the closest thing to a rule I’ve ever provided is that, in my experience, active service times below 30ms are optimal, between 30ms and 100ms are worrisome, and higher than 100ms means that someone out there is probably unhappy. When IOs are regularly taking more than 100ms to complete, it’s likely that the next fellow to type “ls” in an uncached directory is going to be pissed off.
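Those bands are easy to turn into a filter. Here’s an illustrative awk one-liner (my own sketch, not a standard tool) that buckets each iostat -xn sample line by its asvc_t column:

```shell
# reads `iostat -xn` output on stdin; asvc_t is field 8 and the
# device name is field 11; the bands follow the 30 ms / 100 ms
# rule of thumb above
awk '$1 ~ /^[0-9.]+$/ {
        band = ($8 < 30) ? "ok" : (($8 <= 100) ? "worrisome" : "unhappy")
        printf "%s %.1f ms %s\n", $11, $8, band
}'
```

Piping the first sample line from the top of the post through it reports c0t1d0 at 52.4 ms as “worrisome”.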

Now, all this gets more complicated with ZFS in the mix. You’ve probably heard that ZFS is transactional and, as a result, always consistent on disk. But few really spend time thinking that through. Similar to an OLTP database, transactions are created, work is done, and then finally committed. This commit sends the transaction (tx) into a transaction group (txg) for “sync” to disk. At any given time there are 3 transaction groups: one in an open state accepting transactions, one in a quiescing state ready for sync, and one being sync’ed to disk. (For the sake of simplicity I’ll leave O_DSYNC synchronous writes and the ZIL out of this discussion, for now.)

Between these transaction groups gathering writes in memory for an orderly flush to disk, and the ARC filesystem cache, most of your run-of-the-mill IO is going back and forth to memory. I have a great many machines servicing more than 100,000 read ops per second without a single resulting physical read IO. ZFS efficiency is truly incredible. In this way the physical metrics can have little to nothing to do with the actual user experience, making typical tuning based on iostat highly suspect, if not entirely meaningless.
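One way to quantify that efficiency is to compare logical operations (fsstat) against physical operations (iostat) over the same interval. The numbers below are invented for illustration, but the arithmetic is the point:

```shell
# hypothetical interval: 100000 logical read ops seen by fsstat,
# only 800 physical reads seen by iostat -- the ARC absorbed the rest
awk 'BEGIN { printf "%.1f%% of reads served from cache\n", (1 - 800 / 100000) * 100 }'
```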

So, first things first: on a ZFS system never look at iostat alone. Always open 2 terminals side-by-side and in one terminal watch “fsstat zfs 1” and in the other watch “iostat -xn 1” (or 10-second intervals, whatever you’re happy with). By watching both you’ll get a much better idea of what’s really going on, and I expect you’ll be impressed by what you see.

As for async writes, what I really want to see is how those transaction groups are doing: how often they sync to disk, how much they sync, and how much time passes in between. Prior to snv_87, a transaction group would flush upon fullness (1/8th of system memory) or when the txg_time tunable, defaulting to 5 seconds, expired. As a result, if you’ve looked at a system running something earlier than snv_87 and seen IO “spike” every 5 seconds, this is why… it’s normal and healthy. In snv_87 the new ZFS Write Throttle was introduced and, among its changes, the sync timer got pushed out to 30 seconds. So, likewise, if you have a post-snv_87 box that “spikes” every 30 seconds, you have a very healthy system.
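The fullness trigger is simple arithmetic. Assuming the 1/8th-of-memory figure above, a pre-snv_87 box with 16 GB of RAM (a size picked purely for illustration) would sync a txg early once it gathered about 2 GB of dirty data:

```shell
# 1/8 of 16 GB of system memory, expressed in MB: whichever comes
# first wins, this fullness threshold or the 5 second txg_time timer
awk 'BEGIN { printf "%d MB txg fullness threshold\n", (16 * 1024) / 8 }'
```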

Knowing this is all good and well, but I’d like to see it. After spending a good amount of time in the code I realized that spa_sync() is the function to watch. It’s what’s responsible for actually sync’ing the txg to disk (God bless DTrace stack[] aggregations!). With this knowledge I wrote up a D script I was proud of, but, of course, now that I knew what to look for, I found that Roch wrote up essentially the same thing 2 years ago… nevertheless, I made a tweak and here it is:

#!/usr/sbin/dtrace -qs

/*
 * spa_sync.d - ROCH http://blogs.sun.com/roch/entry/128k_suffice
 * mods by benr
 * Measure I/O throughput as generated by spa_sync.
 * Between the spa_sync entry and return probes
 * I count all I/O and bytes going through bdev_strategy.
 * This is a lower bound on what the device can do since
 * some aspects of spa_sync are non-concurrent I/Os.
 */

BEGIN
{
        tt = 0;   /* timestamp */
        b = 0;    /* byte count */
        cnt = 0;  /* I/O count */
}

spa_sync:entry
/(self->t == 0) && (tt == 0)/
{
        b = 0; /* reset the I/O byte count */
        cnt = 0;
        tt = timestamp;
        self->t = 1;
        printf("%Y", walltimestamp);
}

spa_sync:return
/(self->t == 1) && (tt != 0)/
{
        this->delta = (timestamp - tt);
        this->cnt = (cnt == 0) ? 1 : cnt; /* avoid divide by 0 */
        printf("\t: %d MB; %d ms of spa_sync; avg sz : %d KB; throughput %d MB/s\n",
                b / 1048576,
                this->delta / 1000000,
                b / this->cnt / 1024,
                (b * 1000000000) / (this->delta * 1048576));
        tt = 0;
        self->t = 0;
}

/* We only count I/O issued during an spa_sync */
bdev_strategy:entry
/tt != 0/
{
        cnt++;
        b += (args[0]->b_bcount);
}

Here is a sample output, pre-snv_87:

# ./spa_sync.d 
2009 Jan 23 06:12:28    : 44 MB; 743 ms of spa_sync; avg sz : 68 KB; throughput 59 MB/s
2009 Jan 23 06:12:33    : 81 MB; 1716 ms of spa_sync; avg sz : 79 KB; throughput 47 MB/s
2009 Jan 23 06:12:38    : 45 MB; 736 ms of spa_sync; avg sz : 65 KB; throughput 61 MB/s
2009 Jan 23 06:12:43    : 41 MB; 700 ms of spa_sync; avg sz : 67 KB; throughput 59 MB/s
2009 Jan 23 06:12:48    : 56 MB; 1287 ms of spa_sync; avg sz : 63 KB; throughput 43 MB/s
2009 Jan 23 06:12:53    : 35 MB; 668 ms of spa_sync; avg sz : 65 KB; throughput 52 MB/s
2009 Jan 23 06:12:58    : 61 MB; 1147 ms of spa_sync; avg sz : 62 KB; throughput 53 MB/s
2009 Jan 23 06:13:03    : 41 MB; 624 ms of spa_sync; avg sz : 60 KB; throughput 67 MB/s
2009 Jan 23 06:13:08    : 37 MB; 658 ms of spa_sync; avg sz : 60 KB; throughput 56 MB/s
2009 Jan 23 06:13:13    : 59 MB; 1035 ms of spa_sync; avg sz : 68 KB; throughput 57 MB/s

Notice it hitting the 5 second mark nicely. This output is significantly more telling and encouraging than simply looking at iostat alone.
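You can sanity-check the script’s arithmetic against the first output line: 44 MB moved during 743 ms of spa_sync should work out to roughly the reported 59 MB/s (the script itself does integer math on nanoseconds, so expect small rounding differences):

```shell
# 44 MB / 0.743 s of sync time ~= 59 MB/s, matching the
# (bytes * 10^9) / (delta_ns * 2^20) expression in the script
awk 'BEGIN { printf "%d MB/s\n", 44 / 0.743 }'
```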

This of course also reminds us… Roch Rocks.