Understanding ZFS: Transaction Groups & Disk Performance

I’ve been deeply concerned about the number of people who continue to use iostat as the means to universally judge IO as “good” or “bad”. Before I explain why, let’s review iostat.

# iostat -xnM c0t1d0 1
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   19.9  240.9    2.1   18.4  0.0 13.7    0.2   52.4   4  56 c0t1d0
  127.2    0.0   14.7    0.0  0.0  1.0    0.1    7.7   2  78 c0t1d0
  116.0  375.0   13.6   21.6  0.1  3.2    0.1    6.5   7  82 c0t1d0
   27.0  407.0    2.6   30.8  0.1  9.3    0.1   21.5   6  99 c0t1d0
   95.0    6.0   11.7    0.2  0.0  0.9    0.1    8.6   1  78 c0t1d0
^C

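The first 4 columns we can agree on: reads and writes per second, and megabytes read and written per second. The following 6 columns are the ones that cause concern. They are 3 different ways of viewing essentially the same data, the active queue (sent to the device) and the wait queue (waiting to be sent): average queue length (wait, actv), average service time (wsvc_t, asvc_t), and the percentage of time the queue was non-empty (%w, %b). Note that all of these are based on kstats and can easily be re-formulated into custom tools.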
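As a quick sanity check that these columns really are views of the same counters, Little's law says the average number of requests in service equals the IO rate times the average service time. The sketch below is only an illustration (the helper name `expected_actv` is mine, not part of any tool); it recomputes actv from the r/s, w/s and asvc_t values in the sample output above.

```python
def expected_actv(reads_per_s, writes_per_s, asvc_t_ms):
    """Little's law: requests in service = IOs per second * service time in seconds."""
    return (reads_per_s + writes_per_s) * (asvc_t_ms / 1000.0)


# First and second rows of the iostat sample: r/s, w/s, asvc_t
print(round(expected_actv(19.9, 240.9, 52.4), 1))  # iostat reported actv 13.7
print(round(expected_actv(127.2, 0.0, 7.7), 1))    # iostat reported actv 1.0
```

The recomputed values line up with the actv column, which is why a long queue and a long service time are not two independent pieces of bad news.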

Universal “rules of thumb” regarding queues are very dangerous. I’ve heard such ridiculous suggestions as “anything over 5% busy is a problem”. The busy time simply denotes the amount of time that IO was active… thus 100% busy means you’re doing a lot of IO, not that it’s slow. Oversimplifying the interpretation is as old skool as suggesting that a CPU more than 75% busy needs to be upgraded, which is moronic (but sells a lot of servers).

When it comes to iostat you must carefully balance the number of IOs and the size of IOs against the service time, and come to a conclusion based on all three. If you’re doing a large streaming write you would expect to see 100% busy but very high throughput. Naturally, the great enemy of storage performance is a random workload that requires a lot of head movement; those seek times will kill you and things start to back up. In those cases tune your app… application caching is always a bigger win than adding spindles.
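The size of the IOs is not printed by iostat, but it falls out of the throughput and rate columns. A minimal sketch, assuming the -M columns are in units of 1024 KB (the function name `avg_io_kb` is hypothetical):

```python
def avg_io_kb(mb_per_s, ops_per_s):
    """Average size of one IO in KB, or 0.0 when no IO was issued."""
    if ops_per_s == 0:
        return 0.0
    return mb_per_s * 1024.0 / ops_per_s


# Writes from the first sample row (Mw/s, w/s), reads from the second (Mr/s, r/s)
print(round(avg_io_kb(18.4, 240.9)))
print(round(avg_io_kb(14.7, 127.2)))
```

For the sample above this works out to roughly 78 KB per write in the first interval and roughly 118 KB per read in the second, which already tells you these are not tiny random IOs.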

In interactive shared server environments the closest to a rule I’ve ever provided is that, in my experience, active service times (asvc_t) below 30ms are optimal, between 30ms and 100ms are worrisome, and higher than 100ms means that someone out there is probably unhappy. When IOs are regularly taking more than 100ms to complete, it is likely that the next fellow to type “ls” in an uncached directory is going to be pissed off.
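If you want that rough guide in a script, it is just two thresholds. This is a sketch of my personal rule for interactive shared servers, not a universal one, and the function name `judge_asvc_t` is made up for illustration:

```python
def judge_asvc_t(asvc_t_ms):
    """Bucket an active service time (ms) using the 30ms / 100ms guide."""
    if asvc_t_ms < 30:
        return "optimal"
    if asvc_t_ms <= 100:
        return "worrisome"
    return "someone is unhappy"


# asvc_t values from the iostat sample at the top of the post
for asvc_t in (52.4, 7.7, 6.5, 21.5, 8.6):
    print(asvc_t, judge_asvc_t(asvc_t))
```

Remember that this only makes sense alongside the IO count and size; a single interval in the "worrisome" bucket during a big write burst is not a verdict.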

Now, all this gets more complicated with ZFS in the mix. You have probably heard that ZFS is transactional and as a result is always consistent on disk. But few really spend time thinking that through. Similar to an OLTP database, transactions are created, work is done, and then finally committed. This commit sends the transaction (tx) into a transaction group (txg) for “sync” to disk. At any given time there are 3 transaction groups: one in an open state accepting transactions, one in a quiescing state getting ready for sync, and one being sync’ed to disk. (For the sake of simplicity I’ll leave O_DSYNC synchronous writes and the ZIL out of this discussion, for now.)

Between these transaction groups gathering writes in memory for orderly flush to disk and the ARC filesystem cache, most of your run-of-the-mill IO is served to and from memory. I have a great many machines servicing more than 100,000 read ops per second without a single resulting physical read IO. ZFS efficiency is truly incredible. So the physical metrics can have little to nothing to do with the actual user experience, making typical tuning based on iostat highly suspect, if not entirely meaningless.

So, first things first. On a ZFS system never look at iostat alone. Always open 2 terminals side-by-side and in one terminal watch fsstat zfs 1 and in the other watch iostat -xn 1 (or 10 seconds, whatever you’re happy with). By watching both of these you’ll get a better idea of what’s really going on, and I expect that you’ll be impressed by what you see.

As for async writes, what I really would like to see is how these transaction groups are doing: how often they are sync’ing to disk, how much they are sync’ing, and how much time there is in between. Prior to snv_87 transaction groups would flush upon fullness (1/8th of system memory) or when a txg_time tunable, defaulted to 5 seconds, expired. As a result, if you’ve looked at a system running something earlier than snv_87 and saw IO “spike” every 5 seconds, this is why… it’s normal and healthy. In snv_87 a new ZFS Write Throttle was introduced and among the changes the sync timer got pushed out to 30 seconds. So, likewise, if you have a box post-87 that “spikes” every 30 seconds you have a very healthy system.
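The pre-snv_87 rule can be written down as a toy model. To be clear, this is not the ZFS code, just the two conditions described above turned into a function; `txg_should_sync` and its parameter names are invented for illustration:

```python
def txg_should_sync(dirty_bytes, physmem_bytes, seconds_open, txg_time=5):
    """Toy model: sync when the txg holds 1/8th of memory or the timer expires."""
    full = dirty_bytes >= physmem_bytes // 8
    timed_out = seconds_open >= txg_time
    return full or timed_out


GB = 1024 ** 3
print(txg_should_sync(100 * 1024, 8 * GB, 1))   # small txg, timer not expired
print(txg_should_sync(100 * 1024, 8 * GB, 5))   # timer expired
print(txg_should_sync(1 * GB, 8 * GB, 1))       # txg reached 1/8th of memory
```

On a lightly loaded box the timer is what fires, which is where the regular 5 second rhythm comes from; passing txg_time=30 mimics only the longer timer, not the rest of the write throttle changes.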

Knowing this is all well and good, but I’d like to see it. After spending a good amount of time in the code I realized that spa_sync() is the function to watch. It’s what’s responsible for actually sync’ing the txg to disk (God Bless DTrace stack[] aggregations!). With this knowledge I wrote up a D script that I was proud of but, of course, now that I knew what to look for, found that Roch wrote up essentially the same thing 2 years ago… nevertheless, I made a tweak and here it is:

#!/usr/sbin/dtrace -qs

/*
 * spa_sync.d - ROCH http://blogs.sun.com/roch/entry/128k_suffice
 * mods by benr
 *
 * Measure I/O throughput as generated by spa_sync
 * Between the spa_sync entry and return probe
 * I count all I/O and bytes going through bdev_strategy.
 * This is a lower bound on what the device can do since
 * some aspects of spa_sync are non-concurrent I/Os.
 */

BEGIN {
        tt = 0; /* timestamp */
        b = 0; /* Bytecount */
        cnt = 0; /* iocount */
} 

spa_sync:entry
/(self->t == 0) && (tt == 0)/
{
        b = 0; /* reset the I/O byte count */
        cnt = 0;
        tt = timestamp;
        self->t = 1;
        printf("%Y", walltimestamp);
}

spa_sync:return
/(self->t == 1) && (tt != 0)/
{
        this->delta = (timestamp-tt);
        this->cnt = (cnt == 0) ? 1 : cnt; /* avoid divide by 0 */
        printf("\t: %d MB; %d ms of spa_sync; avg sz : %d KB; throughput %d MB/s\n",
                b / 1048576,
                this->delta / 1000000,
                b / this->cnt / 1024,
                (b * 1000000000) / (this->delta * 1048576));
        tt = 0;
        self->t = 0;
}

/* We only count I/O issued during an spa_sync */
bdev_strategy:entry
/tt != 0/
{
        cnt ++;
        b += (args[0]->b_bcount);
}

Here is a sample output, pre-snv_87:

# ./spa_sync.d
2009 Jan 23 06:12:28    : 44 MB; 743 ms of spa_sync; avg sz : 68 KB; throughput 59 MB/s
2009 Jan 23 06:12:33    : 81 MB; 1716 ms of spa_sync; avg sz : 79 KB; throughput 47 MB/s
2009 Jan 23 06:12:38    : 45 MB; 736 ms of spa_sync; avg sz : 65 KB; throughput 61 MB/s
2009 Jan 23 06:12:43    : 41 MB; 700 ms of spa_sync; avg sz : 67 KB; throughput 59 MB/s
2009 Jan 23 06:12:48    : 56 MB; 1287 ms of spa_sync; avg sz : 63 KB; throughput 43 MB/s
2009 Jan 23 06:12:53    : 35 MB; 668 ms of spa_sync; avg sz : 65 KB; throughput 52 MB/s
2009 Jan 23 06:12:58    : 61 MB; 1147 ms of spa_sync; avg sz : 62 KB; throughput 53 MB/s
2009 Jan 23 06:13:03    : 41 MB; 624 ms of spa_sync; avg sz : 60 KB; throughput 67 MB/s
2009 Jan 23 06:13:08    : 37 MB; 658 ms of spa_sync; avg sz : 60 KB; throughput 56 MB/s
2009 Jan 23 06:13:13    : 59 MB; 1035 ms of spa_sync; avg sz : 68 KB; throughput 57 MB/s
^C

Notice it hitting the 5 second mark nicely. This output is significantly more telling and encouraging than simply looking at iostat alone.
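If you capture the script output to a file, the cadence is easy to check without eyeballing timestamps. The following is a rough sketch that assumes the exact output format shown above; the function names are mine and the sample is just the first three lines of that output.

```python
import re
from datetime import datetime

# Matches one line of spa_sync.d output in the format shown in the sample.
LINE_RE = re.compile(
    r"^(?P<when>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2})\s*:\s*"
    r"(?P<mb>\d+) MB; (?P<ms>\d+) ms of spa_sync"
)


def parse_syncs(text):
    """Return a list of (start_time, megabytes, sync_ms) tuples."""
    syncs = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip the prompt, ^C and blank lines
        stamp = " ".join(m.group("when").split())
        when = datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
        syncs.append((when, int(m.group("mb")), int(m.group("ms"))))
    return syncs


def sync_intervals(syncs):
    """Seconds between the starts of consecutive spa_sync calls."""
    return [(b[0] - a[0]).total_seconds() for a, b in zip(syncs, syncs[1:])]


def busy_fraction(syncs):
    """Fraction of each interval spent inside spa_sync (last sync has no interval)."""
    return [s[2] / 1000.0 / gap
            for s, gap in zip(syncs, sync_intervals(syncs)) if gap > 0]


SAMPLE = """\
2009 Jan 23 06:12:28    : 44 MB; 743 ms of spa_sync; avg sz : 68 KB; throughput 59 MB/s
2009 Jan 23 06:12:33    : 81 MB; 1716 ms of spa_sync; avg sz : 79 KB; throughput 47 MB/s
2009 Jan 23 06:12:38    : 45 MB; 736 ms of spa_sync; avg sz : 65 KB; throughput 61 MB/s
"""

if __name__ == "__main__":
    syncs = parse_syncs(SAMPLE)
    print(sync_intervals(syncs))
    print([round(f, 2) for f in busy_fraction(syncs)])
```

The busy fraction is the interesting number here: a pool that spends well under half of each interval inside spa_sync has headroom, whatever %b said during the burst.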

This of course also reminds us… Roch Rocks.

53 Responses to “Understanding ZFS: Transaction Groups & Disk Performance”

  1. Mike says:

    i was surprised to see that the dtrace script completely hung my thumper with NexentaStor installed on it after running for 2 minutes.

  2. erast says:

    i verified with latest v1.1.5 of just released NexentaStor software – it runs OK
