Understanding ZFS: Transaction Groups & Disk Performance

23 Jan '09 - 06:42 by benr

I've been deeply concerned about the number of people who continue to use iostat as the means to universally judge IO as "good" or "bad". Before I explain why, let's review iostat.

# iostat -xnM c0t1d0 1  
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   19.9  240.9    2.1   18.4  0.0 13.7    0.2   52.4   4  56 c0t1d0
  127.2    0.0   14.7    0.0  0.0  1.0    0.1    7.7   2  78 c0t1d0
  116.0  375.0   13.6   21.6  0.1  3.2    0.1    6.5   7  82 c0t1d0
   27.0  407.0    2.6   30.8  0.1  9.3    0.1   21.5   6  99 c0t1d0
   95.0    6.0   11.7    0.2  0.0  0.9    0.1    8.6   1  78 c0t1d0
^C

The first 4 columns we can agree on: reads and writes per second, and megabytes read and written per second. The following 6 columns are where the concern lies. These are 3 different ways of viewing essentially the same data: the active queue (sent to the device) and the wait queue (waiting to be sent). Note that all of these are based on kstats and can easily be re-formulated into custom tools.

Universal "rules of thumb" regarding queues are very dangerous. I've heard such ridiculous suggestions as anything over 5% busy is a problem. The busy time simply denotes the quantity of time that IO was active... thus 100% busy means you're doing a lot of IO, not that it's slow. Oversimplifying the interpretation is as old skool as suggesting that a CPU more than 75% busy needs to be upgraded, which is moronic (but sells a lot of servers).

When it comes to iostat you must carefully weigh the number of IOs and the size of the IOs together with the service time, and come to a conclusion based on that. If you're doing a large streaming write you would expect to see 100% busy but very high throughput. Naturally, the great enemy of storage performance is random workloads that require a lot of head movement, in which case those seek times will kill you and things start to back up. In those cases tune your app... application caching is always a bigger win than adding spindles.
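To make that weighing concrete, here is a small Python sketch that takes the five samples from the iostat output above and derives IOs per second, MB per second and average IO size for each one. The function name `summarize` and the tuple layout are my own for illustration; nothing here talks to a real system.

```python
# Rows copied from the iostat sample above:
# (r/s, w/s, Mr/s, Mw/s, asvc_t in ms, %b)
SAMPLES = [
    (19.9, 240.9, 2.1, 18.4, 52.4, 56),
    (127.2, 0.0, 14.7, 0.0, 7.7, 78),
    (116.0, 375.0, 13.6, 21.6, 6.5, 82),
    (27.0, 407.0, 2.6, 30.8, 21.5, 99),
    (95.0, 6.0, 11.7, 0.2, 8.6, 78),
]


def summarize(row):
    """Combine the read and write columns of one iostat sample."""
    r_s, w_s, mr_s, mw_s, asvc_t, busy = row
    iops = r_s + w_s                     # total IOs per second
    mb_s = mr_s + mw_s                   # total MB per second
    avg_kb = mb_s * 1024.0 / iops if iops else 0.0  # average IO size in KB
    return {"iops": iops, "mb_s": mb_s, "avg_kb": avg_kb,
            "asvc_t": asvc_t, "busy": busy}


if __name__ == "__main__":
    for row in SAMPLES:
        s = summarize(row)
        print("%6.1f IO/s  %5.1f MB/s  %6.1f KB/IO  %5.1f ms  %3d%% busy"
              % (s["iops"], s["mb_s"], s["avg_kb"], s["asvc_t"], s["busy"]))
```

Read this way, the 99% busy sample is pushing about 33 MB/s in IOs of roughly 79 KB each, which looks a lot more like a disk doing useful work than a disk in trouble.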

In interactive shared server environments the closest to a rule I've ever provided was that, in my experience, active service times below 30ms are optimal, between 30ms and 100ms are worrisome, and higher than 100ms means that someone out there is probably unhappy. When IOs are regularly taking more than 100ms to complete, it is likely that the next fellow to type "ls" in an uncached directory is going to be pissed off.
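If you want that rough guide in a custom tool, it boils down to a couple of comparisons. This is only a sketch of my own rule of thumb for interactive shared servers; the function name and labels are made up for illustration, and the thresholds are parameters precisely because they are not universal.

```python
def rate_service_time(asvc_t_ms, good_below=30.0, bad_above=100.0):
    """Label an active service time (asvc_t, in ms) using the rough guide above."""
    if asvc_t_ms < good_below:
        return "optimal"
    if asvc_t_ms <= bad_above:
        return "worrisome"
    return "someone is probably unhappy"


if __name__ == "__main__":
    # asvc_t values taken from the iostat sample earlier in the post
    for asvc_t in (52.4, 7.7, 6.5, 21.5, 8.6):
        print("%5.1f ms -> %s" % (asvc_t, rate_service_time(asvc_t)))
```

Only the first sample (52.4 ms) lands in the worrisome band; the 99% busy sample at 21.5 ms does not.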

Now, all this gets more complicated with ZFS in the mix. You probably have heard that ZFS is transactional and as a result is always consistent on disk. But few really spend time thinking that through. Similar to an OLTP database, transactions are created, work is done, and then finally committed. This commit sends the transaction (tx) into a transaction group (txg) for "sync" to disk. At any given time there are 3 transaction groups: one in an open state accepting transactions, one in a quiescing state getting ready for sync, and one being sync'ed to disk. (For the sake of simplicity I'll leave O_DSYNC synchronous writes and the ZIL out of this discussion, for now.)
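A toy model may help picture the three groups. The Python below is purely illustrative and is not how ZFS is implemented: the class `ToyTxgPipeline` and its methods are invented for this sketch, and it ignores locking, the ZIL and everything else that makes the real thing hard.

```python
class ToyTxgPipeline:
    """Toy model of three transaction groups, one per state."""

    def __init__(self):
        # Each group is (txg id, list of writes held in memory).
        self.syncing = (1, [])
        self.quiescing = (2, [])
        self.open = (3, [])
        self.on_disk = []

    def write(self, data):
        """New transactions always join the open group."""
        self.open[1].append(data)

    def rotate(self):
        """Finish the syncing group, then advance the other two by one state."""
        txg_id, writes = self.syncing
        self.on_disk.extend(writes)          # the "sync" to disk
        self.syncing = self.quiescing
        self.quiescing = self.open
        self.open = (self.quiescing[0] + 1, [])
        return txg_id


if __name__ == "__main__":
    p = ToyTxgPipeline()
    p.write("a")
    for _ in range(3):
        done = p.rotate()
        print("txg %d synced, on disk: %s" % (done, p.on_disk))
```

The point of the toy is the delay: a write accepted into the open group only reaches disk a couple of rotations later, and until then it lives in memory.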

Between these transaction groups gathering writes in memory for orderly flush to disk and the ARC filesystem cache, most of your run of the mill IO goes to and from memory. I have a great many machines servicing more than 100,000 read ops per second without a single resulting physical read IO. ZFS efficiency is truly incredible. So in this way, the physical metrics can have little to nothing to do with the actual user experience, making typical tuning based on iostat highly suspect, if not entirely meaningless.

So, first things first. On a ZFS system never look at iostat alone. Always open 2 terminals side-by-side and in one terminal watch fsstat zfs 1 and in another watch iostat -xn 1 (or 10 seconds, whatever you're happy with). By watching both of these you'll get a better idea of what's really going on, and I expect that you'll be impressed by what you see.

As for async writes, what I really would like to see is how these transaction groups are doing. How often are transaction groups sync'ing to disk, how much are they sync'ing, and how much time is there in between? Prior to snv_87 transaction groups would flush upon fullness (1/8th of system memory) or when a txg_time tunable, defaulted to 5 seconds, expired. As a result, if you've looked at a system running something earlier than snv_87 and saw IO "spike" every 5 seconds, this is why... it's normal and healthy. In snv_87 a new ZFS Write Throttle was introduced and among the changes the sync timer got pushed out to 30 seconds. So, likewise, if you have a box post-87 that "spikes" every 30 seconds you have a very healthy system.
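Stated as code, the pre-snv_87 trigger described above is just "full or timed out". This Python sketch is a paraphrase of that sentence, not a transcription of the ZFS source: `should_sync` and its arguments are names I picked for illustration, with `txg_time` borrowed from the tunable mentioned above.

```python
GIB = 1024 ** 3


def should_sync(dirty_bytes, physmem_bytes, seconds_since_sync, txg_time=5):
    """Return True when a txg would flush under the pre-snv_87 rule.

    A flush is triggered either by fullness (dirty data reaching 1/8th
    of system memory) or by the timer (txg_time seconds, default 5).
    """
    full = dirty_bytes >= physmem_bytes // 8
    timed_out = seconds_since_sync >= txg_time
    return full or timed_out


if __name__ == "__main__":
    # A 16 GiB box: the fullness threshold is 2 GiB of dirty data.
    print(should_sync(1 * GIB, 16 * GIB, 1))   # neither condition met
    print(should_sync(2 * GIB, 16 * GIB, 1))   # full
    print(should_sync(0, 16 * GIB, 5))         # timer expired
```

On a lightly loaded box the timer is the condition that fires, which is why the "spike" shows up as a regular heartbeat.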

Knowing this is all good and well, but I'd like to see it. After spending a good amount of time in the code I realized that spa_sync() is the function to watch. It's what's responsible for actually sync'ing the txg to disk (God Bless DTrace stack[] aggregations!). With this knowledge I wrote up a D script that I was proud of, but, of course, now that I knew what to look for, found that Roch wrote up essentially the same thing 2 years ago.... nevertheless, I made a tweak and here it is:

#!/usr/sbin/dtrace -qs

/*
 * spa_sync.d - ROCH http://blogs.sun.com/roch/entry/128k_suffice 
 * mods by benr
 * 
 * Measure I/O throughput as generated by spa_sync 
 * Between the spa_sync entry and return probe
 * I count all I/O and bytes going through bdev_strategy.
 * This is a lower bound on what the device can do since
 * some aspects of spa_sync are non-concurrent I/Os.
 */

BEGIN {
        tt = 0; /* timestamp */
        b = 0; /* Bytecount */
        cnt = 0; /* iocount */
} 

spa_sync:entry
/(self->t == 0) && (tt == 0)/
{
        b = 0; /* reset the I/O byte count */
        cnt = 0;
        tt = timestamp; 
        self->t = 1;
        printf("%Y", walltimestamp);
}

spa_sync:return
/(self->t == 1) && (tt != 0)/
{
        this->delta = (timestamp-tt);
        this->cnt = (cnt == 0) ? 1 : cnt; /* avoid divide by 0 */
        printf("\t: %d MB; %d ms of spa_sync; avg sz : %d KB; throughput %d MB/s\n",
                b / 1048576,
                this->delta / 1000000,
                b / this->cnt / 1024,
                (b * 1000000000) / (this->delta * 1048576));
        tt = 0;
        self->t = 0;
}

/* We only count I/O issued during an spa_sync */
bdev_strategy:entry 
/tt != 0/
{ 
        cnt ++;
        b += (args[0]->b_bcount);
}

Here is a sample output, pre-snv_87:

# ./spa_sync.d 
2009 Jan 23 06:12:28    : 44 MB; 743 ms of spa_sync; avg sz : 68 KB; throughput 59 MB/s
2009 Jan 23 06:12:33    : 81 MB; 1716 ms of spa_sync; avg sz : 79 KB; throughput 47 MB/s
2009 Jan 23 06:12:38    : 45 MB; 736 ms of spa_sync; avg sz : 65 KB; throughput 61 MB/s
2009 Jan 23 06:12:43    : 41 MB; 700 ms of spa_sync; avg sz : 67 KB; throughput 59 MB/s
2009 Jan 23 06:12:48    : 56 MB; 1287 ms of spa_sync; avg sz : 63 KB; throughput 43 MB/s
2009 Jan 23 06:12:53    : 35 MB; 668 ms of spa_sync; avg sz : 65 KB; throughput 52 MB/s
2009 Jan 23 06:12:58    : 61 MB; 1147 ms of spa_sync; avg sz : 62 KB; throughput 53 MB/s
2009 Jan 23 06:13:03    : 41 MB; 624 ms of spa_sync; avg sz : 60 KB; throughput 67 MB/s
2009 Jan 23 06:13:08    : 37 MB; 658 ms of spa_sync; avg sz : 60 KB; throughput 56 MB/s
2009 Jan 23 06:13:13    : 59 MB; 1035 ms of spa_sync; avg sz : 68 KB; throughput 57 MB/s
^C

Notice it hitting the 5 second mark nicely. This output is significantly more telling and encouraging than simply looking at iostat alone.
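If you save that output to a file, a few lines of Python will pull out the interval between syncs and the share of each interval spent inside spa_sync. This is a sketch that assumes the exact output format shown above; the names `parse` and `report` are mine, and only the first three sample lines are embedded here.

```python
import re
from datetime import datetime

SAMPLE = """\
2009 Jan 23 06:12:28    : 44 MB; 743 ms of spa_sync; avg sz : 68 KB; throughput 59 MB/s
2009 Jan 23 06:12:33    : 81 MB; 1716 ms of spa_sync; avg sz : 79 KB; throughput 47 MB/s
2009 Jan 23 06:12:38    : 45 MB; 736 ms of spa_sync; avg sz : 65 KB; throughput 61 MB/s
"""

LINE = re.compile(
    r"^(\d{4} \w{3} +\d+ \d\d:\d\d:\d\d)\s*: (\d+) MB; (\d+) ms of spa_sync")


def parse(text):
    """Return a list of (timestamp, MB written, ms spent in spa_sync)."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            stamp = " ".join(m.group(1).split())
            when = datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
            rows.append((when, int(m.group(2)), int(m.group(3))))
    return rows


def report(rows):
    """Return (seconds since previous sync, MB/s, % of interval syncing)."""
    out = []
    prev = None
    for when, mb, ms in rows:
        gap = (when - prev).total_seconds() if prev else None
        mb_per_s = mb * 1000.0 / ms if ms else 0.0
        duty = 100.0 * ms / (gap * 1000.0) if gap else None
        out.append((gap, mb_per_s, duty))
        prev = when
    return out


if __name__ == "__main__":
    for gap, mb_per_s, duty in report(parse(SAMPLE)):
        print(gap, round(mb_per_s, 1), duty)
```

On these samples the gap is 5 seconds each time, and the disk spends well under half of each interval actually sync'ing, which is the kind of headroom iostat's %b column alone will not show you.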

This of course also reminds us... Roch Rocks.


- - C O M M E N T S - -

i was surprised to see that the dtrace script completely hung my thumper with NexentaStor installed on it after running for 2 minutes.

Mike (Email) - 23 January '09 - 23:09

i verified with latest v1.1.5 of just released NexentaStor software – it runs OK

erast (Email) - 27 February '09 - 22:34

