I’ve been deeply concerned about the number of people who continue to use iostat as the means to universally judge IO as “good” or “bad”. Before I explain why, let’s review iostat.
# iostat -xnM c0t1d0 1
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   19.9  240.9    2.1   18.4  0.0 13.7    0.2   52.4   4  56 c0t1d0
  127.2    0.0   14.7    0.0  0.0  1.0    0.1    7.7   2  78 c0t1d0
  116.0  375.0   13.6   21.6  0.1  3.2    0.1    6.5   7  82 c0t1d0
   27.0  407.0    2.6   30.8  0.1  9.3    0.1   21.5   6  99 c0t1d0
   95.0    6.0   11.7    0.2  0.0  0.9    0.1    8.6   1  78 c0t1d0
^C
The first 4 columns we can agree on: reads and writes per second, and megabytes read and written per second. The following 6 columns are where the concern lies. They are 3 different ways of viewing essentially the same data (queue length, service time, and percentage of time) for each of two queues: the active queue (sent to device) and the wait queue (waiting to be sent). Note that all of these are based on kstats and can easily be reformulated into custom tools.
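To see why the queue columns are views of the same data, here is a small sketch in Python (not part of any Solaris tool) that recomputes actv from the other columns using Little's law, and also derives the average IO size. The rows are copied from the iostat sample above; the function names are made up for illustration, and 1 MB is assumed to be 1024 KB.

```python
# Rows from the iostat sample above:
# (r/s, w/s, Mr/s, Mw/s, actv, asvc_t in ms)
SAMPLES = [
    (19.9, 240.9, 2.1, 18.4, 13.7, 52.4),
    (127.2, 0.0, 14.7, 0.0, 1.0, 7.7),
    (116.0, 375.0, 13.6, 21.6, 3.2, 6.5),
    (27.0, 407.0, 2.6, 30.8, 9.3, 21.5),
    (95.0, 6.0, 11.7, 0.2, 0.9, 8.6),
]


def predicted_actv(r_s, w_s, asvc_t_ms):
    """Little's law: average queue length = arrival rate * time spent in queue."""
    return (r_s + w_s) * asvc_t_ms / 1000.0


def avg_io_kb(r_s, w_s, mr_s, mw_s):
    """Average IO size in KB (assumes 1 MB = 1024 KB)."""
    iops = r_s + w_s
    return (mr_s + mw_s) * 1024.0 / iops if iops else 0.0


if __name__ == "__main__":
    for r_s, w_s, mr_s, mw_s, actv, asvc_t in SAMPLES:
        print("actv reported %5.1f  predicted %5.1f  avg IO %6.1f KB" % (
            actv, predicted_actv(r_s, w_s, asvc_t), avg_io_kb(r_s, w_s, mr_s, mw_s)))
```

For every row in the sample the predicted queue length lands within rounding distance of the reported actv, which is the point: queue length, service time and IO rate are not independent signals.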
Universal “rules of thumb” regarding queues are very dangerous. I’ve heard such ridiculous suggestions as anything over 5% busy is a problem. The busy time simply denotes the fraction of time that IO was active… thus 100% busy means you’re doing a lot of IO, not that it’s slow. Oversimplifying the interpretation is as old school as suggesting that a CPU more than 75% busy needs to be upgraded, which is moronic (but sells a lot of servers).
When it comes to iostat you must carefully weigh the number of IOs and the size of IOs together with the service time, and come to a conclusion based on all three. If you’re doing a large streaming write you would expect to see 100% busy but very high throughput. Naturally, the great enemy of storage performance is a random workload that requires a lot of head movement. There the seek times will kill you and things start to back up. In those cases, tune your app… application caching is always a bigger win than adding spindles.
In interactive shared server environments the closest to a rule I’ve ever provided was that, in my experience, active service times (asvc_t) below 30ms are optimal, between 30ms and 100ms are worrisome, and higher than 100ms means that someone out there is probably unhappy. When IOs are regularly taking more than 100ms to complete, it is likely that the next fellow to type “ls” in an uncached directory is going to be pissed off.
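As a rough illustration, that rule of thumb can be written down as a tiny Python helper. The function name and labels are invented here, the thresholds are the experience-based ones from the paragraph above (for interactive shared servers only), and treating exactly 30ms and 100ms as “worrisome” is an assumption.

```python
def classify_asvc_t(asvc_t_ms):
    """Map an active service time (ms) onto the rough bands described above."""
    if asvc_t_ms < 30:
        return "optimal"
    if asvc_t_ms <= 100:
        return "worrisome"
    return "someone is probably unhappy"


if __name__ == "__main__":
    # asvc_t values from the iostat sample earlier in the post
    for value in (52.4, 7.7, 6.5, 21.5, 8.6):
        print("%6.1f ms -> %s" % (value, classify_asvc_t(value)))
```

Note that this looks only at service time; it is a starting point for a conversation, not a verdict, and it still has to be read together with IO rate and IO size.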
Now, all this gets more complicated with ZFS in the mix. You have probably heard that ZFS is transactional and as a result is always consistent on disk. But few really spend time thinking that through. Similar to an OLTP database, transactions are created, work is done, and then finally committed. This commit sends the transaction (tx) into a transaction group (txg) for “sync” to disk. At any given time there are 3 transaction groups: one in an open state accepting transactions, one in a quiescing state getting ready for sync, and one being sync’ed to disk. (For the sake of simplicity I’ll leave O_DSYNC synchronous writes and the ZIL out of this discussion, for now.)
Between these transaction groups gathering writes in memory for orderly flush to disk and the ARC filesystem cache, most of your run-of-the-mill IO is served to and from memory. I have a great many machines servicing more than 100,000 read ops per second without a single resulting physical read IO. ZFS efficiency is truly incredible. As a result, the physical metrics can have little to nothing to do with the actual user experience, making typical tuning based on iostat highly suspect, if not entirely meaningless.
So, first things first. On a ZFS system never look at iostat alone. Always open 2 terminals side by side, and in one terminal watch fsstat zfs 1 and in the other watch iostat -xn 1 (or 10 seconds, whatever you’re happy with). By watching both of these you’ll get a better idea of what’s really going on, and I expect that you’ll be impressed by what you see.
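If you want a single number out of that side-by-side comparison, a crude one is the fraction of logical reads that never became physical reads. The Python sketch below is only an approximation under stated assumptions: the function name is hypothetical, you supply the two rates yourself (logical read ops from fsstat, physical r/s from iostat over the same interval), and it ignores the fact that one logical read can trigger several physical reads (or prefetch).

```python
def fraction_served_from_memory(logical_reads_per_s, physical_reads_per_s):
    """Rough share of logical reads that did not reach the disks.

    Returns None when there were no logical reads in the interval.
    """
    if logical_reads_per_s <= 0:
        return None
    # Clamp so that prefetch or multi-block reads cannot push the result below 0.
    missed = min(physical_reads_per_s, logical_reads_per_s)
    return 1.0 - missed / logical_reads_per_s


if __name__ == "__main__":
    # The case described in the post: 100,000 read ops/s, no physical reads.
    print(fraction_served_from_memory(100000, 0))
```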
As for async writes, what I really would like to see is how these transaction groups are doing. How often are transaction groups sync’ing to disk, how much are they sync’ing, and how much time is there in between? Prior to snv_87, transaction groups would flush upon fullness (1/8th of system memory) or when the txg_time tunable, which defaulted to 5 seconds, expired. As a result, if you’ve looked at a system running something earlier than snv_87 and saw IO “spike” every 5 seconds, this is why… it’s normal and healthy. In snv_87 a new ZFS Write Throttle was introduced, and among the changes the sync timer got pushed out to 30 seconds. So, likewise, if you have a box post-87 that “spikes” every 30 seconds you have a very healthy system.
Knowing this is all well and good, but I’d like to see it. After spending a good amount of time in the code I realized that spa_sync() is the function to watch. It’s what’s responsible for actually sync’ing the txg to disk (God bless DTrace stack[] aggregations!). With this knowledge I wrote up a D script that I was proud of, but, of course, now that I knew what to look for, found that Roch wrote up essentially the same thing 2 years ago… nevertheless, I made a tweak and here it is:
#!/usr/sbin/dtrace -qs

/*
 * spa_sync.d - ROCH http://blogs.sun.com/roch/entry/128k_suffice
 * mods by benr
 *
 * Measure I/O throughput as generated by spa_sync.
 * Between the spa_sync entry and return probes,
 * count all I/Os and bytes going through bdev_strategy.
 * This is a lower bound on what the device can do since
 * some aspects of spa_sync are non-concurrent I/Os.
 */

BEGIN
{
        tt = 0;         /* timestamp of the spa_sync in progress (0 = none) */
        b = 0;          /* byte count */
        cnt = 0;        /* I/O count */
}

spa_sync:entry
/(self->t == 0) && (tt == 0)/
{
        b = 0;          /* reset the I/O byte count */
        cnt = 0;        /* reset the I/O count */
        tt = timestamp;
        self->t = 1;
        printf("%Y", walltimestamp);
}

spa_sync:return
/(self->t == 1) && (tt != 0)/
{
        this->delta = (timestamp - tt);         /* nanoseconds in spa_sync */
        this->cnt = (cnt == 0) ? 1 : cnt;       /* avoid divide by 0 */
        printf("\t: %d MB; %d ms of spa_sync; avg sz : %d KB; throughput %d MB/s\n",
            b / 1048576,
            this->delta / 1000000,
            b / this->cnt / 1024,
            (b * 1000000000) / (this->delta * 1048576));
        tt = 0;
        self->t = 0;
}

/* We only count I/O issued during an spa_sync */
bdev_strategy:entry
/tt != 0/
{
        cnt++;
        b += (args[0]->b_bcount);
}
Here is a sample output, pre-snv_87:
# ./spa_sync.d
2009 Jan 23 06:12:28 : 44 MB; 743 ms of spa_sync; avg sz : 68 KB; throughput 59 MB/s
2009 Jan 23 06:12:33 : 81 MB; 1716 ms of spa_sync; avg sz : 79 KB; throughput 47 MB/s
2009 Jan 23 06:12:38 : 45 MB; 736 ms of spa_sync; avg sz : 65 KB; throughput 61 MB/s
2009 Jan 23 06:12:43 : 41 MB; 700 ms of spa_sync; avg sz : 67 KB; throughput 59 MB/s
2009 Jan 23 06:12:48 : 56 MB; 1287 ms of spa_sync; avg sz : 63 KB; throughput 43 MB/s
2009 Jan 23 06:12:53 : 35 MB; 668 ms of spa_sync; avg sz : 65 KB; throughput 52 MB/s
2009 Jan 23 06:12:58 : 61 MB; 1147 ms of spa_sync; avg sz : 62 KB; throughput 53 MB/s
2009 Jan 23 06:13:03 : 41 MB; 624 ms of spa_sync; avg sz : 60 KB; throughput 67 MB/s
2009 Jan 23 06:13:08 : 37 MB; 658 ms of spa_sync; avg sz : 60 KB; throughput 56 MB/s
2009 Jan 23 06:13:13 : 59 MB; 1035 ms of spa_sync; avg sz : 68 KB; throughput 57 MB/s
^C
Notice it hitting the 5 second mark nicely. This output is significantly more telling and encouraging than simply looking at iostat alone.
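To answer the “how often” and “how much time in between” questions from saved output, here is a small Python sketch that parses lines in the format printed by spa_sync.d and reports the interval between syncs plus the share of each interval spent inside spa_sync. The function names and the regular expression are my own for illustration; the data is the sample output above.

```python
import re
from datetime import datetime

SAMPLE = """\
2009 Jan 23 06:12:28 : 44 MB; 743 ms of spa_sync; avg sz : 68 KB; throughput 59 MB/s
2009 Jan 23 06:12:33 : 81 MB; 1716 ms of spa_sync; avg sz : 79 KB; throughput 47 MB/s
2009 Jan 23 06:12:38 : 45 MB; 736 ms of spa_sync; avg sz : 65 KB; throughput 61 MB/s
2009 Jan 23 06:12:43 : 41 MB; 700 ms of spa_sync; avg sz : 67 KB; throughput 59 MB/s
2009 Jan 23 06:12:48 : 56 MB; 1287 ms of spa_sync; avg sz : 63 KB; throughput 43 MB/s
2009 Jan 23 06:12:53 : 35 MB; 668 ms of spa_sync; avg sz : 65 KB; throughput 52 MB/s
2009 Jan 23 06:12:58 : 61 MB; 1147 ms of spa_sync; avg sz : 62 KB; throughput 53 MB/s
2009 Jan 23 06:13:03 : 41 MB; 624 ms of spa_sync; avg sz : 60 KB; throughput 67 MB/s
2009 Jan 23 06:13:08 : 37 MB; 658 ms of spa_sync; avg sz : 60 KB; throughput 56 MB/s
2009 Jan 23 06:13:13 : 59 MB; 1035 ms of spa_sync; avg sz : 68 KB; throughput 57 MB/s
"""

# Timestamp, MB written, ms spent in spa_sync (the remaining fields are skipped).
LINE_RE = re.compile(
    r"^(\d{4} \w{3} +\d+ \d\d:\d\d:\d\d)\s*: (\d+) MB; (\d+) ms of spa_sync;"
)


def parse(text):
    """Return a list of (start_time, mb_written, sync_ms) tuples."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip the shell prompt, ^C, blank lines
        stamp = " ".join(match.group(1).split())  # normalize padding
        when = datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
        rows.append((when, int(match.group(2)), int(match.group(3))))
    return rows


def sync_report(rows):
    """For each sync but the last: (seconds until next sync, % of that time syncing)."""
    report = []
    for prev, cur in zip(rows, rows[1:]):
        interval_s = (cur[0] - prev[0]).total_seconds()
        busy_pct = 100.0 * prev[2] / (interval_s * 1000.0)
        report.append((interval_s, busy_pct))
    return report


if __name__ == "__main__":
    for interval_s, busy_pct in sync_report(parse(SAMPLE)):
        print("next sync after %.0f s; %.1f%% of the interval spent in spa_sync"
              % (interval_s, busy_pct))
```

On this sample every interval is 5 seconds and the disks spend well under half of each interval actually sync’ing, which is a very different picture from the %b column in iostat.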
This of course also reminds us… Roch Rocks.
I was surprised to see that the DTrace script completely hung my Thumper, with NexentaStor installed on it, after running for 2 minutes.
I verified with the just-released v1.1.5 of the NexentaStor software: it runs OK.
The examples you give of ZFS, are in fact perfect for illustrating this point.