One of the great mysteries of ZFS is prefetch. Failing to understand it, how it works, and what it intends to do for you can cause a lot of confusion, so here we’ll dig our fingers into the subject. The first thing to understand is that “ZFS Prefetch” may refer to file-level prefetch and/or the virtual device read-ahead cache; we’ll discuss both here.
VDev Read-Ahead Cache (SPA)
When reading data from spinning media, the bulk of the service time is spent positioning the disk head. Once the head is in place, actually reading data is super speedy. To capitalize on this reality, ZFS’s vdev_cache is a virtual device read ahead cache. There are 3 tunables:
- zfs_vdev_cache_max: Defaults to 16KB; Reads smaller than this size will be inflated to zfs_vdev_cache_bshift.
- zfs_vdev_cache_size: Defaults to 10MB; Total size of the per-disk cache
- zfs_vdev_cache_bshift: Defaults to 16; this is a bit shift value, so 16 represents 64K. If you reduce the value to 13 it represents 8K.
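To make the bit-shift arithmetic concrete, here is a small Python sketch. It is a toy model of the tunables described above, not the kernel code: the function names, the choice to align the inflated read down to a cache-block boundary, and the pass-through of small reads that straddle a block boundary are all assumptions made for illustration.

```python
# Toy model of vdev_cache read inflation (illustrative only, not kernel code).
ZFS_VDEV_CACHE_MAX = 16 * 1024    # default zfs_vdev_cache_max (16KB)
ZFS_VDEV_CACHE_BSHIFT = 16        # default zfs_vdev_cache_bshift


def block_size(bshift):
    """Convert a bit-shift value into a size in bytes (16 -> 64K, 13 -> 8K)."""
    return 1 << bshift


def inflate_read(offset, size,
                 cache_max=ZFS_VDEV_CACHE_MAX,
                 bshift=ZFS_VDEV_CACHE_BSHIFT):
    """Return the (offset, size) this model would actually issue to disk."""
    bsize = block_size(bshift)
    if size > cache_max:
        # Large read: not inflated in this model.
        return offset, size
    if (offset % bsize) + size > bsize:
        # Assumption: a small read that straddles two cache blocks
        # is passed through untouched.
        return offset, size
    # Small read: round down to a block boundary and fetch the whole block.
    aligned = offset & ~(bsize - 1)
    return aligned, bsize


if __name__ == "__main__":
    print(block_size(16), block_size(13))
    print(inflate_read(70000, 4096))             # small read, default bshift
    print(inflate_read(66000, 4096, bshift=13))  # same idea with 8K blocks
```

With the defaults, a 4K read becomes a single aligned 64K read; with bshift set to 13 it becomes an aligned 8K read instead.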
So the idea is that if you go to all the trouble of moving the disk head(s) around to get some data, you might as well make it worth your while. If it’s a large I/O it’s just put into the cache until someone comes back for it, but if it’s a small I/O, just grab a bit more data while you’re already there and cache it just in case. Due to this design, even if you don’t come back for that data later, you aren’t losing much if anything… you already had to accept the latency for the I/O anyway. The data is stored in a per-vdev (typically a physical disk) cache of 10MB. Unlike ARC, this is an old-school LRU (Least Recently Used) cache which just rolls data through. This cache applies only to reads; it’s a read-ahead cache, not a write cache (if you want that, see ZIL).
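The “old-school LRU” behaviour can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the real vdev_cache: the class name `VdevCacheModel` is hypothetical, only block offsets are tracked (no data), and every miss is assumed to fill exactly one cache block.

```python
from collections import OrderedDict


class VdevCacheModel:
    """Toy per-vdev LRU read-ahead cache (hypothetical model, not ZFS code)."""

    def __init__(self, size=10 * 1024 * 1024, bshift=16):
        self.bsize = 1 << bshift               # 64K blocks by default
        self.max_entries = size // self.bsize  # 10MB / 64K = 160 entries
        self.entries = OrderedDict()           # aligned offset -> True, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, offset):
        """Return True on a cache hit, False on a miss (which fills the cache)."""
        key = offset & ~(self.bsize - 1)
        if key in self.entries:
            self.entries.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(self.entries) >= self.max_entries:
            self.entries.popitem(last=False)   # evict the least recently used block
        self.entries[key] = True
        return False


if __name__ == "__main__":
    cache = VdevCacheModel()
    cache.read(0)       # miss: fills the first 64K block
    cache.read(4096)    # hit: same 64K block
    print(cache.max_entries, cache.hits, cache.misses)
```

The model also shows why the cache “rolls over” quickly: with the defaults there are only 160 slots per disk, so a stream of unrelated reads pushes older blocks out fast.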
If you dig around in the code, this read-ahead cache is implemented in the vdev_cache_* functions. Here is a sample dtrace script to watch it in action:
#!/usr/sbin/dtrace -qs
/*
 * Used for monitoring vdev_cache (per-disk block prefetch)
 * vdev_cache_read(); Returns 0 on cache hit, errno on miss.
 * vdev_cache_allocate(); Creates new cache entry in LRU
 * vdev_cache_write(); Update cache entry on write
 * vdev_cache_evict(); Evict entry to make room for new...
 * vdev_cache_purge(); Purge cache.
 */
fbt:zfs:vdev_cache_read:entry
{
        /* Remember the request details for the matching return probe. */
        self->io_offset = args[0]->io_offset;
        self->io_size = args[0]->io_size;
        self->vdev_guid = args[0]->io_vd->vdev_guid;
        self->start = timestamp;
}
fbt:zfs:vdev_cache_read:return
/self->io_offset/
{
        /* args[1] is the return value: 0 means the read was served from cache. */
        printf("%s: %d read %d bytes at offset %d: %s\n", probefunc, self->vdev_guid, self->io_size, self->io_offset, args[1] == 0 ? "HIT" : "MISS");
        /* Clear thread-local variables so they are not reused or leaked. */
        self->io_offset = 0;
        self->io_size = 0;
        self->vdev_guid = 0;
        self->start = 0;
}
Here’s what it looks like when it runs. Please note the long int is the GUID (Global Unique Identifier) of the disk in question, useful for figuring out whether I/Os are going to the same disk or to multiple disks:
vdev_cache_read: 14901677279437128904 read 131072 bytes at offset 138394293760: MISS
vdev_cache_read: 14901677279437128904 read 131072 bytes at offset 138394555904: MISS
vdev_cache_read: 16158491400258519295 read 65536 bytes at offset 138394489856: MISS
vdev_cache_read: 16158491400258519295 read 129536 bytes at offset 138394948608: MISS
vdev_cache_read: 16158491400258519295 read 84992 bytes at offset 138395151360: MISS
vdev_cache_read: 16158491400258519295 read 40960 bytes at offset 138395281408: MISS
vdev_cache_read: 14901677279437128904 read 131072 bytes at offset 138394686976: MISS
Please note, I offer the above script for those looking to better understand the internals of vdev_cache, not as a production tool.
Now here’s the kicker… once upon a time the read-ahead cache cached all data blocks, but the performance hit was such that almost all I/O got pushed to 64K and the cache rolled over way too fast. So in Nevada Build 70 (snv_70) it was changed to only cache on meta-data requests. In the ZIO pipeline, one of the stages is pulling the block pointer, at which point it will see if it’s meta-data, and if not it sets a DONT_CACHE flag.
Along with the snv_70 changes, kstats were added. The kstats are aggregated for all vdev_caches, so you don’t get per disk granularity, but they can be handy (or at least interesting) observability points:
benr@quadra ~$ kstat -p zfs:0:vdev_cache_stats
zfs:0:vdev_cache_stats:class misc
zfs:0:vdev_cache_stats:crtime 23.410498211
zfs:0:vdev_cache_stats:delegations 9960
zfs:0:vdev_cache_stats:hits 12725
zfs:0:vdev_cache_stats:misses 20865
zfs:0:vdev_cache_stats:snaptime 886853.023016843
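The hit and miss counters reported by kstat are easier to reason about as a ratio. The Python sketch below parses `kstat -p` style text and computes hits / (hits + misses). The helper names are hypothetical, the sample text is simply the output shown in this post, and leaving the delegations counter out of the ratio is a simplifying assumption.

```python
# Sketch: compute a vdev_cache hit ratio from `kstat -p zfs:0:vdev_cache_stats` text.
SAMPLE = """\
zfs:0:vdev_cache_stats:class misc
zfs:0:vdev_cache_stats:crtime 23.410498211
zfs:0:vdev_cache_stats:delegations 9960
zfs:0:vdev_cache_stats:hits 12725
zfs:0:vdev_cache_stats:misses 20865
zfs:0:vdev_cache_stats:snaptime 886853.023016843
"""


def parse_kstat(text):
    """Map the last component of each kstat name to its (string) value."""
    stats = {}
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, value = parts
        stats[name.rsplit(":", 1)[-1]] = value
    return stats


def hit_ratio(stats):
    """Fraction of vdev_cache lookups that were hits (delegations ignored)."""
    hits = int(stats["hits"])
    misses = int(stats["misses"])
    total = hits + misses
    return hits / total if total else 0.0


if __name__ == "__main__":
    print(f"vdev_cache hit ratio: {hit_ratio(parse_kstat(SAMPLE)):.1%}")
```

On a live system you could feed the function the captured output of the kstat command instead of the sample string.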
So, you can tune the read-ahead cache, but should you? The answer is almost universally no. Of the 3 values above, the only one I’ve seen tuned is bshift, changing it from 16 to 13, thus only inflating reads to 8K. This should only be done if I/O is very expensive and extremely random and if you’re running snv_70 or older. NFS or mail servers are examples, or if you’re using iSCSI over a slow link.
File-Level Prefetch (DMU)
While the vdev read-ahead cache is implemented within the ZFS IO pipeline in the SPA layer, file level prefetch occurs up in the DMU layer and feeds data to the ARC (Adaptive Replacement Cache, the pool wide read cache). When you hear about ZFS’s intelligent prefetch, this is what they are talking about.
The prefetch is pretty simple. In a nutshell, if you read a block we prefetch the next block. If you read that block we just prefetched, we go and grab 2 blocks. Got those blocks? Grab 3. It’s more intelligent than this, but you get the idea. This can go all the way up to prefetching 256 blocks at a time. This behavior really benefits sequential read streams, where the data is being prefetched into the ARC before you actually need it, which means more efficient disk I/O and faster response. It actually gets even more complex than this, because it can determine if multiple co-linear streams are being prefetched and collapse them into a single strided prefetch. If you’ve seen the ZFS: The Last Word in File Systems presentation, this is what’s happening when multiple users watch the Matrix.
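The ramp-up described above can be modelled in a few lines of Python. This is deliberately the “nutshell” version only: the function name `simulate` is hypothetical, a single stream is tracked, any non-sequential read is assumed to reset the depth to 1, and none of the real multi-stream or strided detection is attempted.

```python
# Toy model of the prefetch ramp: 1 block, then 2, then 3, ... capped at 256.
MAX_PREFETCH_BLOCKS = 256


def simulate(block_ids):
    """Return the number of blocks this model would prefetch after each read."""
    depths = []
    depth = 0
    prev = None
    for blockid in block_ids:
        sequential = prev is not None and blockid == prev + 1
        if sequential:
            # The stream keeps going: prefetch one more block than last time.
            depth = min(depth + 1, MAX_PREFETCH_BLOCKS)
        else:
            # Assumption: a random jump starts over with a single block.
            depth = 1
        depths.append(depth)
        prev = blockid
    return depths


if __name__ == "__main__":
    print(simulate([10, 11, 12, 50, 51]))   # sequential run, a jump, then sequential again
    print(simulate(range(300))[-1])         # a long sequential stream hits the cap
```

Because the depth only grows while reads stay sequential, a purely random workload never gets past one block in this model, which is the intuition behind “if it’s not useful it won’t ever prefetch much”.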
Tuning file-level prefetch is straightforward… you can leave it on (default) or turn it off via zfs_prefetch_disable. Should you? No, not unless your workload is extremely random, and always random. Because ZFS prefetches intelligently, the overhead is minimal: if it’s not useful, it won’t ever prefetch much. If you’re doing random access to 8K or smaller files on iSCSI, you should benchmark with it turned off; otherwise let it be. In my experience disabling prefetch, even on heavily loaded systems, had little or no impact on reducing random physical disk I/O. When in doubt, keep it enabled.
Can you monitor file level prefetch? Of course! Look for the dmu_zfetch() function, which is called by dbuf_read() and in turn calls the various dmu_zfetch_*() functions and, eventually, dbuf_prefetch(). Here is an example script:
#!/usr/sbin/dtrace -qs
/* This should look at the arguments to dmu_zfetch_fetch */
fbt:zfs:dmu_zfetch_fetch:entry
{
        self->object = args[0]->dn_object;
        self->blockid = arg1;
        self->blocks = arg2;
        self->start = timestamp;
        /* arg1 is the starting block id, arg2 the number of blocks requested. */
        printf("Prefetching object %d from dataset %s : %d blocks for blockid %d\n",
            args[0]->dn_objset->os_dsl_dataset->ds_dir->dd_object,
            args[0]->dn_objset->os_dsl_dataset->ds_dir->dd_myname,
            arg2, arg1);
}
fbt:zfs:dbuf_prefetch:entry
{
        /* One line per block actually being prefetched. */
        printf(" zfetching object %d block %d (%d bytes) from dataset %s\n",
            args[0]->dn_objset->os_dsl_dataset->ds_dir->dd_object,
            arg1,
            args[0]->dn_datablksz,
            args[0]->dn_objset->os_dsl_dataset->ds_dir->dd_myname);
}
This script will watch dmu_zfetch_fetch request a batch of prefetch blocks, and then dbuf_prefetch, which actually goes and gets each of them:
Prefetching object 53 from dataset benr : 2 blocks for blockid 41
 zfetching object 53 block 41 (131072 bytes) from dataset benr
 zfetching object 53 block 42 (131072 bytes) from dataset benr
Prefetching object 53 from dataset benr : 3 blocks for blockid 43
 zfetching object 53 block 43 (131072 bytes) from dataset benr
 zfetching object 53 block 44 (131072 bytes) from dataset benr
 zfetching object 53 block 45 (131072 bytes) from dataset benr
Prefetching object 53 from dataset benr : 4 blocks for blockid 46
 zfetching object 53 block 46 (131072 bytes) from dataset benr
 zfetching object 53 block 47 (131072 bytes) from dataset benr
 zfetching object 53 block 48 (131072 bytes) from dataset benr
 zfetching object 53 block 49 (131072 bytes) from dataset benr
Prefetching object 53 from dataset benr : 4 blocks for blockid 50
 zfetching object 53 block 50 (131072 bytes) from dataset benr
 zfetching object 53 block 51 (131072 bytes) from dataset benr
 zfetching object 53 block 52 (131072 bytes) from dataset benr
 zfetching object 53 block 53 (131072 bytes) from dataset benr
Prefetching object 53 from dataset benr : 5 blocks for blockid 54
 zfetching object 53 block 54 (131072 bytes) from dataset benr
 zfetching object 53 block 55 (131072 bytes) from dataset benr
 zfetching object 53 block 56 (131072 bytes) from dataset benr
 zfetching object 53 block 57 (131072 bytes) from dataset benr
 zfetching object 53 block 58 (131072 bytes) from dataset benr
...
Both these forms of prefetch add up to a very good way to maximize I/O activity. Are they perfect for every situation? No. The most common complaint tends to come from databases, which strictly work in fixed 8K blocks and manage their own caches very effectively. If you think you have such a case, file-level prefetch can be tuned on the fly using mdb. I encourage you to play with it and see what is best for your workload, but when in doubt leave it enabled. In common usage, both vdev read-ahead and file-level prefetch can help make sure the I/O you may do has already been done and is sitting ready for you in DRAM.
The DTrace script mentioning fbt:zfs:dmu_zfetch_fetch:entry does not compile on Solaris 10:
dtrace: failed to compile script ./zfs_dmu_zfetch.d: line 7: operator -> cannot be applied to pointer to type “int”; must be applied to a struct or union pointer
and both DTrace scripts have \n changed to just ‘n’.
Hi, Ben.
First of all, thanks for your blog in general. Much of what I know about
the under-the-hood workings of Solaris (not that it’s a lot at all) was
explained to me here.
Concerning the “n”, I think Bob meant “backslash-n”, and apparently
it is stripped off the HTML markup by the blog. I hit this too, now.
What kernel tunable parameters control the file-level prefetch?
My snv_114 system shows this behavior which puzzles me both
because it is fetching the same blocks (dunno, maybe it writes
them in-between) and the block size is smaller than in your samples:
zfetching object 4140 block 12 (16384 bytes) from dataset alf_3
zfetching object 4140 block 12 (16384 bytes) from dataset alf_3
zfetching object 4140 block 15 (16384 bytes) from dataset alf_3
zfetching object 4140 block 15 (16384 bytes) from dataset alf_3
zfetching object 4140 block 13 (16384 bytes) from dataset alf_3
zfetching object 4140 block 12 (16384 bytes) from dataset alf_3
or
zfetching object 319 block 820 (16384 bytes) from dataset atlas
zfetching object 319 block 820 (16384 bytes) from dataset atlas
zfetching object 319 block 820 (16384 bytes) from dataset atlas
zfetching object 319 block 819 (16384 bytes) from dataset atlas
zfetching object 319 block 821 (16384 bytes) from dataset atlas
zfetching object 319 block 822 (16384 bytes) from dataset atlas
zfetching object 319 block 820 (16384 bytes) from dataset atlas
zfetching object 319 block 821 (16384 bytes) from dataset atlas
zfetching object 319 block 821 (16384 bytes) from dataset atlas
Thanks again,
//Jim Klimov