Understanding ZFS: Prefetch

One of the great mysteries of ZFS is prefetch. Failing to understand the prefetch mechanisms, how they work, and what they intend to do for you can cause a lot of confusion, so here we’ll dig our fingers into the subject. The first thing to understand is that “ZFS Prefetch” may refer to file-level prefetch and/or the virtual device read-ahead cache; we’ll discuss both here.

VDev Read-Ahead Cache (SPA)

When reading data from spinning media, the bulk of the service time is spent positioning the disk head. Once the head is in place, actually reading data is super speedy. To capitalize on this reality, ZFS provides vdev_cache, a virtual device read-ahead cache. There are 3 tunables:

  • zfs_vdev_cache_max: Defaults to 16KB; reads smaller than this size will be inflated to 1 << zfs_vdev_cache_bshift bytes (64K by default).
  • zfs_vdev_cache_size: Defaults to 10MB; Total size of the per-disk cache
  • zfs_vdev_cache_bshift: Defaults to 16; this is a bit shift value, so 16 represents 64K. If you reduce the value to 13 it represents 8K.
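
To make the bit-shift arithmetic concrete, here is a small Python sketch. It is not ZFS source code: the function name inflated_read is made up for illustration, and it assumes a small read is widened to the aligned 1 << bshift region that contains it, while larger reads and reads that straddle two regions are left alone.

```python
# Illustrative arithmetic only; the constants mirror the tunables listed
# above, but inflated_read() is a hypothetical helper, not a ZFS function.
ZFS_VDEV_CACHE_MAX = 16 * 1024   # reads smaller than this are inflated
ZFS_VDEV_CACHE_BSHIFT = 16       # 1 << 16 = 64K


def inflated_read(offset, size, bshift=ZFS_VDEV_CACHE_BSHIFT,
                  cache_max=ZFS_VDEV_CACHE_MAX):
    """Return the (offset, size) a read would be issued as in this sketch."""
    if size >= cache_max:
        return offset, size               # not a small read: unchanged
    block = 1 << bshift
    aligned = offset & ~(block - 1)       # round down to a region boundary
    if offset + size > aligned + block:
        return offset, size               # straddles two regions: unchanged
    return aligned, block


print(1 << 16)                                 # 65536
print(1 << 13)                                 # 8192
print(inflated_read(70000, 2048))              # (65536, 65536)
print(inflated_read(70000, 2048, bshift=13))   # (65536, 8192)
```

With the default shift a 2K read turns into a 64K read; with the shift lowered to 13 the same request only costs 8K.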

So the idea is that if you go to all the trouble of moving the disk head(s) around to get some data, you might as well make it worth your while. If it’s a large I/O it’s just put into the cache until someone comes back for it, but if it’s a small I/O, just grab a bit more data while you’re already there and cache it just in case. Due to this design, even if you don’t come back for that data later, you aren’t losing much if anything… you already had to accept the latency for the I/O anyway. The data is stored in a per-vdev (typically a physical disk) cache of 10MB. Unlike ARC, this is an old-school LRU (Least Recently Used) cache which just rolls data through. This cache applies only to reads; it’s a read-ahead cache, not a write cache (if you want that, see ZIL).
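
The following Python toy shows how an “old-school” LRU read-ahead cache behaves. It is a sketch under stated assumptions, not the vdev_cache implementation: the class name ReadAheadCache is hypothetical, only the small-read path is modelled, and every entry is a fixed 64K region.

```python
from collections import OrderedDict


class ReadAheadCache:
    """Toy per-disk read-ahead cache with plain LRU eviction.

    Assumptions for illustration: entries are fixed 1 << bshift regions
    and the cache holds `size` bytes in total. Only small reads are
    modelled; this is not how ZFS structures its code.
    """

    def __init__(self, size=10 * 1024 * 1024, bshift=16):
        self.block = 1 << bshift
        self.capacity = size // self.block   # number of regions that fit
        self.entries = OrderedDict()         # region start -> True, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, offset, size):
        """Return 'HIT' or 'MISS' for one small read request."""
        start = offset & ~(self.block - 1)   # region containing the offset
        if start in self.entries:
            self.entries.move_to_end(start)  # now the most recently used
            self.hits += 1
            return "HIT"
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False) # evict the least recently used
        self.entries[start] = True           # cache the whole region
        self.misses += 1
        return "MISS"


cache = ReadAheadCache()
print(cache.read(1_000_000, 4096))   # MISS: the head moves, the region is cached
print(cache.read(1_004_096, 4096))   # HIT: the neighbouring data is already in DRAM
print(cache.capacity)                # 160 regions of 64K fit in 10MB
```

The second read costs nothing extra because the first one already paid for the head movement, which is the whole point of the design.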

If you dig around in the code, this read-ahead cache is implemented in the vdev_cache_* functions. Here is a sample dtrace script to watch it in action:

#!/usr/sbin/dtrace -qs

/*
 * Used for monitoring vdev_cache (per-disk block prefetch)
 *   vdev_cache_read(); Returns 0 on cache hit, errno on miss.
 *   vdev_cache_allocate(); Creates new cache entry in LRU
 *   vdev_cache_write(); Update cache entry on write
 *   vdev_cache_evict(); Evict entry to make room for new...
 *   vdev_cache_purge(); Purge cache.
 */

fbt:zfs:vdev_cache_read:entry
{
	self->io_offset = args[0]->io_offset;
	self->io_size = args[0]->io_size;
	self->vdev_guid = args[0]->io_vd->vdev_guid;
	self->start = timestamp;
}

/* Predicate on self->start so that reads at offset 0 are reported too. */
fbt:zfs:vdev_cache_read:return
/self->start/
{
	printf("%s: %d read %d bytes at offset %d: %s\n", probefunc, self->vdev_guid, self->io_size, self->io_offset, args[1] == 0 ? "HIT" : "MISS");

	/* Release the thread-local variables. */
	self->io_offset = 0;
	self->io_size = 0;
	self->vdev_guid = 0;
	self->start = 0;
}

Here’s what it looks like when it runs. Please note the long int is the GUID (Globally Unique Identifier) of the disk in question, useful for figuring out whether I/Os are going to the same disk or to multiple disks:

vdev_cache_read: 14901677279437128904 read 131072 bytes at offset 138394293760: MISS
vdev_cache_read: 14901677279437128904 read 131072 bytes at offset 138394555904: MISS
vdev_cache_read: 16158491400258519295 read 65536 bytes at offset 138394489856: MISS
vdev_cache_read: 16158491400258519295 read 129536 bytes at offset 138394948608: MISS
vdev_cache_read: 16158491400258519295 read 84992 bytes at offset 138395151360: MISS
vdev_cache_read: 16158491400258519295 read 40960 bytes at offset 138395281408: MISS
vdev_cache_read: 14901677279437128904 read 131072 bytes at offset 138394686976: MISS
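
If you capture that output to a file, a few lines of Python can group it by GUID to show how the reads spread across disks. This is only a sketch: the summarize helper is hypothetical, and it assumes the exact line format printed by the script above.

```python
import re
from collections import Counter

# Matches one line of output from the vdev_cache_read script above.
LINE = re.compile(
    r"vdev_cache_read: (?P<guid>\d+) read (?P<size>\d+) bytes "
    r"at offset (?P<offset>\d+): (?P<result>HIT|MISS)"
)


def summarize(lines):
    """Return {guid: (reads, bytes, hits)} for the given output lines."""
    reads, nbytes, hits = Counter(), Counter(), Counter()
    for line in lines:
        m = LINE.search(line)
        if m is None:
            continue                      # skip anything that is not script output
        guid = m["guid"]
        reads[guid] += 1
        nbytes[guid] += int(m["size"])
        hits[guid] += int(m["result"] == "HIT")
    return {g: (reads[g], nbytes[g], hits[g]) for g in reads}


sample = """\
vdev_cache_read: 14901677279437128904 read 131072 bytes at offset 138394293760: MISS
vdev_cache_read: 14901677279437128904 read 131072 bytes at offset 138394555904: MISS
vdev_cache_read: 16158491400258519295 read 65536 bytes at offset 138394489856: MISS
vdev_cache_read: 16158491400258519295 read 129536 bytes at offset 138394948608: MISS
vdev_cache_read: 16158491400258519295 read 84992 bytes at offset 138395151360: MISS
vdev_cache_read: 16158491400258519295 read 40960 bytes at offset 138395281408: MISS
vdev_cache_read: 14901677279437128904 read 131072 bytes at offset 138394686976: MISS
""".splitlines()

for guid, (n, total, hit) in summarize(sample).items():
    print(f"{guid}: {n} reads, {total} bytes, {hit} hits")
```

For the sample above this reports three reads on one disk and four on the other, all of them misses.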

Please note, I offer the above script for those looking to better understand the internals of vdev_cache, not as a production tool.

Now here’s the kicker… once upon a time the read-ahead cache cached all data blocks, but the performance hit was such that almost all I/O got pushed to 64K and the cache rolled over way too fast. So in Nevada Build 70 (snv_70) it was changed to cache only on meta-data requests. In the ZIO pipeline, one of the stages is pulling the block pointer, at which point it will see if it’s meta-data, and if not it sets a DONT_CACHE flag.

Along with the snv_70 changes, kstats were added. The kstats are aggregated for all vdev_caches, so you don’t get per-disk granularity, but they can be handy (or at least interesting) observability points:

benr@quadra ~$ kstat -p zfs:0:vdev_cache_stats
zfs:0:vdev_cache_stats:class    misc
zfs:0:vdev_cache_stats:crtime   23.410498211
zfs:0:vdev_cache_stats:delegations      9960
zfs:0:vdev_cache_stats:hits     12725
zfs:0:vdev_cache_stats:misses   20865
zfs:0:vdev_cache_stats:snaptime 886853.023016843
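
A hit ratio is easy to derive from those counters. The Python sketch below assumes the two-column `kstat -p` layout shown above; the helper names parse_kstat and hit_ratio are made up for illustration, and the delegations counter is deliberately left out of the calculation.

```python
def parse_kstat(text):
    """Turn `kstat -p` lines into a {statistic: value-string} dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue                          # not a "name value" line
        name = parts[0].rsplit(":", 1)[-1]    # text after the last colon
        stats[name] = parts[1]
    return stats


def hit_ratio(stats):
    """hits / (hits + misses), or 0.0 when nothing has been counted yet."""
    hits, misses = int(stats["hits"]), int(stats["misses"])
    total = hits + misses
    return hits / total if total else 0.0


sample = """\
zfs:0:vdev_cache_stats:class    misc
zfs:0:vdev_cache_stats:crtime   23.410498211
zfs:0:vdev_cache_stats:delegations      9960
zfs:0:vdev_cache_stats:hits     12725
zfs:0:vdev_cache_stats:misses   20865
zfs:0:vdev_cache_stats:snaptime 886853.023016843
"""

stats = parse_kstat(sample)
print(f"vdev_cache hit ratio: {hit_ratio(stats):.1%}")   # 37.9%
```

Because the kstats are aggregated, this is a pool-wide figure rather than a per-disk one.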

So, you can tune the read-ahead cache, but should you? The answer is almost universally no. Of the 3 values above the only one I’ve seen tuned is bshift, changing from 16 to 13, thus only inflating reads to 8K. This should only be done if I/O is very expensive and extremely random and if you’re running snv_70 or older. NFS or mail servers are examples, as is iSCSI over a slow link.

File-Level Prefetch (DMU)

While the vdev read-ahead cache is implemented within the ZFS I/O pipeline in the SPA layer, file-level prefetch occurs up in the DMU layer and feeds data to the ARC (Adaptive Replacement Cache, the pool-wide read cache). When you hear about ZFS’s intelligent prefetch, this is what people are talking about.

The prefetch is pretty simple: in a nutshell, if you read a block we prefetch the next block. If you read that block we just prefetched, we go and grab 2 blocks. Got those blocks? Grab 3. It’s more intelligent than this, but you get the idea. This can go all the way up to prefetching 256 blocks at a time. This behavior really benefits sequential read streams, where the data is prefetched into the ARC before you actually need it, which means more efficient disk I/O and faster response. It actually gets even more complex than this, because it can determine if multiple co-linear streams are being prefetched and collapse them into a single strided prefetch. If you’ve seen the ZFS: The Last Word in File Systems presentation, this is what’s happening when multiple users watch The Matrix.
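
The ramp-up described above can be sketched as a toy model in Python. This is not the dmu_zfetch algorithm: prefetch_sizes is a hypothetical function, and it assumes the request simply grows by one block for every read that continues the stream and falls back to one block otherwise, capped at 256.

```python
def prefetch_sizes(reads, max_blocks=256):
    """Yield how many blocks a toy prefetcher would request after each read.

    reads: block ids in the order they are read.
    Assumption: a read of the block right after the previous one counts
    as "sequential" and grows the request by one; anything else resets it.
    """
    run = 0
    prev = None
    for blkid in reads:
        if prev is not None and blkid == prev + 1:
            run = min(run + 1, max_blocks)   # stream continues: ramp up
        else:
            run = 1                          # new or broken stream: start over
        prev = blkid
        yield run


print(list(prefetch_sizes(range(6))))            # [1, 2, 3, 4, 5, 6]
print(list(prefetch_sizes([10, 500, 3, 77])))    # [1, 1, 1, 1]
print(max(prefetch_sizes(range(1000))))          # 256
```

The second line of output shows why the overhead stays small on random workloads: the prefetcher never gets past a single block.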

Tuning file-level prefetch is straightforward… you can leave it on (default) or turn it off via zfs_prefetch_disable. Should you? No, not unless your workload is extremely random, and always random. Because ZFS prefetches intelligently, the overhead is minimal: if it’s not useful it won’t ever prefetch much. If you’re doing random access to 8K or smaller files on iSCSI, you should benchmark with it turned off; otherwise let it be. In my experience disabling prefetch, even on heavily loaded systems, had little or no impact on reducing random physical disk I/O. When in doubt, keep it enabled.

Can you monitor file-level prefetch? Of course! Look for dmu_zfetch(), which is called from dbuf_read() and hands off to the various dmu_zfetch_*() functions, which in turn call dbuf_prefetch(). Here is an example script:

#!/usr/sbin/dtrace -qs

/*
 * This looks at the arguments to dmu_zfetch_fetch: arg1 is the starting
 * block id and arg2 is the number of blocks requested.
 * Note: dd_object is the object number of the dataset's directory; use
 * args[0]->dn_object if you want the object number of the file itself.
 */
fbt:zfs:dmu_zfetch_fetch:entry
{
        printf("Prefetching object %d from dataset %s : %d blocks for blockid %d\n",
			args[0]->dn_objset->os_dsl_dataset->ds_dir->dd_object,
			args[0]->dn_objset->os_dsl_dataset->ds_dir->dd_myname,
			arg2, arg1);
}

/* One line per block that is actually prefetched. */
fbt:zfs:dbuf_prefetch:entry
{
	printf("  zfetching object %d block %d (%d bytes) from dataset %s\n",
			args[0]->dn_objset->os_dsl_dataset->ds_dir->dd_object,
			arg1,
			args[0]->dn_datablksz,
			args[0]->dn_objset->os_dsl_dataset->ds_dir->dd_myname  );
}

This script watches dmu_zfetch_fetch request a run of prefetch blocks, and then dbuf_prefetch, which actually goes and gets each one:

Prefetching object 53 from dataset benr : 2 blocks for blockid 41
  zfetching object 53 block 41 (131072 bytes) from dataset benr
  zfetching object 53 block 42 (131072 bytes) from dataset benr
Prefetching object 53 from dataset benr : 3 blocks for blockid 43
  zfetching object 53 block 43 (131072 bytes) from dataset benr
  zfetching object 53 block 44 (131072 bytes) from dataset benr
  zfetching object 53 block 45 (131072 bytes) from dataset benr
Prefetching object 53 from dataset benr : 4 blocks for blockid 46
  zfetching object 53 block 46 (131072 bytes) from dataset benr
  zfetching object 53 block 47 (131072 bytes) from dataset benr
  zfetching object 53 block 48 (131072 bytes) from dataset benr
  zfetching object 53 block 49 (131072 bytes) from dataset benr
Prefetching object 53 from dataset benr : 4 blocks for blockid 50
  zfetching object 53 block 50 (131072 bytes) from dataset benr
  zfetching object 53 block 51 (131072 bytes) from dataset benr
  zfetching object 53 block 52 (131072 bytes) from dataset benr
  zfetching object 53 block 53 (131072 bytes) from dataset benr
Prefetching object 53 from dataset benr : 5 blocks for blockid 54
  zfetching object 53 block 54 (131072 bytes) from dataset benr
  zfetching object 53 block 55 (131072 bytes) from dataset benr
  zfetching object 53 block 56 (131072 bytes) from dataset benr
  zfetching object 53 block 57 (131072 bytes) from dataset benr
  zfetching object 53 block 58 (131072 bytes) from dataset benr
...

Both these forms of prefetch add up to a very good way to maximize I/O activity. Are they perfect for every situation? No. The most common complaint tends to come from databases, which strictly work in fixed 8K blocks and manage their own caches very effectively. If you think you have such a case, file-level prefetch can be tuned on the fly using mdb. I encourage you to play with it and see what is best for your workload, but when in doubt leave it enabled. In common usage, both vdev read-ahead and file-level prefetch can help make sure the I/O you may do has already been done and is sitting ready for you in DRAM.

58 Responses to “Understanding ZFS: Prefetch”

  1. Bob Friesenhahn says:

    The DTrace script mentioning fbt:zfs:dmu_zfetch_fetch:entry does not compile on Solaris 10:
    dtrace: failed to compile script ./zfs_dmu_zfetch.d: line 7: operator -> cannot be applied to pointer to type “int”; must be applied to a struct or union pointer
    and both DTrace scripts have n changed to just ‘n’.

  2. Jim Klimov says:

    Hi, Ben.

    First of all, thanks for your blog in general. Much of what I know about
    the under-the-hood workings of Solaris (not that it’s a lot at all) was
    explained to me here ;)

    Concerning the “n”, I think Bob meant “backslash-n”, and apparently
    it is stripped off the HTML markup by the blog. I hit this too, now ;)

    What kernel tunable parameters control the file-level prefetch?
    My snv_114 system shows this behavior which puzzles me both
    because it is fetching the same blocks (dunno, maybe it writes
    them in-between) and the block size is smaller than in your samples:

    zfetching object 4140 block 12 (16384 bytes) from dataset alf_3
    zfetching object 4140 block 12 (16384 bytes) from dataset alf_3
    zfetching object 4140 block 15 (16384 bytes) from dataset alf_3
    zfetching object 4140 block 15 (16384 bytes) from dataset alf_3
    zfetching object 4140 block 13 (16384 bytes) from dataset alf_3
    zfetching object 4140 block 12 (16384 bytes) from dataset alf_3

    or

    zfetching object 319 block 820 (16384 bytes) from dataset atlas
    zfetching object 319 block 820 (16384 bytes) from dataset atlas
    zfetching object 319 block 820 (16384 bytes) from dataset atlas
    zfetching object 319 block 819 (16384 bytes) from dataset atlas
    zfetching object 319 block 821 (16384 bytes) from dataset atlas
    zfetching object 319 block 822 (16384 bytes) from dataset atlas
    zfetching object 319 block 820 (16384 bytes) from dataset atlas
    zfetching object 319 block 821 (16384 bytes) from dataset atlas
    zfetching object 319 block 821 (16384 bytes) from dataset atlas

    Thanks again,
    //Jim Klimov
