Understanding ZFS: Prefetch

Posted on May 14, 2009

One of the great mysteries of ZFS is prefetch. Failing to understand the prefetch mechanisms, how they work, and what they intend to do for you can cause a lot of confusion, so here we’ll dig our fingers into the subject. The first thing to understand is that “ZFS Prefetch” may refer to file-level prefetch and/or the virtual device read-ahead cache; we’ll discuss both here.

VDev Read-Ahead Cache (SPA)

When reading data from spinning media, the bulk of the service time is spent positioning the disk head. Once the head is in place, actually reading data is super speedy. To capitalize on this reality, ZFS provides vdev_cache, a virtual device read-ahead cache. There are 3 tunables:

  • zfs_vdev_cache_max: Defaults to 16KB; reads smaller than this size will be inflated to the size given by zfs_vdev_cache_bshift.
  • zfs_vdev_cache_size: Defaults to 10MB; total size of the per-disk cache.
  • zfs_vdev_cache_bshift: Defaults to 16; this is a bit shift value, so 16 represents 64K (1 << 16). If you reduce the value to 13 it represents 8K.
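To make the bit-shift arithmetic concrete, here is a minimal Python sketch of how a small read might be inflated using the tunables above. It is an illustration, not the ZFS code: the function name inflated_read is hypothetical, and it assumes an inflated read is aligned down to a cache-block boundary and that reads straddling a boundary are left alone.

```python
# Values copied from the tunables listed above.
ZFS_VDEV_CACHE_MAX = 16 * 1024      # reads smaller than this get inflated
ZFS_VDEV_CACHE_BSHIFT = 16          # log2 of the inflated read size


def inflated_read(offset, size, bshift=ZFS_VDEV_CACHE_BSHIFT,
                  cache_max=ZFS_VDEV_CACHE_MAX):
    """Return the (offset, size) that would actually be read from disk.

    Assumptions (not taken from the ZFS source): inflated reads are
    aligned down to a multiple of the cache block size, and a read that
    straddles two cache blocks is passed through unchanged.
    """
    block = 1 << bshift
    if size >= cache_max:
        return offset, size                 # large read: not inflated
    aligned = offset & ~(block - 1)         # round down to block boundary
    if offset + size > aligned + block:
        return offset, size                 # crosses a boundary: leave alone
    return aligned, block


print(1 << 16, 1 << 13)                     # 65536 8192
print(inflated_read(70000, 4096))           # (65536, 65536)
print(inflated_read(66000, 4096, bshift=13))  # (65536, 8192)
```

With the default shift of 16, a 4K read at offset 70000 turns into one 64K read starting at 65536. With the shift lowered to 13, the same kind of read only grows to 8K.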

So the idea is that if you go to all the trouble of moving the disk head(s) around to get some data, you might as well make it worth your while. If it’s a large I/O it’s just put into the cache until someone comes back for it, but if it’s a small I/O, just grab a bit more data while you’re already there and cache it just in case. Due to this design, even if you don’t come back for that data later, you aren’t losing much if anything… you already had to accept the latency for the I/O anyway. The data is stored in a per-vdev (typically a physical disk) cache of 10MB. Unlike ARC, this is an old-school LRU (Least Recently Used) cache which just rolls data through. This cache applies only to reads; it’s a read-ahead cache, not a write cache (if you want that, see ZIL).
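The “rolls data through” behavior of a plain LRU can be sketched in a few lines of Python. This is a toy model rather than the vdev_cache implementation: the class name VdevCacheSketch is hypothetical, and it assumes entries are whole cache blocks keyed by their aligned offset.

```python
from collections import OrderedDict


class VdevCacheSketch:
    """Toy per-disk LRU read cache keyed by aligned block offset."""

    def __init__(self, size=10 * 1024 * 1024, bshift=16):
        self.block = 1 << bshift
        self.capacity = size // self.block    # 160 entries with the defaults
        self.entries = OrderedDict()          # oldest entry first

    def read(self, offset):
        """Return True on a cache hit, False on a miss (which fills)."""
        key = offset & ~(self.block - 1)
        if key in self.entries:
            self.entries.move_to_end(key)     # now the most recently used
            return True
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        self.entries[key] = True
        return False


cache = VdevCacheSketch()
print(cache.capacity)      # 160
print(cache.read(0))       # False: first touch is a miss and fills the block
print(cache.read(4096))    # True: same 64K block, served from cache
```

With the defaults, 10MB divided into 64K blocks gives only 160 entries per disk, which is why a busy disk rolls through the cache quickly.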

If you dig around in the code, this read-ahead cache is implemented in the vdev_cache_* functions. Here is a sample DTrace script to watch it in action:

#!/usr/sbin/dtrace -qs

/*
 * Used for monitoring vdev_cache (per-disk block prefetch)
 *   vdev_cache_read(); Returns 0 on cache hit, errno on miss.
 *   vdev_cache_allocate(); Creates new cache entry in LRU
 *   vdev_cache_write(); Update cache entry on write
 *   vdev_cache_evict(); Evict entry to make room for new...
 *   vdev_cache_purge(); Purge cache.
 */

/* Save the details of the zio passed to vdev_cache_read() */
fbt:zfs:vdev_cache_read:entry
{
	self->io_offset = args[0]->io_offset;
	self->io_size = args[0]->io_size;
	self->vdev_guid = args[0]->io_vd->vdev_guid;
	self->start = timestamp;
}

/* On return, args[1] is the return value: 0 means the cache satisfied the read */
fbt:zfs:vdev_cache_read:return
/self->start/
{
	printf("%s: %d read %d bytes at offset %d: %s\n", probefunc, self->vdev_guid, self->io_size, self->io_offset, args[1] == 0 ? "HIT" : "MISS");
	self->start = 0;
}

Here’s what it looks like when it runs. Please note the long int is the GUID (Global Unique Identifier) of the disk in question, useful for figuring out whether I/Os are going to the same disk or to multiple disks:

vdev_cache_read: 14901677279437128904 read 131072 bytes at offset 138394293760: MISS
vdev_cache_read: 14901677279437128904 read 131072 bytes at offset 138394555904: MISS
vdev_cache_read: 16158491400258519295 read 65536 bytes at offset 138394489856: MISS
vdev_cache_read: 16158491400258519295 read 129536 bytes at offset 138394948608: MISS
vdev_cache_read: 16158491400258519295 read 84992 bytes at offset 138395151360: MISS
vdev_cache_read: 16158491400258519295 read 40960 bytes at offset 138395281408: MISS
vdev_cache_read: 14901677279437128904 read 131072 bytes at offset 138394686976: MISS

Please note, I offer the above script for those looking to better understand the internals of vdev_cache, not as a production tool.

Now here’s the kicker… once upon a time the read-ahead cache cached all data blocks, but the performance hit was such that almost all I/O got pushed to 64K and the cache rolled over way too fast. So in Nevada Build 70 (snv_70) it was changed to only cache on meta-data requests. In the ZIO pipeline, one of the stages is pulling the block pointer, at which point it will see if it’s meta-data, and if not it sets a DONT_CACHE flag.

Along with the snv_70 changes, kstats were added. The kstats are aggregated for all vdev_caches, so you don’t get per-disk granularity, but they can be handy (or at least interesting) observability points:

benr@quadra ~$ kstat -p zfs:0:vdev_cache_stats
zfs:0:vdev_cache_stats:class    misc
zfs:0:vdev_cache_stats:crtime   23.410498211
zfs:0:vdev_cache_stats:delegations      9960
zfs:0:vdev_cache_stats:hits     12725
zfs:0:vdev_cache_stats:misses   20865
zfs:0:vdev_cache_stats:snaptime 886853.023016843
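
As a quick way to turn those counters into something readable, the following Python sketch parses the kstat -p output shown above and computes a hit ratio. The helper name parse_kstat is hypothetical, and defining the ratio as hits / (hits + misses), leaving delegations out, is an assumption about how you might want to read the numbers.

```python
# Sample output copied from the kstat command above.
KSTAT_OUTPUT = """\
zfs:0:vdev_cache_stats:class    misc
zfs:0:vdev_cache_stats:crtime   23.410498211
zfs:0:vdev_cache_stats:delegations      9960
zfs:0:vdev_cache_stats:hits     12725
zfs:0:vdev_cache_stats:misses   20865
zfs:0:vdev_cache_stats:snaptime 886853.023016843
"""


def parse_kstat(text):
    """Map the last component of each statistic name to its raw string value."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, value = line.split(None, 1)     # name, then whitespace, then value
        stats[name.rsplit(":", 1)[1]] = value.strip()
    return stats


stats = parse_kstat(KSTAT_OUTPUT)
hits, misses = int(stats["hits"]), int(stats["misses"])
hit_ratio = hits / (hits + misses)
print(f"vdev_cache hit ratio: {hit_ratio:.1%}")   # 37.9% for the sample above
```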

So, you can tune the read-ahead cache, but should you? The answer is almost universally no. Of the 3 values above the only one I’ve seen tuned is bshift, changing from 16 to 13, thus only inflating reads to 8K. This should only be done if I/O is very expensive and extremely random and if you’re running snv_70 or older. NFS or mail servers are examples, or if you’re using iSCSI over a slow link.

File-Level Prefetch (DMU)

While the vdev read-ahead cache is implemented within the ZFS I/O pipeline in the SPA layer, file-level prefetch occurs up in the DMU layer and feeds data to the ARC (Adaptive Replacement Cache, the pool-wide read cache). When you hear about ZFS’s intelligent prefetch, this is what people are talking about.

The prefetch is pretty simple: in a nutshell, if you read a block we prefetch the next block. If you read that block we just prefetched, we go and grab 2 blocks. Got those blocks? Grab 3. It’s more intelligent than this, but you get the idea. This can go all the way up to prefetching 256 blocks at a time. This behavior really benefits sequential read streams, where the data is prefetched into the ARC before you actually need it, which means more efficient disk I/O and faster response. It actually gets even more complex than this, because it can determine if multiple co-linear streams are being prefetched and collapse them into a single strided prefetch. If you’ve seen the ZFS: The Last Word in File Systems presentation, this is what’s happening when multiple users watch the Matrix.
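
The simplified ramp described above (1 block, then 2, then 3, up to 256) can be sketched like this in Python. It models only the nutshell version from the text, not the real stream detection or strided prefetch logic, and the function name prefetch_sizes is hypothetical.

```python
MAX_PREFETCH_BLOCKS = 256   # upper bound quoted in the text


def prefetch_sizes(sequential_hits, cap=MAX_PREFETCH_BLOCKS):
    """Return the prefetch size issued after each consecutive sequential hit.

    Simplification: every time the reader consumes what was prefetched,
    the next prefetch grows by one block until it reaches the cap.
    """
    sizes = []
    size = 0
    for _ in range(sequential_hits):
        size = min(size + 1, cap)
        sizes.append(size)
    return sizes


print(prefetch_sizes(5))         # [1, 2, 3, 4, 5]
print(prefetch_sizes(300)[-1])   # 256
```

A stream that stops being sequential never climbs the ramp, which is why the cost of leaving prefetch enabled stays small for random workloads.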

Tuning file-level prefetch is straightforward… you can leave it on (default) or turn it off via zfs_prefetch_disable. Should you? No, not unless your workload is extremely random, and always random. Because ZFS prefetches intelligently, the overhead is minimal: if it’s not useful it won’t ever prefetch much. If you’re doing random access to 8K or smaller files on iSCSI, you should benchmark with it turned off; otherwise let it be. In my experience disabling prefetch, even on heavily loaded systems, had little or no impact on reducing random physical disk I/O. When in doubt, keep it enabled.

Can you monitor file-level prefetch? Of course! Look for the dmu_zfetch() functions: dbuf_read() calls out to dmu_zfetch(), which works through the various dmu_zfetch_*() functions and ends up in dbuf_prefetch(). Here is an example script:

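#!/usr/sbin/dtrace -qs

/* This should look at the arguments to dmu_zfetch_fetch */
fbt:zfs:dmu_zfetch_fetch:entry
{
        self->object = args[0]->dn_object;
        self->blockid = arg1;
        self->blocks  = arg2;
        self->start = timestamp;
        printf("Prefetching object %d from dataset %s : %d blocks for blockid %d\n",
			args[0]->dn_object,
			args[0]->dn_objset->os_dsl_dataset->ds_dir->dd_myname,
			arg2, arg1);
}

/* Assumed probe: dbuf_prefetch(dn, blkid), block size taken from dn_datablksz */
fbt:zfs:dbuf_prefetch:entry
{
	printf("  zfetching object %d block %d (%d bytes) from dataset %s\n",
			args[0]->dn_object, arg1, args[0]->dn_datablksz,
			args[0]->dn_objset->os_dsl_dataset->ds_dir->dd_myname  );
}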

This script will watch for dmu_zfetch_fetch to request a bunch of prefetch blocks, and then dbuf_prefetch, which is called once per block and actually goes and gets them:

Prefetching object 53 from dataset benr : 2 blocks for blockid 41
  zfetching object 53 block 41 (131072 bytes) from dataset benr
  zfetching object 53 block 42 (131072 bytes) from dataset benr
Prefetching object 53 from dataset benr : 3 blocks for blockid 43
  zfetching object 53 block 43 (131072 bytes) from dataset benr
  zfetching object 53 block 44 (131072 bytes) from dataset benr
  zfetching object 53 block 45 (131072 bytes) from dataset benr
Prefetching object 53 from dataset benr : 4 blocks for blockid 46
  zfetching object 53 block 46 (131072 bytes) from dataset benr
  zfetching object 53 block 47 (131072 bytes) from dataset benr
  zfetching object 53 block 48 (131072 bytes) from dataset benr
  zfetching object 53 block 49 (131072 bytes) from dataset benr
Prefetching object 53 from dataset benr : 4 blocks for blockid 50
  zfetching object 53 block 50 (131072 bytes) from dataset benr
  zfetching object 53 block 51 (131072 bytes) from dataset benr
  zfetching object 53 block 52 (131072 bytes) from dataset benr
  zfetching object 53 block 53 (131072 bytes) from dataset benr
Prefetching object 53 from dataset benr : 5 blocks for blockid 54
  zfetching object 53 block 54 (131072 bytes) from dataset benr
  zfetching object 53 block 55 (131072 bytes) from dataset benr
  zfetching object 53 block 56 (131072 bytes) from dataset benr
  zfetching object 53 block 57 (131072 bytes) from dataset benr
  zfetching object 53 block 58 (131072 bytes) from dataset benr

Both these forms of prefetch add up to a very good way to maximize I/O activity. Are they perfect for every situation? No. The most common complaint tends to come from databases, which strictly work in fixed 8K blocks and manage their own caches very effectively. If you think you have such a case, file-level prefetch can be tuned on the fly using mdb. I encourage you to play with it and see what is best for your workload, but when in doubt leave it enabled. In common usage, both vdev read-ahead and file-level prefetch can help make sure the I/O you may do has already been done and is sitting ready for you in DRAM.