Archive for May, 2009

CommunityOne Domination

Friday, May 29th, 2009

I’m up to 3 sessions! Come say “hi” at CommunityOne West… I’ll be:

  • Presenting a 50 minute ZFS talk on Monday June 1st, focusing on features and application.
  • Presenting a 20 minute Use Case for Crossbow at the Crossbow BOF Monday evening.
  • Presenting a 2 hour zero-to-hero Becoming a ZFS Ninja “deep dive” on Tuesday.

Lots of quality goodies at the show. If you have any suggestions for storage topics I should cover in my 50 minute session please let me know. I want to make sure I’m helping people learn, not just rambling like a marketing droid.

Nirvanix: Cloud hype at its most annoying

Thursday, May 28th, 2009

Nirvanix is a cloud storage company that offers several solutions around their Storage Delivery Network (SDN). Nirvanix is at its core an API-based cloud storage solution, similar to Amazon S3. In fact, what they’ve done is simply create a “better-s3-than-s3” solution, which adds a lot of intelligence to the backend storage to give you the benefits of global load balancing and Content Delivery Network (CDN) capabilities. As the CEO says, take Amazon S3 and smash it together with the Akamai CDN and you get Nirvanix.

Here’s what irritates me… they keep beating on this “the box is dead” drum. That slogan is misleading at best, and hypocritical at worst. Hypocritical because one of the products they sell based on the Nirvanix SDN is a gateway NAS software solution called CloudNAS, which is really just a FUSE module for the SDN. So I think what they really mean is “their box is dead… ours is great.” A much more truthful slogan might be “disk is dead”, as you could call their CloudNAS box a “diskless storage server”.

There is really a lot of confusion in the cloud space between what is available as a common-protocol storage solution (NFS, CIFS, iSCSI, etc.) versus an API-based solution like Nirvanix or S3. Are you going to run Exchange on Nirvanix? No, although you might back up Exchange to a service offered by some company which uses the Nirvanix SDN as its backend.

The same problem exists for S3… people commonly claim they use S3 for all sorts of things, but more often than not, they aren’t using S3 at all… rather they are buying a service from someone who themselves uses S3 as the backend store. It’s misleading and confusing to new consumers and busy IT pros who can’t keep up all the time.

So, Nirvanix is a great solution for any of you currently unhappy with S3’s scalability… but the box is not dead. Not by a long shot. It’s just a lot of marketing hype that keeps potential customers from fully understanding what the product really is.

If you are in the real world, not developing your own applications, and are interested in real storage protocols that anyone can use out of the box, go check out Zetta (currently in beta). Zetta is currently planning to offer multi-protocol access (NFSv3, NFSv4, CIFS, WebDAV, SFTP, with more to come) on a pay-for-what-you-use model. No setup, just mount and go… that is the way it ought to be.

    Perhaps the best way to distinguish the two approaches is based on audience…

  • If you’re a (web) developer who just wants a way to store data that you don’t own… consider Amazon S3 or Nirvanix SDN.
  • If you’re a SysAdmin who wants a remote data store that inter-operates with your existing infrastructure… go Zetta.
  • … alternatively, if you want local storage that actually works and is easy to use, get an open storage platform.

Just as an aside… Sun knows how to smash stuff that’s actually working and continues to work. Far more entertaining than beating up eBay scrap systems in a field.

Understanding ZFS: Prefetch

Thursday, May 14th, 2009

One of the great mysteries of ZFS is prefetch. Failing to understand the prefetch mechanisms, how they work, and what they intend to do for you can cause a lot of confusion, so here we’ll dig our fingers into the subject. The first thing to understand is that “ZFS Prefetch” may refer to file-level prefetch and/or the virtual device read-ahead cache; we’ll discuss both here.

VDev Read-Ahead Cache (SPA)

When reading data from spinning media, the bulk of the service time is spent positioning the disk head. Once the head is in place, actually reading data is super speedy. To capitalize on this reality, ZFS provides vdev_cache, a virtual device read-ahead cache. There are 3 tunables:

  • zfs_vdev_cache_max: Defaults to 16KB; reads smaller than this size will be inflated to the size given by zfs_vdev_cache_bshift (64K by default).
  • zfs_vdev_cache_size: Defaults to 10MB; Total size of the per-disk cache
  • zfs_vdev_cache_bshift: Defaults to 16; this is a bit shift value, so 16 represents 64K. If you reduce the value to 13 it represents 8K.

So the idea is that if you go to all the trouble of moving the disk head(s) around to get some data, you might as well make it worth your while. If it’s a large I/O it’s just put into the cache until someone comes back for it, but if it’s a small I/O, just grab a bit more data while you’re already there and cache it just in case. Due to this design, even if you don’t come back for that data later, you aren’t losing much if anything… you already had to accept the latency for the I/O anyway. The data is stored in a per-vdev (typically a physical disk) cache of 10MB. Unlike ARC, this is an old-school LRU (Least Recently Used) cache which just rolls data through. This cache applies only to reads; it’s a read-ahead cache, not a write cache (if you want that, see ZIL).
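
To make the arithmetic concrete, here is a minimal Python sketch of how a small read might be inflated. It is only a model of the behavior described above, not the ZFS code: the function name inflate_read is made up for illustration, and the handling of reads exactly at the 16KB limit and of reads that straddle a block boundary are my assumptions.

```python
# Defaults quoted above; the constant names simply mirror the tunables.
ZFS_VDEV_CACHE_MAX = 16 * 1024   # reads larger than this are left alone
ZFS_VDEV_CACHE_BSHIFT = 16       # 1 << 16 = 64K


def inflate_read(offset, size,
                 bshift=ZFS_VDEV_CACHE_BSHIFT,
                 cache_max=ZFS_VDEV_CACHE_MAX):
    """Return the (offset, size) a read would be issued as (hypothetical model)."""
    block = 1 << bshift
    if size > cache_max:
        # Large I/O: not inflated.
        return offset, size
    # Align the start down to a block boundary.
    start = offset & ~(block - 1)
    if offset + size > start + block:
        # Assumption: a read that straddles two blocks is passed through as-is.
        return offset, size
    return start, block


print(1 << 16, 1 << 13)                       # bshift 16 vs 13, in bytes
print(inflate_read(70000, 4096))              # small read, default bshift
print(inflate_read(66000, 4096, bshift=13))   # same idea with bshift tuned to 13
print(inflate_read(0, 131072))                # large read, untouched
```

The point of the sketch is just that bshift controls how much extra data rides along with every small read: 64K by default, 8K if you drop it to 13.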

If you dig around in the code, this read-ahead cache is implemented in the vdev_cache_* functions. Here is a sample dtrace script to watch it in action:

#!/usr/sbin/dtrace -qs

/*
 * Used for monitoring vdev_cache (per-disk block prefetch)
 *   vdev_cache_read(); Returns 0 on cache hit, errno on miss.
 *   vdev_cache_allocate(); Creates new cache entry in LRU
 *   vdev_cache_write(); Update cache entry on write
 *   vdev_cache_evict(); Evict entry to make room for new...
 *   vdev_cache_purge(); Purge cache.
 */

fbt:zfs:vdev_cache_read:entry
{
	self->io_offset = args[0]->io_offset;
	self->io_size = args[0]->io_size;
	self->vdev_guid = args[0]->io_vd->vdev_guid;
	self->start = timestamp;
}

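fbt:zfs:vdev_cache_read:return
/self->start/
{
	/* A return value of 0 is a cache hit; anything else is a miss. */
	printf("%s: %d read %d bytes at offset %d: %s\n", probefunc, self->vdev_guid, self->io_size, self->io_offset, args[1] == 0 ? "HIT" : "MISS");

	/* Clear the thread-local variables so they are freed. */
	self->io_offset = 0;
	self->io_size = 0;
	self->vdev_guid = 0;
	self->start = 0;
}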

Here’s what it looks like when it runs. Please note the long int is the GUID (Globally Unique Identifier) of the disk in question, useful for figuring out whether I/Os are going to the same disk or to multiple disks:

vdev_cache_read: 14901677279437128904 read 131072 bytes at offset 138394293760: MISS
vdev_cache_read: 14901677279437128904 read 131072 bytes at offset 138394555904: MISS
vdev_cache_read: 16158491400258519295 read 65536 bytes at offset 138394489856: MISS
vdev_cache_read: 16158491400258519295 read 129536 bytes at offset 138394948608: MISS
vdev_cache_read: 16158491400258519295 read 84992 bytes at offset 138395151360: MISS
vdev_cache_read: 16158491400258519295 read 40960 bytes at offset 138395281408: MISS
vdev_cache_read: 14901677279437128904 read 131072 bytes at offset 138394686976: MISS

Please note, I offer the above script for those looking to better understand the internals of vdev_cache, not as a production tool.

Now here’s the kicker… once upon a time the read-ahead cache cached all data blocks, but the performance hit was such that almost all I/O got pushed to 64K and the cache rolled over way too fast. So in Nevada Build 70 (snv_70) it was changed to only cache on meta-data requests. In the ZIO pipeline, one of the stages is pulling the block pointer, at which point it will see if it’s meta-data, and if not it sets a DONT_CACHE flag.

Along with the snv_70 changes, kstats were added. The kstats are aggregated for all vdev_caches, so you don’t get per disk granularity, but they can be handy (or at least interesting) observability points:

benr@quadra ~$ kstat -p zfs:0:vdev_cache_stats
zfs:0:vdev_cache_stats:class    misc
zfs:0:vdev_cache_stats:crtime   23.410498211
zfs:0:vdev_cache_stats:delegations      9960
zfs:0:vdev_cache_stats:hits     12725
zfs:0:vdev_cache_stats:misses   20865
zfs:0:vdev_cache_stats:snaptime 886853.023016843
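
If you want a single number out of those counters, a hit ratio is easy to derive. The following is a small Python sketch that parses `kstat -p` style output and computes hits / (hits + misses); the helper names parse_kstat and hit_ratio are hypothetical, and the sample text is simply the output shown above.

```python
SAMPLE = """\
zfs:0:vdev_cache_stats:class    misc
zfs:0:vdev_cache_stats:crtime   23.410498211
zfs:0:vdev_cache_stats:delegations      9960
zfs:0:vdev_cache_stats:hits     12725
zfs:0:vdev_cache_stats:misses   20865
zfs:0:vdev_cache_stats:snaptime 886853.023016843
"""


def parse_kstat(text):
    """Map the last component of each statistic name to its value (as a string)."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank lines, prompts, anything unexpected
        name, value = parts
        stats[name.rsplit(":", 1)[-1]] = value
    return stats


def hit_ratio(stats):
    """Fraction of vdev_cache reads that were hits; 0.0 if there were no reads."""
    hits = int(stats["hits"])
    misses = int(stats["misses"])
    total = hits + misses
    return hits / total if total else 0.0


stats = parse_kstat(SAMPLE)
print("vdev_cache hit ratio: {:.1%}".format(hit_ratio(stats)))
```

Keep in mind the counters are cumulative since boot, so for a live view you would sample twice and work from the difference.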

So, you can tune the read-ahead cache, but should you? The answer is almost universally no. Of the 3 values above the only one I’ve seen tuned is bshift, changing from 16 to 13, thus only inflating reads to 8K. This should only be done if I/O is very expensive and extremely random and if you’re running snv_70 or older. NFS or mail servers are examples, or if you’re using iSCSI over a slow link.

File-Level Prefetch (DMU)

While the vdev read-ahead cache is implemented within the ZFS I/O pipeline in the SPA layer, file-level prefetch occurs up in the DMU layer and feeds data to the ARC (Adaptive Replacement Cache, the pool-wide read cache). When you hear about ZFS’s intelligent prefetch, this is what people are talking about.

The prefetch is pretty simple. In a nutshell, if you read a block we prefetch the next block. If you read that block we just prefetched, we go and grab 2 blocks. Got those blocks? Grab 3. It’s more intelligent than this, but you get the idea. This can go all the way up to prefetching 256 blocks at a time. This behavior really benefits sequential read streams, where the data is being prefetched into the ARC before you actually need it, which means more efficient disk I/O and faster response. It actually gets even more complex than this, because it can determine if multiple co-linear streams are being prefetched and collapse them into a single strided prefetch. If you’ve seen the ZFS: The Last Word in File Systems presentation, this is what’s happening when multiple users watch the Matrix.
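
As a rough illustration of that ramp-up, here is a toy Python model of the simplified “1, then 2, then 3… up to 256” description. It is deliberately naive: the function name prefetch_ramp is made up, and the real zfetch logic is more intelligent than a straight count, so treat this as a sketch of the idea rather than the algorithm.

```python
MAX_PREFETCH_BLOCKS = 256  # upper bound quoted above


def prefetch_ramp(sequential_hits, cap=MAX_PREFETCH_BLOCKS):
    """Blocks prefetched after each consecutive sequential hit (toy model)."""
    return [min(n, cap) for n in range(1, sequential_hits + 1)]


ramp = prefetch_ramp(300)
print(ramp[:5])    # the first few steps of the ramp
print(ramp[-1])    # the ramp stops growing once it reaches the cap
```

With the default 128K recordsize, a prefetch of 256 blocks means up to 32MB is requested ahead of a well-behaved sequential reader.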

Tuning file-level prefetch is straightforward… you can leave it on (default) or turn it off via zfs_prefetch_disable. Should you? No, not unless your workload is extremely random, and always random. ZFS prefetches intelligently, so the overhead is minimal: if it’s not useful it won’t ever prefetch much. If you’re doing random access to 8K or smaller files on iSCSI, you should benchmark with it turned off; otherwise let it be. In my experience disabling prefetch, even on heavily loaded systems, had little or no impact on reducing random physical disk I/O. When in doubt, keep it enabled.

Can you monitor file-level prefetch? Of course! Look for dmu_zfetch(), which is called from dbuf_read() and works through the various dmu_zfetch_*() functions down to dbuf_prefetch(). Here is an example script:

#!/usr/sbin/dtrace -qs 

/* This should look at the arguments to dmu_zfetch_fetch */
fbt:zfs:dmu_zfetch_fetch:entry
{
        /* arg1 is the starting block ID, arg2 the number of blocks requested. */
        printf("Prefetching object %d from dataset %s : %d blocks for blockid %d\n",
			args[0]->dn_objset->os_dsl_dataset->ds_dir->dd_object,
			args[0]->dn_objset->os_dsl_dataset->ds_dir->dd_myname,
			arg2, arg1);
}

fbt:zfs:dbuf_prefetch:entry
{
	printf("  zfetching object %d block %d (%d bytes) from dataset %s\n",
			args[0]->dn_objset->os_dsl_dataset->ds_dir->dd_object,
			arg1,
			args[0]->dn_datablksz,
			args[0]->dn_objset->os_dsl_dataset->ds_dir->dd_myname  );
}

This script watches dmu_zfetch_fetch request a run of prefetch blocks, and then dbuf_prefetch, which actually goes and gets each of them:

Prefetching object 53 from dataset benr : 2 blocks for blockid 41
  zfetching object 53 block 41 (131072 bytes) from dataset benr
  zfetching object 53 block 42 (131072 bytes) from dataset benr
Prefetching object 53 from dataset benr : 3 blocks for blockid 43
  zfetching object 53 block 43 (131072 bytes) from dataset benr
  zfetching object 53 block 44 (131072 bytes) from dataset benr
  zfetching object 53 block 45 (131072 bytes) from dataset benr
Prefetching object 53 from dataset benr : 4 blocks for blockid 46
  zfetching object 53 block 46 (131072 bytes) from dataset benr
  zfetching object 53 block 47 (131072 bytes) from dataset benr
  zfetching object 53 block 48 (131072 bytes) from dataset benr
  zfetching object 53 block 49 (131072 bytes) from dataset benr
Prefetching object 53 from dataset benr : 4 blocks for blockid 50
  zfetching object 53 block 50 (131072 bytes) from dataset benr
  zfetching object 53 block 51 (131072 bytes) from dataset benr
  zfetching object 53 block 52 (131072 bytes) from dataset benr
  zfetching object 53 block 53 (131072 bytes) from dataset benr
Prefetching object 53 from dataset benr : 5 blocks for blockid 54
  zfetching object 53 block 54 (131072 bytes) from dataset benr
  zfetching object 53 block 55 (131072 bytes) from dataset benr
  zfetching object 53 block 56 (131072 bytes) from dataset benr
  zfetching object 53 block 57 (131072 bytes) from dataset benr
  zfetching object 53 block 58 (131072 bytes) from dataset benr
...

Both these forms of prefetch add up to a very good way to maximize I/O activity. Are they perfect for every situation? No; the most common complaint tends to come from databases, which strictly work in fixed 8K blocks and manage their own caches very effectively. If you think you have such a case, file-level prefetch can be tuned on the fly using mdb. I encourage you to play with it and see what is best for your workload, but when in doubt leave it enabled. In common usage, both vdev read-ahead and file-level prefetch can help make sure the I/O you may do has already been done and is sitting ready for you in DRAM.

Become a ZFS Ninja at CommunityOne West

Friday, May 8th, 2009

JavaOne is coming up, the first week of June, and that means CommunityOne is back! There is a whole week of goodness, starting with the HA Cluster Summit Sunday May 31st, then CommunityOne June 1st thru the 3rd, and JavaOne the rest of the week.

On June 2nd, I’ll be giving a 2 hour zero-to-hero talk on ZFS to give you Ninja like skills. Here is a brief outline of the talk:

  • Creating Pools & Layout Schemes (RAID)
  • Pool Maintenance & Handling Physical Devices in Solaris
  • Creating Filesystem Datasets
  • Manipulating Dataset Properties
  • Creating Volume Datasets
  • Snapshots
  • Replication (zfs send/recv)
  • Backup Considerations & Methodology
  • Sharing Filesystem Datasets with NFS
  • Sharing Filesystem Datasets with CIFS
  • Sharing Volume Datasets with iSCSI
  • Intro to Related Technologies: COMSTAR, Cloud & Virtualization Applications, SNDR Replication, etc.

This is a deep dive talk with the intention of being comprehensive, so if you’re new to ZFS please join us. More especially, if you’ve got some ZFS experience but think you might have some gaps, or want to learn about aspects you may have missed or topics not found in the manuals, this is the place to really hone your ninja skills.

My talk is the 2nd in a 4 session track. I’ll follow Chris Armes, who will talk about deployment; following me is Nick Solter (co-author of The OpenSolaris Bible) to talk about OpenHA Cluster, and then Jerry “The Man” Jelinek (another co-author of The OpenSolaris Bible) to talk about Containers (aka: Zones) and Virtualization.

Despite what you may have read, this deep dive is free. Please, come one come all and hone your OpenSolaris skills and have some fun. This is an excellent opportunity for free training, don’t pass it up! And this is real training… we’re focusing on teaching skills, not just presenting marketing slides… you will learn something. :)

Becoming a ZFS Ninja… part of an all day OpenSolaris Track… CommunityOne San Francisco… June 2nd… be there!

ONStor Pantera LS 2100: ZFS in a Can

Wednesday, May 6th, 2009

Here’s something in the “old news I didn’t catch” dept… ONStor Pantera LS 2100 “a breakthrough storage platform that delivers enterprise class features at entry level prices.” Why do we care? It’s ZFS-based. It joins similar storage solutions such as NexentaStor and the almighty Sun Storage 7000 Unified Storage Systems (aka Amber Road + FishWorks).

ONStor is offering two configs:

  • The LS 2130: 4 Intel Cores, 8GB of RAM, 2 Gigabit Ethernet ports, up to 48 disks.
  • The LS 2150: Which doubles the specs of the 2130, offering 8 Intel Cores, 16GB of RAM, 4 Gigabit Ethernet ports, and up to 96 disks.

Both configurations offer a 10Gb Ethernet option and utilize 3U 15 Disk (3.5″) enclosures for expansion. Frankly, based on the pictures on the site, it appears to simply be re-badged Dell PowerEdge 2950 and PowerVault SAS arrays in a pre-configured format. Like other players in this space, you’re paying for the integration and support of a turn-key solution as opposed to going the hard-core do-it-yourself route.

ONStor is a quality brand in the storage industry and I’m glad to see them capitalizing on the power of ZFS. Certainly NexentaStor has a solid lead on them, but the more players we have the better for everyone… there is enough storage business to go around.

… and it’s named Pantera… how can you not love that?

I haven’t actually tried the ONStor, but if anyone from the company wants to send me a demo unit I’d be more than happy to write up a solid review on it. ;)