Archive for January, 2009

Understanding ZFS: Transaction Groups & Disk Performance

Friday, January 23rd, 2009

I’ve been deeply concerned about the number of people who continue to use iostat as the means to universally judge IO as “good” or “bad”. Before I explain why, let’s review iostat.

# iostat -xnM c0t1d0 1
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   19.9  240.9    2.1   18.4  0.0 13.7    0.2   52.4   4  56 c0t1d0
  127.2    0.0   14.7    0.0  0.0  1.0    0.1    7.7   2  78 c0t1d0
  116.0  375.0   13.6   21.6  0.1  3.2    0.1    6.5   7  82 c0t1d0
   27.0  407.0    2.6   30.8  0.1  9.3    0.1   21.5   6  99 c0t1d0
   95.0    6.0   11.7    0.2  0.0  0.9    0.1    8.6   1  78 c0t1d0
^C

The first 4 columns we can agree on: reads and writes per second, first as operations and then as megabytes. The following 6 columns are where the concern lies. These are 3 different ways of viewing essentially the same data: the active queue (sent to the device) and the wait queue (waiting to be sent). Note that all of these are based on kstats and can easily be re-formulated into custom tools.

Universal “rules of thumb” regarding queues are very dangerous. I’ve heard such ridiculous suggestions as anything over 5% busy is a problem. The busy time simply denotes the proportion of time that IO was active… thus 100% busy means you’re doing a lot of IO, not that it’s slow. Oversimplifying the interpretation is as old skool as suggesting that a CPU more than 75% busy needs to be upgraded, which is moronic (but sells a lot of servers).

When it comes to iostat you must carefully weigh the number of IOs and the size of IOs together with the service time, and come to a conclusion based on all three. If you’re doing a large streaming write you would expect to see 100% busy but very high throughput. Naturally, the great enemy of storage performance is a random workload that requires a lot of head movement; those seek times will kill you and things start to back up. In those cases tune your app… application caching is always a bigger win than adding spindles.

In interactive shared server environments the closest to a rule I’ve ever provided was that, in my experience, active service times (asvc_t) below 30ms are optimal, between 30ms and 100ms are worrisome, and higher than 100ms means that someone out there is probably unhappy. When IOs are regularly taking more than 100ms to complete, it is likely that the next fellow to type “ls” in an uncached directory is going to be pissed off.
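To make the “size of IOs” part concrete, here is a rough sketch that derives the average IO size from the first four iostat columns. The function name is my own invention for illustration, and it assumes the -M flag so that throughput is in megabytes per second.

```python
def avg_io_size_kb(reads_per_s, writes_per_s, mb_read_per_s, mb_written_per_s):
    """Average size of one IO in KB, from the iostat -xnM columns
    r/s, w/s, Mr/s and Mw/s (hypothetical helper, not part of iostat)."""
    ops = reads_per_s + writes_per_s
    if ops == 0:
        return 0.0  # idle interval: no IO, so no meaningful size
    return (mb_read_per_s + mb_written_per_s) * 1024.0 / ops


# First two sample rows from the iostat output above.
mixed = avg_io_size_kb(19.9, 240.9, 2.1, 18.4)
reads_only = avg_io_size_kb(127.2, 0.0, 14.7, 0.0)
print(f"mixed interval: {mixed:.1f} KB per IO")
print(f"read-only interval: {reads_only:.1f} KB per IO")
```

A high %b next to large average IO sizes and low service times usually just means the disk is streaming, which is what you want it to do.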
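If you want to apply that rule of thumb to a column of numbers, a minimal sketch might look like the following. The thresholds are simply my 30ms and 100ms figures from above; the function name and labels are made up for illustration, and none of this is a universal rule.

```python
def classify_service_time(asvc_t_ms):
    """Bucket an active service time (iostat asvc_t, in ms) using the
    30ms / 100ms rule of thumb for interactive shared servers."""
    if asvc_t_ms < 30:
        return "optimal"
    if asvc_t_ms <= 100:
        return "worrisome"
    return "unhappy"


# asvc_t values from the five sample iostat intervals above.
samples = [52.4, 7.7, 6.5, 21.5, 8.6]
labels = [classify_service_time(ms) for ms in samples]
for ms, label in zip(samples, labels):
    print(f"{ms:6.1f} ms -> {label}")
```

Only the first interval, the one with a heavy burst of writes, lands in the worrisome bucket, and a single interval is not a trend.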

Now, all this gets more complicated with ZFS in the mix. You have probably heard that ZFS is transactional and as a result is always consistent on disk. But few really spend time thinking that through. Similar to an OLTP database, transactions are created, work is done, and then finally committed. This commit sends the transaction (tx) into a transaction group (txg) for “sync” to disk. At any given time there are 3 transaction groups: one in an open state accepting transactions, one in a quiescing state ready for sync, and one being sync’ed to disk. (For the sake of simplicity I’ll leave O_DSYNC synchronous writes and the ZIL out of this discussion, for now.)

Between these transaction groups gathering writes in memory for orderly flush to disk and the ARC filesystem cache, most of your run-of-the-mill IO is served to and from memory. I have a great many machines servicing more than 100,000 read ops per second without a single resulting physical read IO. ZFS efficiency is truly incredible. So in this way, the physical metrics can have little to nothing to do with the actual user experience, making typical tuning based on iostat highly suspect, if not entirely meaningless.

So, first things first. On a ZFS system never look at iostat alone. Always open 2 terminals side-by-side and in one terminal watch fsstat zfs 1 and in another watch iostat -xn 1 (or 10 seconds, whatever you’re happy with). By watching both of these you’ll get a better idea of what’s really going on, and I expect that you’ll be impressed by what you see.

As for async writes, what I really would like to see is how these transaction groups are doing: how often they sync to disk, how much they sync, and how much time passes in between. Prior to snv_87 a transaction group would flush either upon fullness (1/8th of system memory) or when the txg_time tunable, defaulted to 5 seconds, expired. As a result, if you’ve looked at a system running something earlier than snv_87 and saw IO “spike” every 5 seconds, this is why… it’s normal and healthy. In snv_87 a new ZFS Write Throttle was introduced and among the changes the sync timer got pushed out to 30 seconds. So, likewise, if you have a box post-87 that “spikes” every 30 seconds you have a very healthy system.

Knowing this is all good and well, but I’d like to see it. After spending a good amount of time in the code I realized that spa_sync() is the function to watch. It’s what’s responsible for actually sync’ing the txg to disk (God Bless DTrace stack[] aggregations!). With this knowledge I wrote up a Dscript that I was proud of, but, of course, now that I knew what to look for, I found that Roch wrote up essentially the same thing 2 years ago… nevertheless, I made a tweak and here it is:

#!/usr/sbin/dtrace -qs

/*
 * spa_sync.d - ROCH http://blogs.sun.com/roch/entry/128k_suffice
 * mods by benr
 *
 * Measure I/O throughput as generated by spa_sync
 * Between the spa_sync entry and return probe
 * I count all I/O and bytes going through bdev_strategy.
 * This is a lower bound on what the device can do since
 * some aspects of spa_sync are non-concurrent I/Os.
 */

BEGIN {
        tt = 0; /* timestamp */
        b = 0; /* Bytecount */
        cnt = 0; /* iocount */
} 

spa_sync:entry
/(self->t == 0) && (tt == 0)/
{
        b = 0; /* reset the I/O byte count */
        cnt = 0;
        tt = timestamp;
        self->t = 1;
        printf("%Y", walltimestamp);
}

spa_sync:return
/(self->t == 1) && (tt != 0)/
{
        this->delta = (timestamp-tt);
        this->cnt = (cnt == 0) ? 1 : cnt; /* avoid divide by 0 */
        printf("\t: %d MB; %d ms of spa_sync; avg sz : %d KB; throughput %d MB/s\n",
                b / 1048576,
                this->delta / 1000000,
                b / this->cnt / 1024,
                (b * 1000000000) / (this->delta * 1048576));
        tt = 0;
        self->t = 0;
}

/* We only count I/O issued during an spa_sync */
bdev_strategy:entry
/tt != 0/
{
        cnt ++;
        b += (args[0]->b_bcount);
}

Here is a sample output, pre-snv_87:

# ./spa_sync.d
2009 Jan 23 06:12:28    : 44 MB; 743 ms of spa_sync; avg sz : 68 KB; throughput 59 MB/s
2009 Jan 23 06:12:33    : 81 MB; 1716 ms of spa_sync; avg sz : 79 KB; throughput 47 MB/s
2009 Jan 23 06:12:38    : 45 MB; 736 ms of spa_sync; avg sz : 65 KB; throughput 61 MB/s
2009 Jan 23 06:12:43    : 41 MB; 700 ms of spa_sync; avg sz : 67 KB; throughput 59 MB/s
2009 Jan 23 06:12:48    : 56 MB; 1287 ms of spa_sync; avg sz : 63 KB; throughput 43 MB/s
2009 Jan 23 06:12:53    : 35 MB; 668 ms of spa_sync; avg sz : 65 KB; throughput 52 MB/s
2009 Jan 23 06:12:58    : 61 MB; 1147 ms of spa_sync; avg sz : 62 KB; throughput 53 MB/s
2009 Jan 23 06:13:03    : 41 MB; 624 ms of spa_sync; avg sz : 60 KB; throughput 67 MB/s
2009 Jan 23 06:13:08    : 37 MB; 658 ms of spa_sync; avg sz : 60 KB; throughput 56 MB/s
2009 Jan 23 06:13:13    : 59 MB; 1035 ms of spa_sync; avg sz : 68 KB; throughput 57 MB/s
^C

Notice it hitting the 5 second mark nicely. This output is significantly more telling and encouraging than simply looking at iostat alone.
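If you capture that output to a file, checking the sync cadence is easy to script. Here is a rough sketch that assumes the exact line format shown in the sample output above; the regular expression and function names are my own and would need adjusting if you change the printf in the script.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2009 Jan 23 06:12:28    : 44 MB; 743 ms of spa_sync; avg sz : 68 KB; ...
LINE = re.compile(
    r"^(\d{4} \w{3} +\d+ \d\d:\d\d:\d\d)\s*: (\d+) MB; (\d+) ms of spa_sync"
)


def parse_syncs(lines):
    """Return (start time, MB written, sync duration in ms) per txg sync."""
    rows = []
    for line in lines:
        m = LINE.match(line)
        if m:
            when = datetime.strptime(m.group(1), "%Y %b %d %H:%M:%S")
            rows.append((when, int(m.group(2)), int(m.group(3))))
    return rows


def sync_gaps(rows):
    """Seconds between the start of consecutive syncs."""
    return [(b[0] - a[0]).total_seconds() for a, b in zip(rows, rows[1:])]


sample = [
    "2009 Jan 23 06:12:28    : 44 MB; 743 ms of spa_sync; avg sz : 68 KB; throughput 59 MB/s",
    "2009 Jan 23 06:12:33    : 81 MB; 1716 ms of spa_sync; avg sz : 79 KB; throughput 47 MB/s",
    "2009 Jan 23 06:12:38    : 45 MB; 736 ms of spa_sync; avg sz : 65 KB; throughput 61 MB/s",
]
rows = parse_syncs(sample)
gaps = sync_gaps(rows)
print("seconds between syncs:", gaps)
print("longest sync (ms):", max(ms for _, _, ms in rows))
```

On a pre-snv_87 box the gaps should hover around 5 seconds, and post-87 around 30. Gaps much shorter than that suggest the txg is filling before the timer fires.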

This of course also reminds us… Roch Rocks.

Death of a Hero: Robert Snively

Friday, January 23rd, 2009

Bob Snively is the storage giant you’ve probably never heard of. He was employed at Sun for a number of years and recently moved to Brocade. If you dig into the SCSI and Fibre Channel standards you’ll see his name again and again. His accomplishments are greater than I can list with any accuracy.

Yesterday, in my SNIA presentation I mentioned “my personal hero, Bob Snively”… afterwards, 2 different people came up to inform me that he had passed away just a couple of days ago. I couldn’t be more devastated… I’d hoped to find him around the show. In fact, I met him personally for the first time just in October at the last SNIA event. He seemed strong and healthy, but apparently was in remission at the time.

The official notice is here.

Next time you work on a system that uses a SCSI command set, whisper “Thanks Bob” for our main man.

Understanding ZFS: Disk Space Discrepancies

Wednesday, January 21st, 2009

Here’s a good ZFS trivia question to bewilder your friends, and one that comes up often: Why do the two following outputs for the same ZFS pool disagree?

# zfs list zones
NAME    USED  AVAIL  REFER  MOUNTPOINT
zones   634G   185G    69K  /zones

# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
zones                   832G    634G    198G    76%  ONLINE     -

If you add up USED and AVAIL from the zfs list output you get 819GB. So why does zfs list say we have 819GB but zpool list says we have an 832GB pool?

This is a question I have tried to answer in the past using zdb, quite unsuccessfully. I found the answer while digging through the ZFS code (dsl_pool.c):

uint64_t
dsl_pool_adjustedsize(dsl_pool_t *dp, boolean_t netfree)
{
        uint64_t space, resv;

        /*
         * Reserve about 1.6% (1/64), or at least 32MB, for allocation
         * efficiency.
         * XXX The intent log is not accounted for, so it must fit
         * within this slop.
         *
         * If we're trying to assess whether it's OK to do a free,
         * cut the reservation in half to allow forward progress
         * (e.g. make it possible to rm(1) files from a full pool).
         */
        space = spa_get_dspace(dp->dp_spa);
        resv = MAX(space >> 6, SPA_MINDEVSIZE >> 1);
        if (netfree)
                resv >>= 1;

        return (space - resv);
}

So 1/64th of the pool is reserved? It makes sense, a copy-on-write filesystem is in trouble if it truly hits 100% Used. But do those numbers fit?

1/64th of ZPool:     832GB / 64    = 13GB
Output Discrepancy:  832GB - 819GB = 13GB

Right on the money.
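For anyone who wants to run the numbers on their own pool, here is a small sketch of the same arithmetic. It mirrors the logic of the function above but is not the kernel code: it assumes SPA_MINDEVSIZE is 64MB (consistent with the “at least 32MB” comment), and it treats the SIZE column of zpool list as a stand-in for what spa_get_dspace() returns, which may not hold exactly on every pool layout.

```python
GIB = 1 << 30
MIB = 1 << 20

# Assumption: 64MB, so that SPA_MINDEVSIZE >> 1 is the 32MB floor
# mentioned in the dsl_pool.c comment.
SPA_MINDEVSIZE = 64 * MIB


def adjusted_size(space, netfree=False):
    """Usable bytes after the reserve of 1/64th of the pool (min 32MB).

    With netfree=True the reserve is halved, matching the case where
    the caller is trying to free space on a full pool.
    """
    resv = max(space >> 6, SPA_MINDEVSIZE >> 1)
    if netfree:
        resv >>= 1
    return space - resv


pool = 832 * GIB  # SIZE from the zpool list output above
usable = adjusted_size(pool)
print("reserved GB:", (pool - usable) // GIB)
print("usable GB:  ", usable // GIB)
```

On tiny pools the 32MB floor dominates instead: a 1GB pool gives up 32MB rather than 16MB.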

Therefore, let it be known! On top of the capacity lost to whatever RAID scheme you choose, you will lose 1/64th of the pool to the reserve. Based on prior experiments, I can verify definitively that when you hit 100% according to zfs list you cannot write any more data. Therefore, never report or monitor ZFS capacity based on zpool list! As a point of interest, df also invokes this adjustedsize function and therefore outputs correct numbers. If your monitoring system tracks disk capacity based on df, no change is needed.

The Showmanship of Steve Jobs, 20 Years Ago

Tuesday, January 20th, 2009

If you commonly watch Apple keynote events, I offer you a look back, almost 20 years, to a NeXTSTEP presentation in San Francisco. Notice how his style is almost unchanged in all this time.

Part 1:

Part 2:

I have one of these NeXT Turbo Slabs… amazing system. If you are interested in NeXT gear or software, I buy my gear from Black Hole, Inc.

The Clash of Community, Company, and Celebrity

Monday, January 19th, 2009

In my last post I noted a pinch of rage over John Plocher’s Sun RIF. A commenter rightly sees a divided logic between my prior statements that Sun hasn’t downsized enough and my anger over the loss of a single individual whom I know. I’ve edited the prior post, but I think the paradox is an interesting one on which to expound.

What happens when someone of public notoriety or celebrity leaves a firm within which that person gained that reputation? This is not at all a new problem, but one that is, thanks to blogs such as this one and the ever hungry daily media, more vicious than in the past.

Within any given organization, each employee has some definable amount of presence: to co-workers, to other teams, to corporate partners, to friends, to family, to customers, to media, etc. The larger the footprint of that presence, the more backlash may be felt when that person leaves. When anyone leaves an organization they cause the water to stir more than when they arrive. The cost of the departure increases with the size of their presence and influence. Even those with a small footprint can have a significant impact on morale. Those with a large footprint can impact the company’s future as a whole.

Let’s start with an extreme case in the news today… Steve Jobs. There are thousands of employees at Apple doing amazing work, but the company’s image is tied to his face in a way that few other CEOs are. The upshot is his amazing reputation. The downside is a void, which he did not prepare the world to expect by increasing the public presence of his fellow executives. Is this good leadership or bad?

A more common example, especially in the old-world economy, was when a sales manager would leave… especially if it was abrupt or caused by a RIF. If improperly handled, end customers could lose confidence in the vendor.

A less common but long-running example is that of anyone within your organization who has written a book or column, or attained similar notoriety. Adrian Cockcroft was an excellent example. The face of Solaris performance leaves Sun. Was it scandalous? No, but plenty of customers scratched their heads and wondered if this was a bad sign. Sun has many other such examples. Nevertheless, the number of people that fit into this category is small.

With the rise of corporate blogging and community building efforts the danger grows significantly. Rather than a handful of writers and speakers, there are now hundreds of externally known and respected employees. The advantages are clear: the enterprise is now a collection of human faces, not just a cold logo and press release. The danger is, however, equally clear… when customers see faces, what happens when you RIF some of them? Even if they leave under the best of terms, end customers are now interested in who will fill that slot in the org chart, whether that person will be as forthcoming and public, etc. There are organizational and emotional ties being severed in a new and unique way, which are sometimes difficult to quantify.

Let’s make this personal… if Sun was not so public, did not embrace blogging so much, and did not create a community that provided so much transparency, would commentaries such as this one even exist? As the comment to my former entry rightly pointed out, how can one deal with the economic reality that re-orgs and cuts must occur to meet Wall Street expectations, while at the same time enduring the public outcry over individuals who may be affected?

One thing is clear: increasing your public notoriety is as essential in this new economy as building your resume. Not for vanity’s sake, but for self-preservation.

The RIFs are unpleasant. I never wish to see anyone lose their job. Will they ever be surgical or otherwise perfect? Never, sadly. Naturally, we all have a variety of opinions on why this is so, but I don’t want to get into the dirt more than I already have.

In the case of communities the situation becomes all the more difficult. Not only do you impact the employee and internal co-workers, but also those external persons within the community. The same internal effects on morale and capability concerns now trickle outside. This only becomes more complex as the person’s position and prominence in that community increases. The scars are made worse if that person does not go on to another position that affords them the same opportunity for involvement, which it rarely does, at least not with an equal ability for commitment.

In the OpenSolaris community we feel these RIFs significantly. One might argue that we should not, business is business after all, but the impact is inevitable. One can only hope that this impact is part of the consideration that goes into the decision to release that given employee. We have seen a great many leaders within our community come and go, both willingly and unwillingly. When combined with our current reliance on Sun leadership the resulting opinions and feelings are decidedly more pointed. As with the extreme example of Steve Jobs… is this good leadership or bad?

There are great opportunities and advantages to opening a company more and more… but there must also be reciprocal disadvantages, and this issue is chief among them. For customers and community members to not express disappointment, disapproval, or even the occasional rage would be inhuman.

OpenSolaris Governance Update

Sunday, January 18th, 2009

Since no one else is updating the community regarding OpenSolaris governance, I will…

Our Sun internal leader, Tim Cramer, has left Sun. According to his Facebook comments it appears he’s going to Dell of his own will. This was announced Jan 12th on the OGB-Discuss list. There has been no other announcement regarding the matter that I can find.

Apparently Vincent Murphy, Engineering Director at Sun Microsystems, will take over Tim’s duties.

In a bit of sad and concerning news, John Plocher, an active, respected, and energetic member of the OpenSolaris Governing Board has been RIF’d (Reduction In Force, the new term for “downsized”). This is yet another example of Sun letting go some of its best and most energetic people. (edited; see followup post)

Mr. Plocher is an excellent man, skilled engineer and a natural born leader. Any place that he goes will be blessed and privileged to have him, and I pray that he has the opportunity to be a voice of reason in the continued evolution of OpenSolaris. Companies looking for energetic technical community leaders should jump at the opportunity to snatch him out of the market.

The OGB’s term will end soon. Nominations will likely start up at the end of February and a new OGB will be sitting as of April 1st (if we follow this historical schedule). I am afraid for the project because so much damage has been done that I can’t think of anyone who would want to sit on the utterly useless board we have.

If you are interested in why our governing board is in such a predicament, feel free to watch my OGB Presentation at the SVOSUG in Feb of ’08 (skip into the video by 1 hour for my talk). Slides are available here. The presentation is as pertinent today as ever.

PS: If anyone thinks I’m being inflammatory in this post, don’t kid yourself, I’m biting my tongue. And if you’re an outsider, no I don’t work for Sun.

Coming to a bookstore near you….

Friday, January 16th, 2009

I wanted to comment on two forthcoming books:

The first is Sharon Veach’s Solaris Security Essentials. This is currently available on Safari as a rough cut and looks very good so far. Sharon is an excellent writer and the book will be a welcome addition to any administrator’s library.

The second is co-written by Nicholas A. Solter, Jerry Jelinek, and David Miner. I’m particularly excited for Jerry (I’m his biggest fan… GO ZONES!). Writing a book is no easy task, especially in editing and review. Based on the table of contents there is a huge quantity of content here and my hat is off to them for getting it all done in under 1 year. Trust me, writing is a labor of love… for the time you put into writing a book you would make more getting paid minimum wage at 7-11. This should be a great text to supplement the documentation and an excellent reference.

Jerry has a great post about writing the book in his blog.

Congrats to all!

Why I Hate EMC

Friday, January 16th, 2009

I do, I really hate EMC. And here is a great example of why:

If you’re reading this via an aggregator, here is the video embedded above: EMC Atmos featured on YouTube.

HE SAYS NOTHING!!! ABSOLUTELY NOTHING!!! What is the product? What does it actually do? How does it do it? THIS IS MARKETING BULLSHIT.

There, I said it. I am truly shocked that companies, in this day and age, can still get away with the old school marketing gibberish to which we became so accustomed thanks to the HPs of the world. IBM changed its ways, Sun changed its ways… gah. Really, am I the only one who goes nuts seeing product data sheets and so-called presentations which are just a lot of hot air?

EMC Atmos seems like an interesting product. My understanding is that it’s an outgrowth of UC Berkeley’s OceanStore, which I believe to be a milestone project in the history of computing storage, and thus Atmos is of great interest to me, but apparently the last entity I should rely on for information regarding it is EMC themselves.

Ok… rant over. I’ll bury this with a technical posting.

UPDATE NOTE: I want to be clear and apologize to the fellow in the video if he comes across this. This is nothing against his particular presentation; I’m using it as an example of EMC in general. It was clearly highly scripted and he himself probably doesn’t like it. Anyway, he did a fine job, it’s the messaging method that bothers me. There are plenty more like this on YouTube, or just go read EMC’s website.

OpenSolaris Storage Summit 2009

Friday, January 16th, 2009

Don’t forget…

The second OpenSolaris Storage Summit will be the day prior to USENIX FAST ’09, at the Grand Hyatt San Francisco. Feb 23rd, the Summit is where you want to be! This is a free event and lots of fun.

Register Here. Just add your name to the list and you’re good to go. I hope to see you there.

In related news… the SNIA Winter Symposium 2009 is coming up in just a week (January 20 – January 23). I’ll be speaking about Cloud Storage on Wed afternoon, so if you’re there please stop by and say hello!

DB2 Express C for Solaris/X86

Thursday, January 15th, 2009

DB2 has been off my radar for a while, but I circled back around due to my continuing rage at Oracle and their stupidity in not releasing Oracle 11g for Solaris/X86. To hell with ‘em… IBM has done a wonderful thing and released DB2 9.5 for Solaris/X86… what’s more, there is an Express C version. Express C is a thinned-down free version of DB2. This means that you can legally run a modern, enterprise grade database at no cost on the world’s best enterprise grade OS. DB2 and OpenSolaris make an awesome tag team. You can download it now: DB2 9.5 (Viper2) Express C

“But I don’t know DB2!” you say? Good news! Like Sun, IBM has an awesome campus evangelism effort, and there is a complete DB2 on Campus video series that will train you up. There is also an excellent free eBook: Getting Started with DB2 Express C. Pair that up with the industry’s best documentation effort (IBM Redbooks) and you can’t lose.

And Joyent fans… yes, DB2 will run beautifully in a Joyent Accelerator. Screw Oracle… let’s hear it for DB2!