Archive for October, 2008

Explore Your ZFS Adaptive Replacement Cache (ARC)

Wednesday, October 29th, 2008

Some time ago I wrote a tool which I call arc_summary. It is a kstat-based Perl application that runs some calculations and presents you with a pretty report of the ZFS ARC. The idea is to help you interpret the data more appropriately. Let’s look at some output:

benr@quadra ~$ ./arc_summary.pl
System Memory:
         Physical RAM:  4083 MB
         Free Memory :  130 MB
         LotsFree:      63 MB

ZFS Tunables (/etc/system):

ARC Size:
         Current Size:             1530 MB (arcsize)
         Target Size (Adaptive):   1555 MB (c)
         Min Size (Hard Limit):    382 MB (zfs_arc_min)
         Max Size (Hard Limit):    3062 MB (zfs_arc_max)

ARC Size Breakdown:
         Most Recently Used Cache Size:          100%   1555 MB (p)
         Most Frequently Used Cache Size:         0%    0 MB (c-p)

ARC Efficency:
         Cache Access Total:             2090556
         Cache Hit Ratio:      71%       1493620        [Defined State for buffer]
         Cache Miss Ratio:     28%       596936         [Undefined State for Buffer]
         REAL Hit Ratio:       70%       1472533        [MRU/MFU Hits Only]

         Data Demand   Efficiency:    98%
         Data Prefetch Efficiency:     1%

        CACHE HITS BY CACHE LIST:
          Anon:                        0%        12854                  [ New Customer, First Cache Hit ]
          Most Recently Used:         55%        825782 (mru)           [ Return Customer ]
          Most Frequently Used:       43%        646751 (mfu)           [ Frequent Customer ]
          Most Recently Used Ghost:    0%        3619 (mru_ghost)       [ Return Customer Evicted, Now Back ]
          Most Frequently Used Ghost:  0%        4614 (mfu_ghost)       [ Frequent Customer Evicted, Now Back ]
        CACHE HITS BY DATA TYPE:
          Demand Data:                60%        900081
          Prefetch Data:               0%        9547
          Demand Metadata:            38%        572127
          Prefetch Metadata:           0%        11865
        CACHE MISSES BY DATA TYPE:
          Demand Data:                 2%        12318
          Prefetch Data:              87%        524549
          Demand Metadata:             9%        55222
          Prefetch Metadata:           0%        4847
---------------------------------------------

First notice that you do not need to be root to run the report. I ran this on my personal workstation, an Intel QuadCore with 4GB of RAM.

So looking at the report, first I output some general memory data for you, then any ZFS tunables you placed in /etc/system (this does not use MDB, but may one day).

Then we look at the ARC Sizing data. The “Current Size” is how large the ARC really is, whereas the “Target Size” is a constantly changing number, like a stock price, of what the ARC thinks it should be. I also include the Min and Max hard sizes for ARC; these are tunable.

The ARC Size Breakdown shows the division of the cache between “Most Recently Used” (MRU) and “Most Frequently Used” (MFU). More on this later. When you see parentheses in the output, those are the variable names, for those using DTrace or getting into the code.

I wrote arc_summary to answer the question “Why is my ARC so damned big!?!” The ARC Efficiency section helps you determine whether or not you’re getting bang for your buck. ZFS ARC stats presented by kstat provide several breakdowns of cache hits, so this section really runs computation on those values to present them in a sane manner.

In the report above, we have a cache hit ratio of 71% and a “REAL” hit ratio of 70%. The “Cache Hit Ratio” and “REAL Hit Ratio” differ in that ZFS considers an Anon buffer as a hit… but it’s not really, so I count only MRU and MFU hits for the “REAL” ratio.
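
As a rough sketch of the arithmetic, the three ratios can be reproduced from the counters in the sample report. The variable names here are illustrative, not the kstat names, and the truncation to whole percentages is an assumption suggested by the fact that 71% and 28% add up to 99%:

```python
# Counters copied from the sample report above (cumulative since boot).
hits = 1493620        # "Cache Hit Ratio" count
misses = 596936       # "Cache Miss Ratio" count
mru_hits = 825782     # hits on the Most Recently Used list
mfu_hits = 646751     # hits on the Most Frequently Used list


def pct(part, whole):
    """Whole-number percentage, truncated (assumed to match the report)."""
    return part * 100 // whole


total = hits + misses                              # "Cache Access Total"
cache_hit_ratio = pct(hits, total)
cache_miss_ratio = pct(misses, total)
real_hit_ratio = pct(mru_hits + mfu_hits, total)   # MRU/MFU hits only

print(f"Cache Access Total: {total}")
print(f"Cache Hit Ratio:    {cache_hit_ratio}%")
print(f"Cache Miss Ratio:   {cache_miss_ratio}%")
print(f"REAL Hit Ratio:     {real_hit_ratio}%")
```

The gap between the two hit ratios is small on this box because hits outside the MRU and MFU lists make up well under 2% of all hits.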

We then break down Demand vs Prefetch efficiency. ZFS aggressively caches data via prefetch; in the example above, only 1% of prefetch data requests were already in the cache, while 98% of demand data requests (data that was explicitly asked for) were served from the cache. So in this case, prefetch isn’t helping me.
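
A minimal sketch of how those two efficiency figures fall out of the report, assuming “efficiency” means hits divided by hits plus misses for that data type. That formula is an assumption, but it does reproduce the 98% and 1% shown above:

```python
# Data-type counters copied from the sample report above.
demand_data_hits = 900081
demand_data_misses = 12318
prefetch_data_hits = 9547
prefetch_data_misses = 524549


def efficiency(hit_count, miss_count):
    """Truncated percentage of requests of one type served from cache."""
    return hit_count * 100 // (hit_count + miss_count)


demand_eff = efficiency(demand_data_hits, demand_data_misses)
prefetch_eff = efficiency(prefetch_data_hits, prefetch_data_misses)

print(f"Data Demand   Efficiency: {demand_eff}%")
print(f"Data Prefetch Efficiency: {prefetch_eff}%")
```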

Perhaps the most interesting information here is the “Cache Hits by Cache List”. ARC maintains multiple cache lists, including:

  • Most Recently Used: Cache hit once.
  • Most Frequently Used: Cache hit multiple times.
  • Most Recently Used Ghost: Objects that were in MRU cache, but removed to save space.
  • Most Frequently Used Ghost: Same but for MFU.

These ghost lists are magic. If you get a lot of hits to the ghost lists, it means that ARC is WAY too small and that you desperately need either more RAM or an L2ARC device (likely an SSD). Please note, if you are considering investing in L2ARC, check this FIRST.

So by looking at the spread between MRU and MFU hits, we get an idea of how much the cache is rolling over…. 55% of hits are MRU (1 hit), whereas 43% are MFU (multiple hits), so, not bad. The higher MFU is, the better, but this all depends on your workload. Workstations will tend toward MRU, but on one of my servers, for instance, the ARC was 99% MFU.
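
To make the spread concrete, here is a small sketch that recomputes the per-list shares from the sample report and adds up the ghost hits. The dictionary keys are just labels for this example, and no cutoff for “a lot” of ghost hits is implied; that judgment depends on your workload:

```python
# "CACHE HITS BY CACHE LIST" counters from the sample report above.
hits_by_list = {
    "anon": 12854,
    "mru": 825782,
    "mfu": 646751,
    "mru_ghost": 3619,
    "mfu_ghost": 4614,
}

total_hits = sum(hits_by_list.values())

# Truncated whole-number share of hits for each list.
share = {name: count * 100 // total_hits for name, count in hits_by_list.items()}

# Ghost hits: requests for data that had been evicted to save space.
ghost_hits = hits_by_list["mru_ghost"] + hits_by_list["mfu_ghost"]
ghost_pct = ghost_hits / total_hits * 100

for name, value in share.items():
    print(f"{name:10s} {value:3d}%  {hits_by_list[name]}")
print(f"ghost hits: {ghost_hits} ({ghost_pct:.2f}% of all hits)")
```

On this workstation the ghost lists account for roughly half a percent of hits, which fits the reading that the ARC is not starved for space.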

Looking further, we can see what type of requests are hits or misses. Notice that cache hits are 60% demand (explicitly requested) data blocks, and 38% demand metadata…. prefetch isn’t helping on this box.

It’s all really pretty self-explanatory; if it’s not, let me know and I’ll do an ARC exposé, but I tried to create the output report to be pretty intuitive.

Please note, this report is cumulative since boot. It should complement the arcstat tool. arcstat can tell you what is happening, arc_summary can tell you what has been happening. Any serious ZFS deployment should have both of these bad boys in its toolbag.

On a closing note… I hope this tool helps you appreciate the innovation of the ARC. Designed at IBM (IBM ARC page) and then implemented and improved for ZFS, ARC is elegant, powerful and extremely efficient! I hope my tool can help you better appreciate this amazing innovation in some small way.

Best. Wife. Ever.

Wednesday, October 29th, 2008

I’m older… it won’t stop. But my wife makes me feel better with kickassness. Behold my “new” mint condition Faber-Castell 2/83N.

To complement it, I just recently bought a copy of Euclid’s The Elements. I love mathematics…. I suck at it, but who cares, I enjoy it.

Blastwave Saga Continues

Friday, October 10th, 2008

You’ll recall my “unofficial” Blastwave update in August. The separation of Phil and Dennis seems to be solidifying a bit. The CSW Project (Phil and co.) is now up and running at OpenCSW.org. Minus the Blastwave name and CSS, it’s more or less the same site and content you had/have at blastwave.org.

OpenCSW.org includes a History page, which tells the CSW side of the breakup, minus the unverified assertions by Dennis Clarke that Phil Brown tried to set up with a company in Europe. To this end, Dennis has, naturally, written a response: A simple response to the OpenCSW crew, a blog entry that is simply a re-post of a CSW Maintainers list thread.

If you are interested in the unfolding soap opera, read and enjoy. If you simply enjoy using CSW/Blastwave software, nothing has yet changed enough for you to notice.

Please note, if you are new to this saga and read through the links above, bear in mind why Solaris 8 support is such a big issue. Most of the existing Solaris install base is still running Solaris 8 on SPARC. Furthermore, because of Solaris binary compatibility, anything built on Solaris 8 (Solaris 2.6 for that matter) will run on Solaris 9, Solaris 10, Nevada or OpenSolaris… The only downside is that building software on Solaris 8, due to its age, can be a bit more painful than on Solaris 10. The argument then falls into two camps: deal with the pain and be available to more users, or dump older releases to make builds easier and thus hope to attract new maintainers. In a nutshell anyway. Enjoy.

The Cascading Crash and You

Thursday, October 9th, 2008

I’ve been surprised how few people are blogging about the current market situation. It’s coming up in casual conversation “around the water cooler” and dominates the news channels, but tech outlets aren’t jumping on the bandwagon.

Today on CNBC’s “Fast Money”, following today’s almost 700 point drop, bringing the Dow under 9,000, I heard the first mention of the word “crash”, or more properly what they called a “cascading crash”. IBM perked up briefly but Forbes is saying IBM won’t save tech. Sun shares are back to pre-reverse split levels at $5.21 a share, an 8% drop that’s in line with most other tech stocks today.

Despite the markets falling apart it seems to me that most “joe 4 packs” (I drink Guinness) are trying to stay calm. The big question that folks are trying to avoid is at what point we will see jobs being cut as a direct result of the markets. So far the people that I’m talking to aren’t freaking out but seem secretly concerned.

The industry wisdom seems to be that this is yet another situation where using technology as a competitive weapon and a means to reduce cost may save the day. So far that’s what “they” are saying but we’re not seeing the effect quite yet.

As for myself, I’ve been blessed so far. When we recently bought our house we liquidated most of our equities and now have some to give back. I’m jumping on the Buffett bandwagon and looking for value, given that we have very little now to invest, and yesterday picked up some GE shares at a great price. Even though GM and Ford are in the tank and dropping quickly, GE looks like a great value for a long horizon investment.

So drop a comment, as a time capsule to ourselves…. how are you feeling about the current financial crisis and its impact on your livelihood?

Storage Trends from SNIA SDC

Wednesday, October 1st, 2008

The Storage Networking Industry Association’s (SNIA) Storage Developer Conference (SDC) is, as the fancy name suggests, not a place for storage hobbyists or the faint of heart. Attendees are leaders in our industry, highly informed and knowledgeable. If they are interested in it, we all will be soon. If you follow the storage press at all, the two big things on their minds won’t surprise you:

  1. De-duplication
  2. Solid State Disk (SSD)

From performance talks, to corruption analysis talks, to ZFS talks, to NFSv4 talks, every session included a slide on, or drew a question about, both of these. Frankly, there were very few answers. The notable exception was Sun’s “hybrid storage architecture” for ZFS (for those in the know, this is L2ARC and ZIL offload, which are put on special SSDs). Most of the talks only noted “SSD will change everything… it’s too early to tell how.” Given that the concern of the show is largely primary storage, not secondary backup, de-dup constantly came up but rarely had a place.

If de-duplication is a new term for you, here’s the quick and dirty pitch. Imagine having to architect backups for 300 helpdesk PCs, all running a standardized Windows XP, office stack, plus helpdesk support and naturally other user applications. Let’s say the average PC has 80GB of data on its local drive. So that’s 300 * 80GB (24TB) to back up, perhaps nightly. A nightmare. Historically, you’d reduce the backup load either by putting user home directories on a centralized file server and backing up only the file server, not the PCs, or by excluding paths such as C:/Windows (or whatever the hell they call it now). De-duplication typically uses hashing algorithms, either on the client or on the backup server, to avoid storing duplicate data blocks. So that means you only back up one copy of Windows XP, and then 299 references to it. If someone sends out a PDF of the company handbook that’s 5MB, and there are 300 local copies of it, that’s 1.5GB of the same file, but with de-duplication we store only a single 5MB file plus references to it.
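
The hashing idea can be sketched in a few lines. This is a toy, in-memory illustration and not how any particular backup product works: the `DedupStore` class, the 4KB block size, and the choice of SHA-256 are all assumptions made for the example, and the “handbook” is scaled down to 16KB so it runs instantly:

```python
import hashlib


class DedupStore:
    """Toy block store: each unique block is kept once, keyed by its hash."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}   # digest -> block bytes (stored once)
        self.files = {}    # file name -> ordered list of digests (references)

    def put(self, name, data):
        refs = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only if unseen
            refs.append(digest)
        self.files[name] = refs

    def get(self, name):
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


# A stand-in "company handbook": 16KB made of four distinct 4KB blocks.
handbook = b"".join(i.to_bytes(4, "big") for i in range(4096))

store = DedupStore()
for pc in range(300):
    store.put(f"pc{pc:03d}/handbook.pdf", handbook)

logical_bytes = 300 * len(handbook)
print(f"logical size: {logical_bytes} bytes")
print(f"stored size:  {store.stored_bytes()} bytes")
```

Three hundred copies go in, but only one copy’s worth of blocks is stored; every other copy is just a list of references, and `store.get()` can still rebuild any of them.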

From the example you can see that customers backing up Oracle databases or customized purpose-built servers might not be in dire need of this technology (although they are interested too), but if you’re backing up server farms or desktop systems this is something you can’t wait another second to get your hands on; especially if you’re backing up to tape!

I should note, de-dup is becoming more than just a backup technology. Storage admins see applications for file servers and other applications. I’m certain that in 5 years de-duplication methodology will be used in ways I’d laugh at today.

As for SSD: it’s coming. I remember 10 years ago in a lab where we had a “Solid State Disk”, which in the pre-flash era meant a box with bank upon bank of RAM and a big battery. Today SSD is cheap and getting cheaper. But how will it be used?

Today we have the concept of “tiered storage”. This means different things based on who you talk to. In some cases, such as Pillar Data, this is done by partitioning drive cylinders so that tier 1 data is on the outer (faster) tracks and tier 2, 3, 4 on the inner (slower) tracks. In other cases this means putting important fast-access data on smaller 15K or 10K RPM FC or SAS disks as “tier 1”, and bulk data on larger “nearline” 7,200 RPM SATA disks. For customers using HSM (Hierarchical Storage Management) you can even automate the data migration back and forth across tiers, all the way out to tape, which was until recently cheaper per gig than disk.

So many storage administrators and architects seem to see SSD pushing into tier 1 and pushing 15K spinning media down the stack. Instead of Fast, Slow, Tape, you get Super-Fast, Fast, Slow and potentially just dump tape.

I know I’m a zealot, but Sun really is leading the charge here. The Hybrid Storage Pool architecture is really brilliant because it views SSD not as faster disks, but rather as slow (relatively of course) non-volatile memory. Traditionally you have an in-memory filesystem cache (ZFS’s is called “ARC”); data flows through the cache and eventually is ejected to make room for fresher data, meaning that if you call that data again you go out to disk. ZFS’s L2ARC (Level 2 ARC) extends your in-memory disk cache using SSD, so if you go back for data you don’t have to go all the way out to disks. On busy file servers this is a massive win! A 64GB SSD is a really small disk, but as a secondary disk cache it’s massive! Plus, there is no management involved on the administrator’s part, no data policy or data classification to work out; the filesystem handles it for you.

Sun’s other component to the ZFS Hybrid Storage Architecture is ZIL Offload. Most data access is asynchronous and can be nicely cached, with writes flushed to disk when it’s convenient. However, some applications such as databases or NFS do synchronous (O_DSYNC) IO; this flag requires that the filesystem immediately flush the data to stable storage. On a busy file server this is a performance killer. ZFS ZIL (ZFS Intent Log) is where these synchronous writes go; by putting those writes on super-fast SSD you get several orders of magnitude performance improvement without relying on things like RAID Controller Write Back Caches.
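
The second-level cache idea can be illustrated with a toy model. To be clear about the assumptions: this is not how ZFS implements L2ARC (the real ARC is adaptive rather than plain LRU, and L2ARC is populated by its own feed mechanism rather than directly on eviction). The `TwoTierCache` class and its slot counts are invented for the example; the only point is to show a re-read being served from the SSD tier instead of from disk:

```python
from collections import OrderedDict


class TwoTierCache:
    """Toy model: small RAM cache, larger SSD cache, slow disks behind both."""

    def __init__(self, ram_slots, ssd_slots, disk):
        self.ram = OrderedDict()   # first-level cache (stands in for ARC)
        self.ssd = OrderedDict()   # second-level cache (stands in for L2ARC)
        self.ram_slots = ram_slots
        self.ssd_slots = ssd_slots
        self.disk = disk           # dict standing in for the pool's disks
        self.stats = {"ram": 0, "ssd": 0, "disk": 0}

    def read(self, key):
        if key in self.ram:                 # fastest path
            self.ram.move_to_end(key)
            self.stats["ram"] += 1
            return self.ram[key]
        if key in self.ssd:                 # evicted from RAM, still on SSD
            value = self.ssd.pop(key)
            self.stats["ssd"] += 1
        else:                               # all the way out to disk
            value = self.disk[key]
            self.stats["disk"] += 1
        self._insert_ram(key, value)
        return value

    def _insert_ram(self, key, value):
        self.ram[key] = value
        if len(self.ram) > self.ram_slots:
            old_key, old_value = self.ram.popitem(last=False)  # oldest out
            self.ssd[old_key] = old_value                      # keep on SSD
            if len(self.ssd) > self.ssd_slots:
                self.ssd.popitem(last=False)


disk = {n: f"block-{n}" for n in range(10)}
cache = TwoTierCache(ram_slots=2, ssd_slots=4, disk=disk)

# Blocks 0 and 1 fall out of the tiny RAM cache, then get read again.
for n in (0, 1, 2, 3, 0, 1):
    cache.read(n)

print(cache.stats)
```

With only the RAM tier, the two re-reads would have gone back to disk; with the SSD tier behind it they do not.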

Since we’re talking about SSD, let me point out that not all SSDs are created the same. There are two main types of SSD on the market right now: MLC and SLC. Here’s the 60 second explanation:

  • Single-Level Cell (SLC): These flash devices have higher performance, more write/erase cycles and thus greater endurance, use less power, but cost much more. These are generally considered “Enterprise Grade SSD”.
  • Multi-Level Cell (MLC): In contrast to SLC, these devices have lower performance, less endurance, but offer much higher density and lower cost per bit. If you see a “cheap” $300 SSD at Fry’s or NewEgg it’s almost certainly MLC. These are generally considered “Consumer Grade SSD”.

If you see a Sun presentation on Hybrid Storage, you’ll see them refer to these as “Read Biased” (MLC, slower but higher capacity) and “Write Biased” (SLC, faster but less capacity). By using the appropriate technology in the appropriate role they significantly reduce cost for an SSD deployment. If you look at everyone else out there just viewing SSD as “fast disk”, the decision between SLC and MLC is really just a matter of cost; if you can afford SLC great, if not MLC, or perhaps even sub-tiering SLC to MLC SSD.

So that’s de-dup and SSD. If you haven’t heard of these, you will. Familiarize yourself with the basics now and you’ll be better prepared for the future.

On a closing note. I talked to several people about SMART data. I’m shocked by how many people tell me to ignore SMART data as untrustworthy and unreliable. I was hoping someone at the show would disagree… I was disappointed. Most other experts agree: vendors don’t trust SMART data and in some cases outright “fudge” the data or at the least disregard conclusions based on the data. One person remarked that most drives sent to Seagate due to a SMART suggested failure are simply scrubbed, cleared, and re-shipped. So, the belief that SMART data is something to be seriously monitored by admins continues. If you have it, nifty, but if not, oh well. As for me… I love telemetry, so SMART still has a warm spot in my heart, wrinkles and all.

UPDATE: Just hours before I wrote this, Mr. Harris of StorageMojo wrote about NetApp’s efforts to bring de-dup to primary storage.

Exciting Goings On

Wednesday, October 1st, 2008

There are a lot of exciting things going on… here are some I think you might enjoy.

Iron Man releases on DVD & BluRay today. At Target BluRay was sold out by lunch even out here in Tracy (i.e. the sticks). If you’ve got a PS3 nab it; if you don’t have a PS3, buy one, it’s a tremendous value if you compare it against the cost of a vanilla BluRay player.

Mark Driscoll of Mars Hill Church in Seattle presents his new sermon series: The Peasant Princess: A study of the Song of Songs. If you don’t know what the Song of Songs is, it’s the book of the Bible that focuses on sex. Ya, you heard me, one entire book of the Bible is all about sex, as in having it and having good sex.

Christianity is about living life the way our Creator intended us to. God put nerve endings in fun places for a reason! Most preachers won’t touch the Song of Songs with a 20 foot pole because they can’t read it without blushing. Shame on them! It’s a wonderful book, an explicit book, and lays out the great wonders that are intended for a married (maaaaaaarrrrrrrrriiiiiiieeeeeeeeeddddddd!) couple to draw closer to each other than you thought was even possible.

Mark Driscoll has the guts to take on this book. I encourage you, sit down with your wife or husband and watch these sermons together as a couple; Christian or not you’ll learn something and grow closer together and hopefully even have a new perspective on the life that Jesus Christ and God the Father have in store for us. Watch them all here.

The first ever Grand Prix of Singapore was this weekend, and it was the first ever night race for Formula 1. And what a race it was! Absolutely beautiful, the cars look better than ever in those lights. Hit YouTube for all the video you can get if you missed it!

If you missed the news, back in June the next installment of Final Fantasy Tactics was given to the world: FFT A2 for the Nintendo DS. Any FFT die hards (such as Tamarah and myself) will want to pick it up. Of course, a PSP re-release of the original FFT came out a while back; we both used it as an excuse to pick up PSPs because it added multiplayer!! Tamarah and I love goin’ head to head on the plains of Ivalice.