IO Benchmarking: How, Why and With What
Posted on May 22, 2007
I’ve been meaning to bring up the dreaded benchmark topic for some time. Benchmarking is like some type of taboo… everyone wants to do it, but you get chastised if you talk about it. That, of course, only fuels the desire that much more. But how? What do you use? When do you use what tool? It’s really hard to find good information, so here is my attempt to shed some light on things.
Before we begin, let’s look at the word itself: benchmark. The term comes from woodworking and other such crafts. When a craftsman is making, for instance, table legs, he first carefully cuts the wood to size, measuring for exactness. Once the first piece is done properly he can take that funny flat pencil, make a mark on his bench, and then use that mark to measure up all the following pieces to speed the process along. Based on the mark he knows whether the other pieces are too long or too short and makes the needed alterations. The mark is not a measurement per se, but rather a quick method of judging the difference between things that should be similar.
Now we move into the storage world. The way in which we put that mark on our bench is different, but the purpose is exactly the same. Benchmarks are not about judging how fast something is! They are for judging change! Let me make that crystal clear… if you run some piece of software, a benchmark suite, against some piece of storage and definitively say “It’s X fast,” you will always be wrong and people will always mock your numbers. The reason for this is self-evident: any performance you see is based on the total environment in which it is run, which can almost never be duplicated perfectly. In the storage world there are countless variables that mix things up: cabling, HBAs, firmware, disks, disk firmware, OS, drivers, CPUs, memory, and on and on. Just a single rev of firmware on an HBA or BIOS can have a major impact on your performance, so just because one person can get 180MB/s sequential writes out of his 3Par doesn’t mean that you’re going to get that same number.
Despite all that common sense, we all still want one magical number that describes our l337 setup. But those numbers don’t work either, because of various factors. Is the data sequential or random? Read, write, or both? What’s the mix? Big files or small? No handful of numbers will make it clear. All these things contribute to the “benchmarks are useless” flamewars that occur all the time.
So, what then? Don’t benchmark? Stay oblivious to your performance thresholds and just rely on performance counters? Let’s get real: we all want to know what a given solution can do, and so even if we’re afraid to tell anyone what our numbers are, we still run them.
The following is a rundown of the various benchmarking methods that I find to be common and that I myself use, including when to use each, how, and why.
dd
Do you know how benchmarks really work? Seriously, do you? Or do you just download some software, run it, and trust it? Everyone should start with our old friend dd. Using it is simple: read from here and write to there in some block size. This is perfect for determining sequential performance and getting a baseline “how fast can I read? how fast can I write?” sort of picture.
Start with a small sequential write block size, like this: time dd if=/dev/zero of=/mystorage/testfile bs=8k
Let that run for about 30 seconds. While it runs, watch the IO using tools like fsstat or iostat. When you hit 30 seconds, stop it (^C) and do the math: n blocks were transferred at 8K, so multiply 8K by n, divide by the exact time reported in the output, and presto, we have a number. Alternatively, just divide the output file size by the time.
root@aeon ~$ time dd if=/dev/zero of=/iscsi/benchmark/testfile bs=8k
^C
97810+0 records in
97810+0 records out

real    1m52.933s
user    0m0.048s
sys     0m2.487s
root@aeon ~$
root@aeon ~$ ls -lh /iscsi/benchmark/testfile
-rw-r--r--   1 root     root        764M May 22 03:42 /iscsi/benchmark/testfile
root@aeon ~$ bc
60+52.9
112.9
764/112.9
6
Now, do it again, but this time with a 32K block. Then with a 128K block, a 512K block, a 1MB block, an 8MB block. Does the number change? How does it change? All of this is important, and something that storage admins rarely understand well until they do this kind of exercise. This is exactly why I think everyone should start with dd. Once you understand sequential write performance at various block sizes, turn it around and do it all again, but this time reading from a file to /dev/null. How does that look?
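If you’d rather not run each pass by hand, here’s a minimal sketch of the write sweep (the target path is the hypothetical one from above; the counts are paired with the block sizes so every pass moves roughly 1GB):

#!/bin/sh
# Sweep dd write block sizes; each pass moves roughly 1GB of data.
# /mystorage/testfile is a hypothetical target; substitute your own.
for run in "8k 131072" "32k 32768" "128k 8192" "512k 2048" "1024k 1024" "8192k 128"; do
    set -- $run
    echo "=== bs=$1 ==="
    time dd if=/dev/zero of=/mystorage/testfile bs=$1 count=$2
    rm -f /mystorage/testfile
done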
These types of tests give you a very good idea of your best possible sequential performance, and since most storage performs sequential IO best, this is normally the number I use for my “My Thumper can push line speed!” type statements to management.
But let’s be reminded right here and now… real-world applications almost never do huge quantities of sequential reads or writes. It’s normally either very random or, at best, a mix of random and sequential IO. Think of it like this: which is faster, a Top Fuel dragster or a Formula 1 car? The dragster is the fastest, but it can’t turn; the Formula 1 car can turn like crazy but has to slow down for the corners. This is, imho, a good way to think about sequential vs random IO. If you want to see this in action, watch the 24 Hours of Le Mans, where huge Can-Am cars scream down the straight at 225MPH but then get passed by Porsche 911s in the corners.
dd is also handy for testing raw devices. Benchmark /dev/rdsk/c1t1d0s2, then create a filesystem on it and benchmark using a file on that. Do tests against the character device (/dev/rdsk) and the block device (/dev/dsk) without a filesystem. On and on; there are lots of fun tests you can do with this simple tool that will help you better understand the underlying inner workings.
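As a sketch, with hypothetical device names (substitute your own, and be careful: writing to a raw device destroys whatever is on it), compare a 100MB read through each path:

# Raw (character) device: IO goes straight to the device, no buffer cache.
time dd if=/dev/rdsk/c1t1d0s2 of=/dev/null bs=128k count=800

# Block device: the same read passes through the OS buffer cache.
time dd if=/dev/dsk/c1t1d0s2 of=/dev/null bs=128k count=800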
If you really want to get into it, pick a single disk, download the spec sheet for the drive model from the manufacturer, and compare your numbers. Then partition the disk so that there is a 200MB partition at the beginning of the disk (starting at sector 0) and a 200MB partition at the end, and re-run your tests against those two partitions, as sketched below. Very quickly you’ll find that storage topics that might have been abstract or foggy start becoming very clear, and terms like “sector”, “zoning”, “cylinder”, “full bore stroke”, “seek time”, etc., all start making a lot more sense, because you can measure these things without some benchmarking tool abstracting all those details away from you.
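A sketch of that partition test, assuming slice 0 ended up at the start of the disk and slice 7 at the end (your slice numbers will differ):

# 200MB partition on the outer (start-of-disk, fastest) cylinders:
time dd if=/dev/rdsk/c1t1d0s0 of=/dev/null bs=128k

# 200MB partition on the inner (end-of-disk, slowest) cylinders:
time dd if=/dev/rdsk/c1t1d0s7 of=/dev/null bs=128k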
One word of caution: if you write data from /dev/zero to a filesystem with compression, you might see unbelievably high numbers thanks to smart compression. In these cases use /dev/urandom instead.
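For example (keeping in mind that /dev/urandom itself is fairly slow, so on fast storage you may want to stage a random file first so the random number generator isn’t the bottleneck):

# Incompressible data straight from the kernel RNG:
time dd if=/dev/urandom of=/mystorage/testfile bs=8k

# Or pre-generate a 512MB random file, then write from that:
dd if=/dev/urandom of=/tmp/random.dat bs=1024k count=512
time dd if=/tmp/random.dat of=/mystorage/testfile bs=8k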
IOZone
IOZone can be your best friend or your worst enemy. It gives you piles of output, but what does it all mean?
root@aeon ~$ iozone -a -f /iscsi/benchmark/testfile
(snip)
                                                      random   random     bkwd   record   stride
     KB  reclen    write  rewrite     read   reread      read    write     read  rewrite     read   fwrite frewrite    fread  freread
     64       4   100782   392899  1066277  1330599  1016473   424101   418281   329860   942552   281899   342403   470880   809281
     64       8   149508   561495  1308191  2210490  1779559   639776   819484   429479  1359677   333377   556799   853867  1206279
     64      16   224481   726304  1773010  3187118  2666736   902870   954801   512190  1681127   460485   615747   985460  1489957
     64      32   336690   865556  1186173  3984255  3213648   968640  1000206   640345  1394066   321463   593094   889313  1458013
     64      64   640728   984604  2065562  3788700  3194489  1048299  1230654   914748  1162367   374341   499802   581333  1392128
    128       4    48376   479597   794708  1293454  1075853   465638   687865   412859   653222   348679   407500   612489   921157
    128       8    95384   680542  1076199  2067003  1775777   703634   883158   636917  1540598   460509   566358   735581  1293810
    128      16   180520   920588  1112754  2970023  2615679   962219  1085347   790598  1968806   511767   684483   795485  1544370
    128      32   289588  1067558  1346983  3287511  2777119  1196016  1142954   752991  1910738   508116   735558   770868  1599215
That output can be confusing, but if you worked through the dd method above you suddenly realize that what IOZone is doing for you is automating all those metrics on your behalf. Above we see it creating a 64KB file (then a 128KB file, and so on) using record sizes of 4K, 8K, 16K, 32K, and 64K, doing read and write tests (“read” and “write” are sequential), random read and write, and the fread/fwrite family (buffered stream IO).
IOZone, therefore, is great for finding the sweet spot of your storage solution. You can find out how far you can push your caches before they become ineffective, or see what the optimal usage pattern is for your solution. Based on that, you can tweak caching or presets and see exactly how the performance range shifts in reaction.
When using IOZone, don’t get caught up in all its fancy options; -a pretty well has you covered. Just make sure that every test run is done identically.
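For example, something as simple as keeping the invocation fixed and saving each run’s output makes before/after comparison honest:

iozone -a -f /iscsi/benchmark/testfile > iozone-before.txt
# ...make your one change (firmware, driver, tuning knob)...
iozone -a -f /iscsi/benchmark/testfile > iozone-after.txt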
Now… before we move on I should address the cache problem. If you are doing these types of tests, where you hit a file over and over again, you want to ensure that your filesystem’s caching isn’t over-inflating your numbers. This is especially true when you want to see the performance of the disk solution rather than the filesystem. In these cases you should bypass the cache, either by using the raw character device (/dev/rdsk) or by mounting the filesystem with DirectIO (mount -o forcedirectio,…). Some benchmarks attempt to defeat the cache by allocating files larger than memory, but on a box with 16GB of memory that’s a bit ridiculous.
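A sketch of both approaches on Solaris/UFS (the remount syntax and IOZone’s -I direct-IO flag are from the versions I’ve used; check your man pages):

# Remount the filesystem with DirectIO so IO bypasses the page cache:
mount -o remount,forcedirectio /iscsi/benchmark

# Or ask IOZone itself to use direct IO where the filesystem supports it:
iozone -a -I -f /iscsi/benchmark/testfile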
Bonnie++
Bonnie++ has become very popular because it is that magical benchmark that plops out a nice number you can print on a shirt. Because of this, it’s both loved and hated. Either way, it’s the de facto grand-daddy of macro IO benchmarks.
root@aeon ~$ bonnie++ -u benr -d /iscsi/benchmark/ -r 512
(snip)
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
aeon             1G  7594   7  8317   2  2475   0 10840  13 11139   1  74.2   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 28164  97 +++++ +++ +++++ +++ 17849  94 +++++ +++ +++++ +++
I think Bonnie++ is a great tool for getting a general feel for a given configuration. The piles of data provided by tools like IOZone help you understand performance in a variety of scenarios, whereas Bonnie++ glosses over the details and gives you a handful of numbers. For this reason I recommend Bonnie++ when making non-tuning changes, such as upgrading your Fibre Channel HBA or running a new kernel or driver.
Remember, macro benchmarks are just that: high-level, wide-scope measurements. Used simply to see whether some change had an impact, Bonnie++ is a great friend to have available.
As an aside, am I the only one who would rather it said “Write” instead of “Output” and “Read” instead of “Input”? That’s always annoyed me.
One thing to be careful about when using Bonnie++ is its need to write twice as much data as you have RAM. That’s an attempt to eliminate cache effects. However, if you have 1GB or more of memory, that can really slow things down. I recommend disabling caching yourself (mount with ‘forcedirectio’) and using a more sensible size for your test, such as 1GB.
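That combination looks something like this (same hypothetical forcedirectio remount as above; -s is the test file size in MB and -r tells Bonnie++ how much RAM to assume):

mount -o remount,forcedirectio /iscsi/benchmark
bonnie++ -u benr -d /iscsi/benchmark -s 1024 -r 512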
FileBench
FileBench is an extremely powerful tool for both macro and micro benchmarking. Its great strength is its flexibility. Rather than running standard “as fast as you can!” tests, it processes a given, defined “workload”. Several standard workloads are provided with FileBench, and you can easily extend or modify them to suit your specific needs.
Let’s be more clear. Other tools will tell you how fast you ran this op or that, but specific tools need to be written to do more complex testing. FileBench provides a framework so that you can avoid writing your own tests in C or Perl or whatever. Its workloads follow patterns of IO and thus better represent real-world conditions. Just knowing that writing is fast isn’t good enough; you really want to model activity such as a mail server that does a bunch of getattrs, one or more opens, some writes, some appends, a close, and then repeats. All of this is just a simple workload in FileBench. Here is an excerpt of the “varmail” workload (varmail.f):
define process name=filereader,instances=1
{
  thread name=filereaderthread,memsize=10m,instances=$nthreads
  {
    flowop deletefile name=deletefile1,filesetname=bigfileset
    flowop createfile name=createfile2,filesetname=bigfileset,fd=1
    flowop appendfilerand name=appendfilerand2,iosize=$meaniosize,fd=1
    flowop fsync name=fsyncfile2,fd=1
    flowop closefile name=closefile2,fd=1
    flowop openfile name=openfile3,filesetname=bigfileset,fd=1
    ...
The IO pattern above mimics what you’ll really see in the world. FileBench thus allows us to duplicate IO patterns in the lab without having to reproduce the exact conditions (i.e., setting up a mail server and hitting it just to examine IO behaviour). Let’s see a run in action:
root@aeon bin$ ./filebench
filebench> load varmail
filebench> run
 3000: 33.011: Fileset bigfileset: 1000 files, avg dir = 1000000.0, avg depth = 0.5, mbytes=15
 3000: 33.260: Creating fileset bigfileset...
 3000: 59.883: Preallocated 812 of 1000 of fileset bigfileset in 27 seconds
 3000: 59.884: Creating/pre-allocating files
 3000: 59.884: Starting 1 filereader instances
 3001: 59.897: Starting 16 filereaderthread threads
 3000: 60.903: Running...
 3000: 122.423: Run took 60 seconds...
 3000: 122.435: Per-Operation Breakdown
closefile4               32ops/s   0.0mb/s      0.0ms/op        7us/op-cpu
readfile4                32ops/s   0.5mb/s      0.0ms/op       49us/op-cpu
openfile4                32ops/s   0.0mb/s      0.1ms/op       46us/op-cpu
closefile3               32ops/s   0.0mb/s      0.0ms/op        8us/op-cpu
fsyncfile3               32ops/s   0.0mb/s    270.9ms/op      156us/op-cpu
appendfilerand3          32ops/s   0.5mb/s      0.1ms/op       68us/op-cpu
readfile3                32ops/s   0.5mb/s      0.0ms/op       49us/op-cpu
openfile3                32ops/s   0.0mb/s      0.1ms/op       46us/op-cpu
closefile2               32ops/s   0.0mb/s      0.0ms/op        8us/op-cpu
fsyncfile2               32ops/s   0.0mb/s    229.0ms/op      151us/op-cpu
appendfilerand2          32ops/s   0.5mb/s      0.0ms/op       47us/op-cpu
createfile2              32ops/s   0.0mb/s      0.1ms/op       76us/op-cpu
deletefile1              32ops/s   0.0mb/s      0.1ms/op       60us/op-cpu
 3000: 122.435: IO Summary: 25009 ops 414.1 ops/s, (64/64 r/w) 2.0mb/s, 1571us cpu/op, 124.8ms latency
 3000: 122.435: Shutting down processes
You can see that we get a really nice breakdown of stats per operation. We can see clearly from the above that fsyncfile is my pain point.
My only caution is to avoid reading too much into the IO Summary; in too many cases I’ve found it to be misleading.
Personally, I look at FileBench not as a “benchmark” in the traditional sense, but rather as a workload generator. It’s very good at generating IO patterns. You can then use other tools, such as iostat, vmstat, DTrace, etc., to view the impact a given workload exerts on your configuration.
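For example, a simple pattern I use (the iostat flags here are the Solaris extended-device form; substitute your platform’s equivalent):

# Terminal 1: generate the load.
./filebench
#   filebench> load varmail
#   filebench> run

# Terminal 2: watch per-device throughput and service times while it runs.
iostat -xn 5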
Network Testing for IP Storage
When you test IP storage technologies such as NFS, AFS, or iSCSI, you’ll want to measure the speed from here to there. The best tools I’ve seen for this are Pathload and Pathrate. Both of these tools have a “sender” and a “receiver” component which work together on the two ends of the connection being tested, so you need access to both systems. Whenever you seriously benchmark iSCSI or NFS you should first benchmark your network, so that you know how much network throughput you really have.
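Both tools follow the same two-ended pattern; a sketch with Pathrate (binary names are from the distribution I used, and flags beyond -s may vary, so check the bundled README):

# On the far host, start the sender component and leave it running:
./pathrate_snd

# On the near host, point the receiver at the sender's address:
./pathrate_rcv -s 172.16.165.18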
Pathrate is an end-to-end capacity estimation tool. It can help you identify bottlenecks along the network path and determine what the true speed will look like through the narrowest portion of a given link.
# ./pathrate_rcv -s 172.16.165.18

pathrate run from 172.16.165.18 to z00001AT on Thu Dec 21 16:44:25 2006

--> Average round-trip time: 0.3ms

-------------------------------------------------
Final capacity estimate : 964 Mbps to 964 Mbps
-------------------------------------------------
Pathload is a tool for estimating the available bandwidth of an end-to-end path. “The available bandwidth is the maximum IP-layer throughput that a flow can get in the path from S to R, without reducing the rate of the rest of the traffic in the path.”
# ./pathload_rcv -s 10.71.165.18
Receiver z00001AT starts measurements at sender 10.71.165.18 on Thu Dec 21 17:17:11 2006
Interrupt coalescion detected
Receiving Fleet 0, Rate 1200.00Mbps
Receiving Fleet 1, Rate 600.00Mbps
Receiving Fleet 2, Rate 923.08Mbps
Receiving Fleet 3, Rate 1090.91Mbps
Receiving Fleet 4, Rate 1000.00Mbps
Receiving Fleet 5, Rate 961.54Mbps

        ***** RESULT *****
Available bandwidth range : 923.08 - 1090.91 (Mbps)
Measurements finished at Thu Dec 21 17:17:16 2006
Measurement latency is 4.84 sec
These two tools, used together, can be invaluable when benchmarking storage over IP networks.
Pulling It Together
There are lots of tools, but I hope you can see that no one tool is right or wrong; each serves a different purpose and contributes to your understanding of a given configuration or solution in its own way. Never pigeonhole yourself into relying on just one. I’m not going to pretend that mastering the various tools is easy… it’s not! Learning them takes a lot of time and effort, but if you are setting out to benchmark a solution you obviously have some desire to understand it. Resist the urge to react based on a single value or measurement; instead, take advantage of the variety of helpful analysis tools available and use their strengths in conjunction with the standard tools provided by your OS or storage solution.