IO Benchmarking: How, Why and With What
Posted on May 22, 2007
I’ve been meaning to bring up the dreaded benchmark topic for some time. Benchmarking is like some type of taboo… everyone wants to do it, but you get chastised if you talk about it. That, of course, only fuels the desire that much more. But how? What do you use? When do you use what tool? It’s really hard to find good information, so here is my attempt to shed some light on things.
Before we begin, let’s look at the word itself: benchmark. The term comes from woodworking and other such crafts. When a craftsman is making, for instance, table legs, he first carefully cuts the wood to size, measuring for exactness. Once the first piece is done properly he can take that funny flat pencil, make a mark on his bench, and then use that mark to measure up all the following pieces to speed the process along. Based on the mark he knows whether the other pieces are too long or too short and makes the needed alterations. The mark is not a measurement per se, but rather a quick method of judging the difference between things that should be similar.
Now we move into the storage world. The way in which we put that mark on our bench is different, but the purpose is exactly the same. Benchmarks are not about judging how fast something is! They are for judging change! Let me make that crystal clear… if you run some piece of software, a benchmark suite, against some piece of storage and definitively say “It’s X fast,” you will always be wrong and people will always mock your numbers. The reason for this is self-evident: any performance you see is based on the total environment in which it is run, which can almost never be duplicated perfectly. In the storage world there are countless variables that mix things up: cabling, HBAs, firmware, disks, disk firmware, OS, drivers, CPUs, memory, and on and on. Just a single rev of firmware on an HBA or BIOS can have a major impact on your performance, so just because one person can get 180MB/s sequential writes out of his 3Par doesn’t mean that you’re going to get that same number.
Despite all that common sense, we all still want one magical number that describes our l337 setup. But those numbers don’t work either, because of various factors. Is the data sequential or random? Read, write, or both? What’s the mix? Big files or small? No handful of numbers will make it clear. All these things contribute to the “benchmarks are useless” flamewars that occur all the time.
So, what then? Don’t benchmark? Stay oblivious to your performance thresholds and just rely on performance counters? Let’s get real: we all want to know what a given solution can do, and so even if we’re afraid to tell anyone what our numbers are, we still run them.
The following is a rundown of the various benchmarking methods that I find to be common and that I myself use, including when to use each, how, and why.
dd
Do you know how benchmarks really work? Seriously, do you? Or do you just download some software, run it, and trust it? Everyone should start with our old friend dd. Using it is simple: read from here and write to there in some block size. This is perfect for determining sequential performance and getting a baseline “how fast can I read? how fast can I write?” sort of picture.
Start with a small sequential write block size, like this: time dd if=/dev/zero of=/mystorage/testfile bs=8k
Let that run for about 30 seconds. While it runs, watch the IO using tools like fsstat or iostat. When you hit 30 seconds, stop it (^C) and do the math: n blocks were transferred at 8K, so multiply 8K by n, divide by the exact time reported in the output, and presto, we have a number. Alternatively, just divide the output file size by the time.
root@aeon ~$ time dd if=/dev/zero of=/iscsi/benchmark/testfile bs=8k
^C
97810+0 records in
97810+0 records out

real    1m52.933s
user    0m0.048s
sys     0m2.487s
root@aeon ~$
root@aeon ~$ ls -lh /iscsi/benchmark/testfile
-rw-r--r--   1 root     root        764M May 22 03:42 /iscsi/benchmark/testfile
root@aeon ~$ bc
60+52.9
112.9
764/112.9
6
Now, do it again, but this time with a 32K block. Then with a 128K block, a 512K block, a 1MB block, an 8MB block. Does the number change? How does it change? All of this is important, and something that storage admins rarely understand well until they do this kind of exercise. This is exactly why I think everyone should start with dd. Once you understand sequential write performance at various block sizes, turn it around and do it all again, but this time reading from a file to /dev/null. How does that look?
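If you’d rather not run each pass by hand, here’s a minimal sketch of the write sweep (the target path is the hypothetical one from above; the counts are paired with the block sizes so every pass moves roughly 1GB):

#!/bin/sh
# Sweep dd write block sizes; each pass moves roughly 1GB of data.
# /mystorage/testfile is a hypothetical target; substitute your own.
for run in "8k 131072" "32k 32768" "128k 8192" "512k 2048" "1024k 1024" "8192k 128"; do
    set -- $run
    echo "=== bs=$1 ==="
    time dd if=/dev/zero of=/mystorage/testfile bs=$1 count=$2
    rm -f /mystorage/testfile
done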
These types of tests give you a very good idea of your best possible sequential performance, and since most storage performs sequential IO best, this is normally the number I use for my “My Thumper can push line speed!” type statements to management.
But let’s be reminded right here and now… real-world applications almost never do huge quantities of sequential reads or writes. It’s normally either very random or, at best, a mix of random and sequential IO. Think of it like this: which is faster, a Top Fuel dragster or a Formula 1 car? The dragster is the fastest, but it can’t turn; the Formula 1 car can turn like crazy but has to slow down for the corners. This is, imho, a good way to think about sequential vs random IO. If you want to see this in action, watch the 24 Hours of Le Mans, where huge Can-Am cars scream down the straight at 225MPH but then get passed by Porsche 911s in the corners.
dd is also handy for testing raw devices. Benchmark /dev/rdsk/c1t1d0s2, then create a filesystem on it and benchmark using a file on that. Do tests against the character device (/dev/rdsk) and the block device (/dev/dsk) without a filesystem. On and on; there are lots of fun tests you can do with this simple tool that will help you better understand the underlying inner workings.
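As a sketch, with hypothetical device names (substitute your own, and be careful: writing to a raw device destroys whatever is on it), compare a 100MB read through each path:

# Raw (character) device: IO goes straight to the device, no buffer cache.
time dd if=/dev/rdsk/c1t1d0s2 of=/dev/null bs=128k count=800

# Block device: the same read passes through the OS buffer cache.
time dd if=/dev/dsk/c1t1d0s2 of=/dev/null bs=128k count=800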
If you really want to get into it, pick a single disk, download the spec sheet for the drive model from the manufacturer, and compare your numbers. Then partition the disk so that there is a 200MB partition at the beginning of the disk (starting at sector 0) and a 200MB partition at the end, and re-run your tests against those two partitions, as sketched below. Very quickly you’ll find that storage topics that might have been abstract or foggy start becoming very clear, and terms like “sector”, “zoning”, “cylinder”, “full bore stroke”, “seek time”, etc., all start making a lot more sense, because you can measure these things without some benchmarking tool abstracting all those details away from you.
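A sketch of that partition test, assuming slice 0 ended up at the start of the disk and slice 7 at the end (your slice numbers will differ):

# 200MB partition on the outer (start-of-disk, fastest) cylinders:
time dd if=/dev/rdsk/c1t1d0s0 of=/dev/null bs=128k

# 200MB partition on the inner (end-of-disk, slowest) cylinders:
time dd if=/dev/rdsk/c1t1d0s7 of=/dev/null bs=128k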
One word of caution: if you write data from /dev/zero to a filesystem with compression, you might see unbelievably high numbers thanks to smart compression. In these cases use /dev/urandom instead.
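For example (keeping in mind that /dev/urandom itself is fairly slow, so on fast storage you may want to stage a random file first so the random number generator isn’t the bottleneck):

# Incompressible data straight from the kernel RNG:
time dd if=/dev/urandom of=/mystorage/testfile bs=8k

# Or pre-generate a 512MB random file, then write from that:
dd if=/dev/urandom of=/tmp/random.dat bs=1024k count=512
time dd if=/tmp/random.dat of=/mystorage/testfile bs=8k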
IOZone
IOZone can be your best friend or your worst enemy. It gives you piles of output, but what does it all mean?
root@aeon ~$ iozone -a -f /iscsi/benchmark/testfile
(snip)
                                                      random   random     bkwd   record   stride
     KB  reclen    write  rewrite     read   reread      read    write     read  rewrite     read   fwrite frewrite    fread  freread
     64       4   100782   392899  1066277  1330599  1016473   424101   418281   329860   942552   281899   342403   470880   809281
     64       8   149508   561495  1308191  2210490  1779559   639776   819484   429479  1359677   333377   556799   853867  1206279
     64      16   224481   726304  1773010  3187118  2666736   902870   954801   512190  1681127   460485   615747   985460  1489957
     64      32   336690   865556  1186173  3984255  3213648   968640  1000206   640345  1394066   321463   593094   889313  1458013
     64      64   640728   984604  2065562  3788700  3194489  1048299  1230654   914748  1162367   374341   499802   581333  1392128
    128       4    48376   479597   794708  1293454  1075853   465638   687865   412859   653222   348679   407500   612489   921157
    128       8    95384   680542  1076199  2067003  1775777   703634   883158   636917  1540598   460509   566358   735581  1293810
    128      16   180520   920588  1112754  2970023  2615679   962219  1085347   790598  1968806   511767   684483   795485  1544370
    128      32   289588  1067558  1346983  3287511  2777119  1196016  1142954   752991  1910738   508116   735558   770868  1599215
That output can be confusing, but if you worked through the dd method above you suddenly realize that what IOZone is doing for you is automating all those metrics on your behalf. Above we see it creating a 64KB file (then a 128KB file, and so on) using record sizes of 4K, 8K, 16K, 32K, and 64K, doing read and write tests (“read” and “write” are sequential), random read and write, and the fread/fwrite family (buffered stream IO).
IOZone, therefore, is great for finding the sweet spot of your storage solution. You can find out how far you can push your caches before they become ineffective, or see what the optimal usage pattern is for your solution. Based on that, you can tweak caching or presets and see exactly how the performance range shifts in reaction.
When using IOZone, don’t get caught up in all its fancy options; -a pretty well has you covered. Just make sure that every test run is done identically.
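For example, something as simple as keeping the invocation fixed and saving each run’s output makes before/after comparison honest:

iozone -a -f /iscsi/benchmark/testfile > iozone-before.txt
# ...make your one change (firmware, driver, tuning knob)...
iozone -a -f /iscsi/benchmark/testfile > iozone-after.txt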
Now… before we move on I should address the cache problem. If you are doing these types of tests, where you hit a file over and over again, you want to ensure that your filesystem’s caching isn’t over-inflating your numbers. This is especially true when you want to see the performance of the disk solution rather than the filesystem. In these cases you should bypass the cache, either by using the raw character device (/dev/rdsk) or by mounting the filesystem with DirectIO (mount -o forcedirectio,…). Some benchmarks attempt to defeat the cache by allocating files larger than memory, but on a box with 16GB of memory that’s a bit ridiculous.
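A sketch of both approaches on Solaris/UFS (the remount syntax and IOZone’s -I direct-IO flag are from the versions I’ve used; check your man pages):

# Remount the filesystem with DirectIO so IO bypasses the page cache:
mount -o remount,forcedirectio /iscsi/benchmark

# Or ask IOZone itself to use direct IO where the filesystem supports it:
iozone -a -I -f /iscsi/benchmark/testfile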
Bonnie++
Bonnie++ has become very popular because it is that magical benchmark that plops out a nice number you can print on a shirt. Because of this, it’s both loved and hated. Either way, it’s the de facto grand-daddy of macro IO benchmarks.
root@aeon ~$ bonnie++ -u benr -d /iscsi/benchmark/ -r 512
(snip)
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
aeon             1G  7594   7  8317   2  2475   0 10840  13 11139   1  74.2   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 28164  97 +++++ +++ +++++ +++ 17849  94 +++++ +++ +++++ +++
I think Bonnie++ is a great tool for getting a general feel for a given configuration. The piles of data provided by tools like IOZone help you understand performance in a variety of scenarios, whereas Bonnie++ glosses over the details and gives you a handful of numbers. For this reason I recommend Bonnie++ when making non-tuning changes, such as upgrading your Fibre Channel HBA or running a new kernel or driver.
Remember, macro benchmarks are just that: high-level, wide-scope measurements. Used simply to see whether some change had an impact, Bonnie++ is a great friend to have available.
As an aside, am I the only one who would rather it said “Write” instead of “Output” and “Read” instead of “Input”? That’s always annoyed me.
One thing to be careful about when using Bonnie++ is its need to write twice as much data as you have RAM. That’s an attempt to eliminate cache effects. However, if you have 1GB or more of memory, that can really slow things down. I recommend disabling caching yourself (mount with ‘forcedirectio’) and using a more sensible size for your test, such as 1GB.
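That combination looks something like this (same hypothetical forcedirectio remount as above; -s is the test file size in MB and -r tells Bonnie++ how much RAM to assume):

mount -o remount,forcedirectio /iscsi/benchmark
bonnie++ -u benr -d /iscsi/benchmark -s 1024 -r 512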
FileBench
FileBench is an extremely powerful tool for both macro and micro benchmarking. Its great strength is its flexibility. Rather than running standard “as fast as you can!” tests, it processes a given, defined “workload”. Several standard workloads are provided with FileBench, and you can easily extend or modify them to suit your specific needs.
Let’s be more clear. Other tools will tell you how fast you ran this op or that, but specific tools need to be written to do more complex testing. FileBench provides a framework so that you can avoid writing your own tests in C or Perl or whatever. Its workloads follow patterns of IO and thus better represent real-world conditions. Just knowing that writing is fast isn’t good enough; you really want to model activity such as a mail server that does a bunch of getattrs, one or more opens, some writes, some appends, a close, and then repeats. All of this is just a simple workload in FileBench. Here is an excerpt of the “varmail” workload (varmail.f):
define process name=filereader,instances=1
{
  thread name=filereaderthread,memsize=10m,instances=$nthreads
  {
    flowop deletefile name=deletefile1,filesetname=bigfileset
    flowop createfile name=createfile2,filesetname=bigfileset,fd=1
    flowop appendfilerand name=appendfilerand2,iosize=$meaniosize,fd=1
    flowop fsync name=fsyncfile2,fd=1
    flowop closefile name=closefile2,fd=1
    flowop openfile name=openfile3,filesetname=bigfileset,fd=1
    ...
The IO pattern above mimics what you’ll really see in the world. FileBench thus allows us to duplicate IO patterns in the lab without having to reproduce the exact conditions (i.e., setting up a mail server and hitting it just to examine IO behaviour). Let’s see a run in action:
root@aeon bin$ ./filebench
filebench> load varmail
filebench> run
 3000: 33.011: Fileset bigfileset: 1000 files, avg dir = 1000000.0, avg depth = 0.5, mbytes=15
 3000: 33.260: Creating fileset bigfileset...
 3000: 59.883: Preallocated 812 of 1000 of fileset bigfileset in 27 seconds
 3000: 59.884: Creating/pre-allocating files
 3000: 59.884: Starting 1 filereader instances
 3001: 59.897: Starting 16 filereaderthread threads
 3000: 60.903: Running...
 3000: 122.423: Run took 60 seconds...
 3000: 122.435: Per-Operation Breakdown
closefile4               32ops/s   0.0mb/s      0.0ms/op        7us/op-cpu
readfile4                32ops/s   0.5mb/s      0.0ms/op       49us/op-cpu
openfile4                32ops/s   0.0mb/s      0.1ms/op       46us/op-cpu
closefile3               32ops/s   0.0mb/s      0.0ms/op        8us/op-cpu
fsyncfile3               32ops/s   0.0mb/s    270.9ms/op      156us/op-cpu
appendfilerand3          32ops/s   0.5mb/s      0.1ms/op       68us/op-cpu
readfile3                32ops/s   0.5mb/s      0.0ms/op       49us/op-cpu
openfile3                32ops/s   0.0mb/s      0.1ms/op       46us/op-cpu
closefile2               32ops/s   0.0mb/s      0.0ms/op        8us/op-cpu
fsyncfile2               32ops/s   0.0mb/s    229.0ms/op      151us/op-cpu
appendfilerand2          32ops/s   0.5mb/s      0.0ms/op       47us/op-cpu
createfile2              32ops/s   0.0mb/s      0.1ms/op       76us/op-cpu
deletefile1              32ops/s   0.0mb/s      0.1ms/op       60us/op-cpu
 3000: 122.435: IO Summary: 25009 ops 414.1 ops/s, (64/64 r/w) 2.0mb/s, 1571us cpu/op, 124.8ms latency
 3000: 122.435: Shutting down processes
You can see that we get a really nice breakdown of stats per operation. We can see clearly from the above that fsyncfile is my pain point.
My only caution is to avoid reading too much into the IO Summary; in too many cases I’ve found it to be misleading.
Personally, I look at FileBench not as a “benchmark” in the traditional sense, but rather as a workload generator. It’s very good at generating IO patterns. You can then use other tools, such as iostat, vmstat, DTrace, etc., to view the impact a given workload exerts on your configuration.
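For example, a simple pattern I use (the iostat flags here are the Solaris extended-device form; substitute your platform’s equivalent):

# Terminal 1: generate the load.
./filebench
#   filebench> load varmail
#   filebench> run

# Terminal 2: watch per-device throughput and service times while it runs.
iostat -xn 5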
Network Testing for IP Storage
When you test IP storage technologies such as NFS, AFS, or iSCSI, you’ll want to measure the speed from here to there. The best tools I’ve seen for this are Pathload and Pathrate. Both of these tools have a “sender” and a “receiver” component which work together on the two ends of the connection being tested, so you need access to both systems. Whenever you seriously benchmark iSCSI or NFS you should first benchmark your network, so that you know how much network throughput you really have.
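Both tools follow the same two-ended pattern; a sketch with Pathrate (binary names are from the distribution I used, and flags beyond -s may vary, so check the bundled README):

# On the far host, start the sender component and leave it running:
./pathrate_snd

# On the near host, point the receiver at the sender's address:
./pathrate_rcv -s 172.16.165.18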
Pathrate is an end-to-end capacity estimation tool. It can help you identify bottlenecks along the network path and determine what the true speed will look like through the narrowest portion of a given link.
# ./pathrate_rcv -s 172.16.165.18

pathrate run from 172.16.165.18 to z00001AT on Thu Dec 21 16:44:25 2006

--> Average round-trip time: 0.3ms

-------------------------------------------------
Final capacity estimate : 964 Mbps to 964 Mbps
-------------------------------------------------
Pathload is a tool for estimating the available bandwidth of an end-to-end path. “The available bandwidth is the maximum IP-layer throughput that a flow can get in the path from S to R, without reducing the rate of the rest of the traffic in the path.”
# ./pathload_rcv -s 10.71.165.18
Receiver z00001AT starts measurements at sender 10.71.165.18 on Thu Dec 21 17:17:11 2006
Interrupt coalescion detected
Receiving Fleet 0, Rate 1200.00Mbps
Receiving Fleet 1, Rate 600.00Mbps
Receiving Fleet 2, Rate 923.08Mbps
Receiving Fleet 3, Rate 1090.91Mbps
Receiving Fleet 4, Rate 1000.00Mbps
Receiving Fleet 5, Rate 961.54Mbps

        ***** RESULT *****
Available bandwidth range : 923.08 - 1090.91 (Mbps)
Measurements finished at Thu Dec 21 17:17:16 2006
Measurement latency is 4.84 sec
These two tools, used together, can be invaluable when benchmarking storage over IP networks.
Pulling It Together
There are lots of tools, but I hope you can see that no one tool is right or wrong; each serves a different purpose and contributes to your understanding of a given configuration or solution in its own way. Never pigeonhole yourself into relying on just one. I’m not going to pretend that mastering the various tools is easy… it’s not! Learning them takes a lot of time and effort, but if you are setting out to benchmark a solution you obviously have some desire to understand it. Resist the urge to react based on a single value or measurement; instead, take advantage of the variety of helpful analysis tools available and use their strengths in conjunction with the standard tools provided by your OS or storage solution.