Digging Deeper: Systems Management

Posted on December 6, 2007

I’ve recently been delving deeper into the area of systems management. At Joyent we have a variety of unique problems that I’ve not faced in the past. For instance, I have a largely homogeneous environment: we have strict controls on what we use and always strive to be as consistent as possible. We also have more machines than I’ve ever managed before. While I’ve worked in many very large environments, I typically worked on big iron systems with organizations that divided administrative duties across multiple teams. This all combines into a perfect test bed for some really deep systems management practices.

So what is Systems Management? I think we can divide data center duties into 4 main groups:

  • Facilities & Infrastructure: Physical environment control; HVAC, Power, Racks, Cable Management, etc.
  • Network Administration: Managing the connections between systems and to the outside world.
  • Systems Administration: Managing the OSes and application environment that run on each system.
  • Application Administration: Various specializations for applications: Mail Admins, DBAs, etc.

In between these layers lives Systems Management.

To make an analogy… ‘root’ may be omnipotent, having “God Power”, but unlike God, a SysAdmin is not omnipresent. You may be the sole administrator of a 50-system installation, but do you really know the temps that the CPUs are running at? Do you know the load average of every system? Do you know if a disk has failed? Sure, you can find out right now, but what were those data points 2 days ago? A week ago?

The core focus of systems management, to me, is making an attempt to be omnipresent. God has the ability to know all things at one time… we don’t, so the critical elements are:

  • Determining a complete list of all data points at all layers of the stack that provide some insight into operations
  • Determining the meaning and interpretation of those data points, so that actions can be directed programmatically

This second point is vital and one of the reasons that so many people struggle with DTrace… knowing what question to ask tends to be self-evident with a little thought, and determining how to ask that question has become easier with tools like DTrace or SNMP, but WTF do those numbers really mean? A classic sysadmin interview question is to drop some printed vmstat output in front of a candidate and ask for an analysis on the spot.

Let me give you an example. At Joyent we’ve been learning a lot about modern storage. Despite the advances in CPU and memory technology, disk technology hasn’t really come that far; we are largely in the same boat that we were in 10 years ago. But how do you know when your storage solution just isn’t measuring up? How do you know when you’re in trouble? Look at the output of ‘iostat’. Most people incorrectly interpret the ‘%b’ column as “blocked”, saying that if it’s consistently 100 %b then you’re in trouble. But that’s not true… %b is “Busy”, meaning, during a given time interval, how much of that time was IO actively occurring? Thus, 100% simply means that over that time interval you were doing IO to a given device for the entire period… but that might be at 1MB/s or 200MB/s, or it may vary. It’s similar to saying that if the CPUs are at 100% utilization there is a problem, which isn’t strictly true if the load average remains lower than the number of cores. A fundamental flaw in most people’s interpretation of data is the assumption that 100% utilization automatically means something is broken… sure, that’s almost always the case, but fundamentally it’s still wrong. So, in the case of IO, the better number to look at is the asvc_t column, which reports the “Active Service Time”: the average time an IO took to be handled by the device, from when it was sent to the device until it returned. (Time spent queued in the driver, waiting to be sent to the device, is reported separately as wsvc_t.)

So how do I know if IO performance has dropped off to unacceptable levels? If asvc_t consistently exceeds 50ms, life is probably unpleasant, and if it exceeds 100ms life is just unbearable. That’s the number that I’m really interested in because it most accurately describes a condition of which I want to be apprised… and this is where monitoring comes in.
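As a rough illustration of reading asvc_t rather than %b, here is a minimal Python sketch. It assumes the column order that Solaris `iostat -xn` prints (r/s, w/s, kr/s, kw/s, wait, actv, wsvc_t, asvc_t, %w, %b, device). The function names and the sample numbers are made up for illustration and were not captured from a real system; the 50ms and 100ms cut-offs are the ones from the text.

```python
# Column order assumed for Solaris `iostat -xn` (extended per-device stats).
COLUMNS = ["r/s", "w/s", "kr/s", "kw/s", "wait", "actv",
           "wsvc_t", "asvc_t", "%w", "%b", "device"]

# Thresholds from the text, in milliseconds.
UNPLEASANT_MS = 50.0
UNBEARABLE_MS = 100.0


def parse_iostat(text):
    """Return one dict per device row, skipping the two header lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # banner line or blank line
        try:
            values = [float(f) for f in fields[:-1]]
        except ValueError:
            continue  # the column-name header row ends up here
        row = dict(zip(COLUMNS[:-1], values))
        row["device"] = fields[-1]
        rows.append(row)
    return rows


def classify(asvc_t):
    """Map an average active service time (ms) to a rough verdict."""
    if asvc_t > UNBEARABLE_MS:
        return "unbearable"
    if asvc_t > UNPLEASANT_MS:
        return "unpleasant"
    return "ok"


# Illustrative sample only: both devices are 100% busy, but only one is slow.
SAMPLE = """\
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   12.0  310.0   96.0 9800.0  0.0  1.2    0.0    3.7   0 100 c0t0d0
   45.0  120.0  360.0  960.0  2.1  9.8   12.7   59.4  14 100 c0t1d0
"""

if __name__ == "__main__":
    for row in parse_iostat(SAMPLE):
        print(row["device"], "%b =", row["%b"],
              "asvc_t =", row["asvc_t"], "->", classify(row["asvc_t"]))
```

Both sample devices show 100 %b, yet only the second one crosses the 50ms line, which is the distinction being made above. A single sample is not “consistently”, of course; judging that needs history.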

Monitoring makes us omnipresent. It allows us to be constantly aware of the changes in our environment that we should know about, and to record data points over a period of time to allow for trend analysis, capacity planning, and fault isolation down the road. Thus, we can divide monitoring applications into 2 categories:

  • Data Collection & Reporting Applications. Examples: Cacti, MRTG
  • Threshold & Conditional Alerting Applications. Examples: Nagios, Big Brother

From these have emerged several hybrids that attempt to combine the two, such as Zabbix and Zenoss. It’s not uncommon to see environments use multiple applications, however; I commonly see environments with Ganglia, Nagios, and Cacti, for instance. And, of course, there are specialty varieties, such as Snort or pmacct, that operate on specific types of data.
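To make the two categories concrete, here is a toy Python sketch of each role. It is not how any of the tools named above work internally; the `Series` class and `sustained_breach` function are hypothetical names invented for this example. The first keeps a history of samples (collection), and the second only fires when a threshold has been exceeded for several samples in a row (alerting), so a single spike does not page anyone.

```python
from collections import deque


class Series:
    """Collection side: keep the most recent samples for one data point."""

    def __init__(self, maxlen=1000):
        # Oldest samples fall off automatically once maxlen is reached.
        self.samples = deque(maxlen=maxlen)  # (timestamp, value) pairs

    def record(self, timestamp, value):
        self.samples.append((timestamp, value))

    def last(self, n):
        """Return the values of the n most recent samples, oldest first."""
        return [value for _, value in list(self.samples)[-n:]]


def sustained_breach(series, threshold, n):
    """Alerting side: True only if the last n samples all exceed threshold."""
    if n < 1:
        raise ValueError("n must be at least 1")
    recent = series.last(n)
    return len(recent) == n and all(value > threshold for value in recent)


if __name__ == "__main__":
    asvc_t = Series()
    # One spike among otherwise healthy samples: recorded, but no alert.
    for t, value in enumerate([10, 120, 12, 11]):
        asvc_t.record(t, value)
    print(sustained_breach(asvc_t, 50, 3))  # False

    # Three bad samples in a row: now it is worth bugging someone.
    for t, value in enumerate([60, 70, 80], start=4):
        asvc_t.record(t, value)
    print(sustained_breach(asvc_t, 50, 3))  # True
```

The recorded history is also what makes trend analysis possible later, which is why the collection side is kept separate from the alerting rule.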

And so, when we move beyond network or systems administration we have to take a more holistic approach. That means going beyond just paging someone when a system crashes: asking questions and thinking deeply about what there is to know, how to obtain that data, and how to feed it to something and build some intelligence around it so that it only bugs you when you really should be bugged.

Modern servers provide us with a rich set of tools and capabilities to exploit. IPMI, for example, is not just for remotely rebooting systems! SMBus and sensors are not just for putting some nifty graphic in the corner of your desktop! Even on commodity systems, boards now commonly include a Baseboard Management Controller (BMC), which forms the basis of most Lights Out Management (LOM) solutions, collecting and aggregating sensor data and making it available via IPMI. Want a list of components in your system? Want to know the temps and fan speeds in your box? Want to see system events that may not have been seen or felt by your OS but occurred nonetheless? IPMI talking to your BMC can help, even if you don’t have a Service Processor (SP) such as ILOM or DRAC.
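As a small taste of what that sensor data looks like once you have it, here is a hedged Python sketch that parses text in the pipe-delimited layout that `ipmitool sensor` typically prints (name, reading, unit, status, followed by threshold columns). Sensor names and the exact set of columns vary from board to board, so treat the layout as an assumption. The sample text is illustrative, not taken from a real BMC, and `parse_sensors` is a hypothetical helper.

```python
def parse_sensors(text):
    """Return {sensor name: (reading, unit, status)} for numeric sensors."""
    readings = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue  # not a sensor row
        name, value, unit, status = fields[:4]
        try:
            reading = float(value)
        except ValueError:
            continue  # "na" (no reading) or a discrete sensor such as "0x0"
        readings[name] = (reading, unit, status)
    return readings


# Illustrative sample only; real sensor names depend on the hardware.
SAMPLE = """\
CPU Temp         | 45.000     | degrees C  | ok    | na      | na      | na | 85.000 | 90.000 | na
FAN 1            | 5400.000   | RPM        | ok    | 300.000 | 500.000 | na | na     | na     | na
Chassis Intru    | 0x0        | discrete   | 0x0000| na      | na      | na | na     | na     | na
PS2 Temp         | na         | degrees C  | na    | na      | na      | na | na     | na     | na
"""

if __name__ == "__main__":
    for name, (reading, unit, status) in parse_sensors(SAMPLE).items():
        print(f"{name}: {reading} {unit} [{status}]")
```

On a real system the text would come from running ipmitool against the local or remote BMC; feeding the parsed readings into a collector is then the same problem as any other data point.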

In the next couple of weeks I’m going to try to set aside some time to dig into some of these topics in more depth. I don’t think most administrators are truly aware of all the resources available to them, because unless you go looking for something, have a need, or just stumble across it, how are you supposed to know? We’re all busy enough as it is without going looking for new things to learn. So that’s where I can try to help out. Hopefully we’ll all learn some new tricks to add to our arsenal of magic.