Monitoring with Zabbix

Posted on December 17, 2007

With a new site comes new opportunity. When Joyent recently added a new facility I decided to look at new monitoring solutions. We were previously standardized on Sun X4100’s, Sun T1000’s, and Sun X4500’s with a network using Force10 and F5. This new site would replace the X4100’s with Sun X4150’s and Dell 2950’s. Having largely avoided Dell in the past, I took my time to get to know the system well. Knowing that IPMI is far more utilized in that arena than with most Sun systems, I was given the opportunity to get a whole new appreciation for IPMI 2.0. Armed with this new knowledge I wanted to take our monitoring much further than we had in the past, down to monitoring individual BMC sensors via IPMI. This meant looking at monitoring solutions in a much deeper way.

Zabbix quickly rose to the top of my list. In the initial phase I grabbed Zabbix and Zenoss and planned for a face-off. Zabbix compiles nicely and easily on Solaris/X86, and within about 30 minutes I had a server up. I also created an agent tarball complete with installation script so that deploying an agent was as easy as a wget, untar, and “install.sh”. Zenoss, however, gave me problems. Dealing with Python isn’t my strong point, several dependencies were required, and then I had problem after problem getting things to build properly… after about an hour I decided that, as pretty as Zenoss looks in those screenshots, it was out of the running for the time being.

Zabbix isn’t pretty… I’ll be honest. If you want a “tactical view” to put on an overhead projector in your NOC, Zabbix won’t give you that Hollywood movie feel. Zabbix does make up for that shortcoming in raw power. Let me explain…

Zabbix is agent based. You can choose to avoid this, but you’ll lose most of what makes Zabbix so great. It is implemented in C and easily portable, so you don’t need to worry about having Java or Python on all the monitored hosts as with many of its competitors. It adds to this the standard assortment of SNMP, custom script, and external checks (icmp, ftp, etc) you expect. Where it adds something really interesting is its WEB (why they put this in all caps I don’t know) monitoring capabilities, which allow you to do more than just fetch a page: you can supply “steps” such as logging into a site and navigating around. Response times are stored for each step, allowing you to alert someone if, say, the login process takes more than 5 seconds. Very handy indeed.

Zabbix agents and other checks associate data with “keys”. These keys are then bound to “items” that describe the data and define how often to update it. These items are then associated with various alert conditions called “triggers”. For instance, the agent by default returns the number of users connected with the key “system.users.num”, and the stock template associates the “Number of users connected” item with that key, which is polled every 60 seconds. So now we can create any number of different alert conditions based on the number of users logged in by creating triggers. One default trigger is “Too many users connected on server {HOSTNAME}” (where HOSTNAME is replaced ultimately with the appropriate host), which uses the following expression/condition:

{Template_Joyent:system.users.num.last(0)}>50

In this case, the trigger is associated with my “Template_Joyent”. The function “last(0)” (meaning the last value you polled) is applied to the key “system.users.num”, and if the value is greater than 50 the condition is true. In Zabbix all conditions should evaluate to false when things are fine. Each trigger has a Severity associated with it, in this case “Average”. So the result here is that any user configured to get alerts for Average or higher severity will get a notification when more than 50 users are logged in.

Zabbix provides a rich set of functions by which to create your triggers, such as average change over time, min and max, absolute difference, etc. This means that I could, for instance, create a trigger that alerted me if since the last polling interval more than 10 users logged in or if the average number of users over the course of an hour exceeded 50.
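
As a rough sketch, those two conditions might be written as two separate triggers along the following lines. The function names “change” and “avg” and their arguments are how I understand the Zabbix trigger function list, so check them against the documentation for your version; the thresholds are just the numbers from the paragraph above.

```
{Template_Joyent:system.users.num.change(0)}>10
{Template_Joyent:system.users.num.avg(3600)}>50
```

The first compares the latest value against the previous one, the second averages the values collected over the last 3600 seconds.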

Where this becomes really powerful is when you choose to extend your agents. In each agent’s configuration file you can supply a “UserParameter” directive which runs some command and returns its output under a given key. Here are some simple examples:

# SMF
UserParameter=smf.online,svcs -a | grep online | wc -l
UserParameter=smf.offline,svcs -a | grep offline | wc -l
UserParameter=smf.maintance,svcs -a | grep maint | wc -l
# X4150 IPMI (BMC Direct)
UserParameter=ipmi.amb,/usr/sbin/ipmitool sensor reading "Ambient Temp0" | cut -f2 -d'|'

UserParameter takes two arguments which are comma delimited: the key name and the command to run. In the above, I can return the number of “online” SMF Services by grepping out of “svcs -a” and then returning that as “smf.online”. Restart the agent, go back to the server, add a new Item for this key, and start creating triggers. Now I can be alerted if, for instance, the number of online services decreases by some number in a given time. The IPMI example there provides a workaround for environments where you may not have access to the management port of your server and instead want to return IPMI data directly from the OS using ‘ipmitool’.
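
One caveat with the grep approach is that it matches anywhere on the line, so a service whose name happens to contain “online” would be counted too. Below is a minimal sketch of a stricter variant that could live in a small helper script. The function names count_state and sensor_value are my own, hypothetical ones. It assumes a POSIX awk (nawk on Solaris), that the state is the first column of “svcs -a” output, and that the sensor line has the form “name | value”, as the cut in the example above implies.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Zabbix. Both read their input on stdin
# so they can be tried out against canned text.

# Count lines whose first column (STATE in `svcs -a`) equals $1 exactly.
count_state() {
    awk -v state="$1" '$1 == state { n++ } END { print n + 0 }'
}

# Print the second '|'-separated field with surrounding whitespace removed,
# e.g. the reading from a line shaped like "Ambient Temp0 | 25".
sensor_value() {
    awk -F'|' '{ gsub(/^[ \t]+|[ \t]+$/, "", $2); print $2 }'
}
```

Usage would look like “svcs -a | count_state online” or “ipmitool sensor reading "Ambient Temp0" | sensor_value”.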

The examples above are simplistic; you can get more advanced by allowing the server to pass arguments. For example, if you want to monitor disk ops, you probably don’t want to add a separate key for each disk. Instead you could specify the disk name back on the server, which is passed to your agent as an argument.
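
A sketch of what that might look like in the agent configuration follows. The key name “disk.ops” and the script path are hypothetical, and the “[*]” and “$1” syntax is how I understand Zabbix’s parameterized UserParameter form, so verify it against the documentation for your version.

```
# Hypothetical key and script; $1 is replaced with the argument from the item key
UserParameter=disk.ops[*],/opt/local/bin/disk_ops.sh $1
```

On the server you would then create one item per disk with a key such as “disk.ops[sd0]”, and the agent would run the script with “sd0” as its argument.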

These abilities remove a lot of the cruft and put the power into your hands. If you have a highly customized environment Zabbix is a great choice.

However, in Zabbix 1.4 there is a lot more that is needed. Features like escalations are planned but not expected till Zabbix 1.6. Without features like limiting repeat pages you’re forced to get your triggers properly defined rather than masking away false-positives in your alerting policy, such as the way Nagios handles “flapping”.

The “Overview” (all in one view of the world) page is something that takes some getting used to. Rather than a pretty Nagios style page in black with “0 Services Down, 0 Servers Down”, you instead get a list of defined triggers and a color-coded column for each monitored host. If the box is green, life is good. If the box is red, life isn’t. Several nifty things, like flashing green if the trigger is fine now but wasn’t less than 15 minutes ago, are handy but sometimes annoying, especially during testing.

But, and this is a big but, Zabbix does give you something exceedingly useful… all the monitored keys can be graphed. Click on the color block of any monitored trigger and you can view its value history as a graph. This means that you can take add-on graphing applications like Cacti or MRTG and roll the functionality directly into Zabbix. Want to know what load average has looked like for the last week? No problem. And, add to this the ability to custom create graphs which can combine multiple keys into a single view.

If you see a screenshot with pretty graphs all over it on the Zabbix front page, that’s a “Screen”. A screen is a page custom laid out with several custom graphs. So if you want a graph that has the memory usage and load average of every system on your network, you can put those together in a custom graph and then place that on your “Screen”. If you take some time and create some nifty graphs like this you may find yourself looking more at your screen than at the Overview page.

Zabbix isn’t as pretty as Zenoss or as streamlined as Nagios, but it really is a SysAdmin’s tool. It doesn’t hide functions away from you or gloss over details to make you feel nurtured; it’s all out there, gritty and raw, for you to use and manipulate. There is a natural learning curve in wrapping your head around the concepts of “items” and “triggers” and how you can combine them in really powerful ways, but before long you’ll be frustrated by the limits of other solutions. Having a full featured and easily extensible agent really is my favorite aspect, and it frees you from the concerns that come with having to pass around the SSH keys required to make external scripts work with other solutions.

That said, if you decide to implement Zabbix, expect to spend some time crafting triggers to suit your needs. If you want to get beyond the basics you’ll need to get your hands dirty, which is easy to do after playing with it a bit, but if you want something you can just deploy and forget, consider a commercial tool like Uptime.

I’m only grazing the basics here. If you want to learn more check out Zabbix.com.

UPDATE: Screenshots added per comment request.