nacheckfilers

This Perl script can be used in two modes: reporting and paging. In reporting mode (-report) the script accepts a list of Network Appliance Filer names or IP addresses and gathers information about them, specifically disk failure information. In paging mode (-page) it tests for failed components and, if any are found, sends email to a list of recipients, which can include pagers that accept email (SkyTel). The tests are done using the NetApp MIB and Net-SNMP's Perl module.

The following is the output of the script in reporting mode:

# ./nacheckfilers.v2.04  -report alice ralph judy
Filer alice is UP.
Total Disks: 84 | Active: 82 | Reconstructing: 0
Spare Disks: 2 | Failed: 0 -> There are no failed disks.
Failed Fans: 0 | Failed Power Supplies: 0

----------------------------------
Filer ralph is UP.
Total Disks: 84 | Active: 82 | Reconstructing: 0
Spare Disks: 2 | Failed: 0 -> There are no failed disks.
Failed Fans: 0 | Failed Power Supplies: 0

----------------------------------
Filer judy is UP.
Total Disks: 42 | Active: 40 | Reconstructing: 0
Spare Disks: 2 | Failed: 0 -> There are no failed disks.
Failed Fans: 0 | Failed Power Supplies: 0

----------------------------------
#

Something special about this script is that there is a hard-coded minimum number of spares that must be found, in this case 2. This script was written after I took control of monitoring our NetApp Filers and found that a large number of disks were going offline without triggering AutoSupport (the NetApp Filers should automatically send mail when a part fails). The only way to find failures was to log in and run "sysconfig -r" manually. I noticed, however, that while AutoSupport didn't notice and report the problem, SNMP was giving me a clue. When a disk would "slip" offline without reporting the failure, the filer would decrement the SpareDisk SNMP counter but NOT the TotalDisks counter! This script evolved to solve that problem. It runs a series of tests comparing various disk counters, and if all of them pass (i.e. there shouldn't be a problem) it then checks the number of spares against the hard-coded minimum. If all these tests pass, no action is taken. However, if any one of these tests does not pass, the email addresses specified are notified of the fault.
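
The counter comparison described above can be sketched roughly as follows. This is a minimal illustration, not the code from nacheckfilers itself: the sub name check_disks, the hash keys, and the exact comparisons are assumptions, and the sketch assumes that active, spare, and failed disks together should add up to the total (which matches the sample report above). The counter values would come from SNMP in the real script; here they are passed in by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Minimum number of spares; the real script hard-codes 2 as well.
my $MIN_SPARES = 2;

# Hypothetical helper: takes the disk counters as a hash and returns a list
# of problem descriptions. An empty list means every test passed.
sub check_disks {
    my (%c) = @_;
    my @problems;

    # Test 1: the filer itself reports failed disks.
    push @problems, "$c{failed} disk(s) reported failed"
        if $c{failed} > 0;

    # Test 2: a disk that "slips" offline lowers the spare count but not the
    # total, so the parts no longer add up to the total (assumed comparison).
    my $accounted = $c{active} + $c{spare} + $c{failed};
    push @problems, "total is $c{total} but only $accounted disks are accounted for"
        if $accounted != $c{total};

    # Final test: only when nothing else looks wrong, compare the spares
    # against the hard-coded minimum.
    push @problems, "only $c{spare} spare(s), expected at least $MIN_SPARES"
        if !@problems && $c{spare} < $MIN_SPARES;

    return @problems;
}

# Usage: a healthy filer, then one where a spare has silently gone missing.
my @healthy = check_disks(total => 84, active => 82, spare => 2, failed => 0);
my @slipped = check_disks(total => 84, active => 82, spare => 1, failed => 0);

print "alice: no action\n" unless @healthy;
print "ralph: FAULT: $_\n" for @slipped;
```

In the second call the total still says 84 while only 83 disks can be accounted for, which is the mismatch that AutoSupport missed.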

When used in paging mode, the script is intended to be called from cron. When a failure is detected it tags a file in /tmp. On the next run, if a failure is detected again, it checks the /tmp file before paging; if you have already been notified it takes no action. If, however, all checks pass and the /tmp file exists (i.e. you fixed it), it sends an "All Clear" page for that filer. This serves two purposes: 1) if you are on vacation you don't want to get paged every 15 minutes until you get back to fix it (trust me, this happened to me for a week!), and 2) if someone else fixes the problem and doesn't notify you, it's best to know before going on-site for no reason.
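
The notify-once behaviour can be sketched like this. Again, this is only an illustration under assumptions: the sub name paging_action, the marker file name, and the return values are made up for the example, and the real script may lay out its /tmp file differently. The sketch uses a temporary directory instead of /tmp so it can be run safely.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: decide what to do for one filer on one cron run.
# $has_fault is true when any test failed; $state_dir holds the marker files.
# Returns 'page', 'all_clear', or 'none'.
sub paging_action {
    my ($filer, $has_fault, $state_dir) = @_;

    # Marker file name is an assumption made for this sketch.
    my $marker = File::Spec->catfile($state_dir, "nacheckfilers.$filer.fault");

    if ($has_fault) {
        return 'none' if -e $marker;    # already notified, stay quiet
        open(my $fh, '>', $marker) or die "Cannot create $marker: $!";
        close($fh);
        return 'page';                  # first sighting of this fault
    }

    if (-e $marker) {
        # Checks pass but a marker is left over: somebody fixed the problem.
        unlink($marker) or die "Cannot remove $marker: $!";
        return 'all_clear';
    }

    return 'none';                      # healthy and nothing outstanding
}

# Usage: four consecutive cron runs for one filer.
my $dir = tempdir(CLEANUP => 1);
print paging_action('alice', 1, $dir), "\n";    # page
print paging_action('alice', 1, $dir), "\n";    # none
print paging_action('alice', 0, $dir), "\n";    # all_clear
print paging_action('alice', 0, $dir), "\n";    # none
```

Keeping the state in a file rather than in memory is what lets separate cron invocations remember that a page has already gone out.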



Download: nacheckfilers.v2.04
Requirements: NetApp MIB*, Net-SNMP 5.0.x and its Perl module**
License: GNU General Public License

*NOTE: Download the NetApp MIB and copy it into Net-SNMP's system MIB directory, usually: /usr/local/share/snmp/mibs/
**NOTE: The script currently requires Net-SNMP 5.0.2.pre1. If you use a different version of Net-SNMP, change the "use SNMP" line to reflect that.