Nagios Network Monitoring

Nagios is frankly not very good, but in my opinion it's better than most of the alternatives. After all, you could spend buckets of cash on HP OpenView or Tivoli and still face the same amount of work to customize it into a useful state. Among the free alternatives, Big Brother has been too unstable to trust in my experience, which makes me loath to buy the license required for commercial use. Mon is quite good at monitoring and alerting, but it has all the same problems as Nagios plus the lack of a sexy web GUI. I also don't like the way it handles service restoration alerts, blocking outages (dependencies), or multiple concurrent outages.

Did you say problems?

The things I don't like about Nagios are that grouping is very weak and that configuration is a nightmarish glob of circularly referential files. Once it's all done it's workable, but starting from scratch is needlessly difficult, because everything you want to do depends on a definition in a file that you haven't written or modified yet. Additionally, you'll quickly find yourself reducing your groups of hosts to fewer and fewer machines as the exceptions become more and more numerous. In a perfect world, I should be able to use Boolean logic to define what gets checked how -- for instance "check_fping hostgroup local-network not switch4". I should be able to define hosts without defining services to go on them (e.g. a single service host). I should be able to edit the HTML interface without diving into the C code and recompiling.

My biggest complaint about Nagios is that you can't cluster checks -- imagine if you will a web farm of 20 servers behind a load-balancer. No one wants to get paged because one of the twenty crashed at 4AM, but you certainly want to know if ten or fifteen of the twenty are down. You can put a check_http on the host definitions for each of the twenty and use

        service_notification_commands   notify-by-epager,notify-by-email
        host_notification_commands      host-notify-by-email

in your contacts.cfg, and you can define a separate host and service for the load-balancer's virtual IP and thereby know if the whole site is down. But the only way to get paged for "more than one but fewer than all of the servers are down" is to pull SNMP from the load-balancers, or to run multiple checks, parse the answers, and return a single answer to Nagios. That one bugs me enough that I wrote a plugin, which can be downloaded here. You can also browse the source and the config file. Patches are very welcome at the address below.
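
For illustration only, here is a minimal Python sketch of the "run multiple checks, parse the answers, return a single answer" idea. It is not my plugin and not check_cluster; the function names, the argument order, and the plain HTTP probe on port 80 are all assumptions made up for the example. The only part taken from Nagios itself is the plugin exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
#!/usr/bin/env python3
"""Sketch of a cluster check: one probe per host, one Nagios status out."""
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def http_is_up(host, timeout=5.0):
    """Hypothetical probe: True if the host answers an HTTP GET on port 80."""
    try:
        with urllib.request.urlopen(f"http://{host}/", timeout=timeout):
            return True
    except Exception:
        # Any failure (refused, timeout, HTTP error status) counts as down.
        return False


def cluster_status(results, warn, crit):
    """Collapse {host: is_up} into (exit_code, message).

    warn and crit are the numbers of down hosts at which the cluster
    becomes WARNING and CRITICAL respectively.
    """
    down = sorted(host for host, up in results.items() if not up)
    if len(down) >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif len(down) >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    message = f"CLUSTER {label}: {len(down)}/{len(results)} down"
    if down:
        message += " (" + ", ".join(down) + ")"
    return code, message


def main(argv):
    """Hypothetical usage: check_cluster_sketch.py WARN CRIT host1 host2 ..."""
    warn, crit, hosts = int(argv[1]), int(argv[2]), argv[3:]
    code, message = cluster_status({h: http_is_up(h) for h in hosts}, warn, crit)
    print(message)  # Nagios shows the first line of plugin output
    return code

# To use this as a plugin, end the script with: sys.exit(main(sys.argv))
```

With thresholds of, say, 5 and 10 on the twenty-server farm, one dead box at 4AM stays OK and nobody gets paged, while ten dead boxes go CRITICAL. Attach the result to a single service on the load-balancer host so that only that service pages.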

UPDATE: I've since discovered the check_cluster.c plugin, conveniently located in the contrib directory of the Netsaint source distribution or the contrib directory of the Nagios Plugins distribution (direct link here). It is not compiled by default and consequently will not be available in any RPM or .deb distribution. It works in the same fashion as mine, but it does not allow you to specify checks in the same way; rather it looks for what's been defined in the hosts.cfg or services.cfg. I think mine is better so it's staying available here.

ANOTHER UPDATE: Timothy Denike (tim at friendster dot com) provided a nice wrapper script for the check_cluster contrib program that will feed a hostgroup into it automagically. Download here.

Say, wouldn't it be nice to know if Nagios crashed or got shut off? Here's one way: my_nagios_check.pl. Put it in /etc/crontab like so:

        */30 * * * * root /usr/local/bin/my_nagios_check.pl
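
If you'd rather roll your own, the general idea can be sketched in a few lines of Python. This is not the contents of my_nagios_check.pl; it assumes Nagios writes its PID to a lock file (wherever lock_file points in your nagios.cfg), and the function names are invented for the example.

```python
import os


def pid_alive(pid):
    """True if a process with this PID exists; signal 0 only probes."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def nagios_running(lock_file):
    """True if the PID recorded in lock_file (assumed path) is a live process."""
    try:
        with open(lock_file) as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or garbled lock file: treat Nagios as down.
        return False
    return pid_alive(pid)
```

A cron job built on this would call nagios_running() with your lock file path and send mail or restart the daemon when it returns False. Note that a stale lock file whose PID has been reused by another process will fool it, so checking the process name as well would be more robust.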

Last modified: Oct 24, 2008 2:28 pm.
Contact me.