Thursday 30 May 2013

Creating a Custom Nagios/Opsview Plugin

This was done using Nagios with the Opsview front end, version 4.2.3.

Scripts are generally stored in the following location:-
 /usr/local/nagios/libexec 
Write the script in your language of choice and ensure it is executable by the nagios user.

The script must print a text message that Nagios can display, then exit with a code that signals a status of OK, WARNING, CRITICAL or UNKNOWN.

Exit Codes

  • 0 - This tells Nagios that the check has passed and is OK
  • 1 - This tells Nagios that the check has a problem but it is only a WARNING
  • 2 - This tells Nagios that the check has a problem that is CRITICAL
  • 3 - This generally means there was a problem running the check itself; Nagios will display UNKNOWN
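
As a rough illustration of how the exit codes listed above might be used, here is a minimal Python sketch. The helper names (classify, format_line), the idea of warning and critical thresholds, and the "STATUS - message" prefix are assumptions made for the example; only the four numeric codes come from the list above.

```python
# Exit codes from the list above, with the status Nagios displays for each.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measured value to an exit code (hypothetical helper).

    Assumes higher values are worse and that warn <= crit.
    """
    if value is None:
        return UNKNOWN  # the measurement could not be taken
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_line(code, message):
    """Build the one-line message to print; the prefix is a convention, not a requirement."""
    return "%s - %s" % (STATUS_NAMES.get(code, "UNKNOWN"), message)


# Example: 85 is at or above the warning threshold (80) but below critical (90).
code = classify(85, 80, 90)
print(format_line(code, "disk usage at 85%"))
# A real plugin would finish with: sys.exit(code)
```

The script's exit code, not the printed text, is what decides the status, so the prefix is only there to help whoever reads the message.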

Bash Script

The script could be written in Bash. If so, you just need to echo the message and then exit with the relevant code. For example, if the check is okay and you wish to exit, do the following:-
 echo "The check has passed"  
 exit 0  
The script could also be written in other languages. For example, the same check in Python:-
 import sys
 print('The check has passed')
 sys.exit(0)
The same check in Ruby:-
 puts "The check has passed"  
 exit 0  
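
The examples above only cover the passing case. A slightly fuller sketch, again in Python, might wrap the check so that an unexpected error produces UNKNOWN (3) with a readable message instead of a traceback. The names run_check and passing_check are hypothetical, and the (exit code, message) return convention is just an assumption for this example.

```python
def run_check(check):
    """Call check(), print its message and return its exit code.

    check is expected to return a tuple of (exit_code, message).
    If it raises, report UNKNOWN (3) with a short explanation.
    """
    try:
        code, message = check()
    except Exception as exc:
        code, message = 3, "check could not be run: %s" % exc
    print(message)
    return code


def passing_check():
    """Stand-in for a real check; always reports OK."""
    return 0, "The check has passed"


# A real plugin would finish with: sys.exit(run_check(passing_check))
```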
Once the script is thoroughly tested it needs to be linked to Nagios. You should aim for a script that takes a short time to run, ideally under 10 seconds, although the timeout can be extended if necessary. Since this is a local check, a config file needs to be put in the nrpe_local directory. The contents of this file could be something like the following:-
 cat /usr/local/nagios/etc/nrpe_local/new_check.sh  
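
The listing above shows only the cat command, not the contents of the file. NRPE command definitions normally take the form command[name]=command line, so the file would presumably contain a line of that shape. The Python sketch below just builds such a line; the function name is made up for the example, and the script file name is hypothetical because the post does not say what the script is called.

```python
def nrpe_command_definition(name, script_path):
    """Return one NRPE command definition line: command[<name>]=<command line>."""
    return "command[%s]=%s" % (name, script_path)


# new_check matches the -c argument passed to check_nrpe in this post.
# The script file name is an assumption for illustration only.
print(nrpe_command_definition("new_check", "/usr/local/nagios/libexec/new_check.sh"))
```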
The NRPE service will now need restarting so that it picks up the new command.

Opsview Configuration

Go into Settings > Advanced > Service Checks and add a new check. The two fields you are interested in are plugin, which should be check_nrpe, and arguments, which should be:-
 -H $HOSTADDRESS$ -c new_check  
Note that this can be tested on the host with the following:-
 /usr/local/nagios/libexec/check_nrpe -H `hostname` -c new_check

If this works, it should also work in Opsview. If there is a timeout problem, -t can be appended to the above with a value in seconds, e.g. -t 30; the default is 10.
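
If you want to repeat this test from a script, something along these lines could work. It is only a sketch: it assumes check_nrpe lives at the path used above, and the function names are made up for the example. The -H, -c and -t options are the ones already described in this post.

```python
import subprocess

# Path to check_nrpe as used earlier in this post.
CHECK_NRPE = "/usr/local/nagios/libexec/check_nrpe"


def check_nrpe_args(host, command, timeout=None):
    """Build the check_nrpe command line, adding -t only when a timeout is given."""
    args = [CHECK_NRPE, "-H", host, "-c", command]
    if timeout is not None:
        args += ["-t", str(timeout)]
    return args


def run_check_nrpe(host, command, timeout=None):
    """Run check_nrpe and return (exit_code, output_text)."""
    proc = subprocess.Popen(
        check_nrpe_args(host, command, timeout),
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    output, _ = proc.communicate()
    return proc.returncode, output.strip()


# Example with a 30 second timeout (needs check_nrpe installed on the host):
# code, text = run_check_nrpe("localhost", "new_check", timeout=30)
```

The exit code returned here is the same 0 to 3 value the plugin itself exited with, passed back through NRPE.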
You then need to associate the check with the relevant hosts. To do this, go to Settings > Basic > Hosts, search for the host and double click it to amend it. Then go to the monitoring tab, expand the relevant service group and select the new check.

After this you need to update the configuration. To do this, go to Settings > Configuration > Apply Changes and click Reload Configuration. Once the reload has finished and the check has run, you should see the result.