14 Jul 2010

Service dependencies with NRPE

If you have defined services using the NRPE mechanism, you might know the following scenario:
The NRPE daemon fails and all services using it go critical. A first step to avoid these false alarms is to create an additional service which monitors the NRPE daemon itself (called check_nrpe_daemon in this example) and set up a dependency between your services and check_nrpe_daemon.
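
As a rough sketch, such a setup could look like the following object definitions. The host name web01 and the generic-service template are placeholders that do not come from the original setup, and the failure criteria are only one reasonable choice. Calling check_nrpe without a -c argument just asks the daemon for its version, which is enough to tell whether the daemon answers at all.

```cfg
# Checks the NRPE daemon itself (no remote command is executed)
define command {
    command_name         check_nrpe_daemon
    command_line         $USER1$/check_nrpe -H $HOSTADDRESS$
}
define service {
    use                  generic-service      ; assumed template name
    host_name            web01                ; hypothetical host
    service_description  check_nrpe_daemon
    check_command        check_nrpe_daemon
}
# check_swap depends on check_nrpe_daemon: suppress its notifications
# while the parent service is CRITICAL or UNKNOWN
define servicedependency {
    host_name                      web01
    service_description            check_nrpe_daemon
    dependent_host_name            web01
    dependent_service_description  check_swap
    execution_failure_criteria     n
    notification_failure_criteria  c,u
}
```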

In terms of dependency logic, this means that if one of these services fails, further checks and notifications for it depend on the state of its parent service check_nrpe_daemon. Normally one would formulate the dependency so that no notifications are sent for the dependent services while the NRPE daemon is down. However, consider the following sequence, where services have max_check_attempts=2:
<ul>
<li>check_nrpe_daemon is checked and returns OK</li>
<li>The NRPE daemon stops working</li>
<li>A service using NRPE is checked (check_nrpe!check_swap) and returns CRITICAL (SOFT:1)</li>
<li>check_nrpe_daemon is checked and returns CRITICAL (SOFT:1)</li>
<li>A service using NRPE is checked (check_nrpe!check_swap) and returns CRITICAL (SOFT:2)</li>
<li>check_nrpe_daemon is checked and returns CRITICAL (SOFT:2)</li>
<li>A service using NRPE is checked (check_nrpe!check_swap) and returns CRITICAL (HARD:1)</li>
<li>A notification is sent out saying there is a problem with swap.</li>
<li>check_nrpe_daemon is checked and returns CRITICAL (HARD:1)</li>
<li>A notification is sent out saying there is a problem with NRPE.</li>
</ul>
This is what we wanted to avoid: the dependent service reaches its hard state before check_nrpe_daemon does, so the swap notification is not suppressed. You can set the flag soft_state_dependencies, but it won’t help in every case. What we need is an immediate check of the parent service as soon as a dependent service fails. To get this, a new command force_nrpe_check is defined and used as an event handler for the dependent services. force_nrpe_check forces scheduling of check_nrpe_daemon.

define command {
    command_name       force_nrpe_check
    command_line       $USER1$/force_nrpe_check
}
define service {
    service_description  check_swap
    check_command        check_nrpe!check_swap
    event_handler        force_nrpe_check
    ....
}

And this is the source code of the force_nrpe_check event handler script:

#!/bin/sh
#
#  The name of the service which checks the NRPE daemon
#  (check_nrpe_daemon in the example above)
#

NRPE_SERVICE=serviceprofile_os_hpux_common_check_agent

case "$NAGIOS_SERVICESTATE" in
  OK)
    # no need to care for nrpe health
    ;;
  WARNING)
    # check_nrpe does not exit with warnings.
    # So this exit code really comes from a remote check command
    ;;
  UNKNOWN|CRITICAL)
    if [ "$NAGIOS_SERVICEATTEMPT" -eq 1 ]; then
      NAGIOS_NOW=$(date +"%s")
      # The reason for this error state might be a failed NRPE daemon.
      # Schedule a forced check of the check_nrpe_daemon service immediately.
      # External commands are newline-terminated; without the \n the two
      # commands written here could run together into one line.
      printf "[%s] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n" \
          "$NAGIOS_NOW" "$NAGIOS_HOSTNAME" "$NRPE_SERVICE" "$NAGIOS_NOW" > "$NAGIOS_COMMANDFILE"
      # If the reason for our problem really was a problem with the
      # NRPE daemon, then the check_nrpe_daemon service will change its
      # state to soft;1.
      # But the originally failed service is still one step ahead and
      # might reach a hard state before check_nrpe_daemon does.
      # Therefore we need to force another check of check_nrpe_daemon to
      # raise its service attempt counter above the counter of the dependent service.
      printf "[%s] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n" \
          "$NAGIOS_NOW" "$NAGIOS_HOSTNAME" "$NRPE_SERVICE" "$NAGIOS_NOW" > "$NAGIOS_COMMANDFILE"
    fi
    ;;
esac
exit 0
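
To see what the event handler actually sends to Nagios, the command line can be built in a small helper and written to a temporary file instead of the real command pipe. This is only a dry-run sketch: the function name schedule_forced_svc_check, the host web01 and the fixed timestamp are made up for illustration. In production the values come from the NAGIOS_* environment variables, which Nagios only exports when enable_environment_macros is switched on.

```shell
#!/bin/sh
# Build one external command line in the format
#   [time] SCHEDULE_FORCED_SVC_CHECK;<host_name>;<service_description>;<check_time>
# $1 = host name, $2 = service description, $3 = Unix timestamp
schedule_forced_svc_check() {
  printf '[%s] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n' "$3" "$1" "$2" "$3"
}

# Dry run: a temporary file stands in for $NAGIOS_COMMANDFILE
cmdfile=$(mktemp)
schedule_forced_svc_check web01 check_nrpe_daemon 1279065600 > "$cmdfile"
cat "$cmdfile"
# prints: [1279065600] SCHEDULE_FORCED_SVC_CHECK;web01;check_nrpe_daemon;1279065600
rm -f "$cmdfile"
```

Each line ends with a newline, so two consecutive commands written to the pipe stay separate.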

Author: Gerhard Laußer
