Monitoring Guide

This document describes how to use Nagios to monitor the health of the Bruce replication system. It assumes the reader is familiar with administering Nagios.

Monitoring for the Bruce daemon

The check_procs Nagios plugin is used to verify that exactly one Bruce daemon is running on a server. The plugin is included with the "Official Nagios Plugins" and is likely already installed on a running Nagios system.

To monitor a Bruce daemon running on the same host as Nagios:

  • Set up the 'check_bruce_daemon' command.
# 'check_bruce_daemon' command definition
define command {
       command_name     check_bruce_daemon
       command_line     $USER1$/check_procs -w 1:1 -c 1:1 -a com.netblue.bruce.Main
}
  • Set up a service that checks for the replication daemon.
# Define a service that will check for the replication daemon

define service {
       use                              local-service           ; service template to use
       host_name                        localhost
       service_description              bruce daemon
       check_command                    check_bruce_daemon
}
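
The effect of `-w 1:1 -c 1:1` is that any process count other than exactly one is reported as critical. The sketch below illustrates that rule in bash. It is not how check_procs is implemented; the function names are hypothetical, and `count_bruce_daemons` assumes the procps `pgrep` utility is available.

```shell
#!/bin/bash
# Sketch of the "exactly one process" rule behind -w 1:1 -c 1:1.
# This is NOT check_procs itself; both function names are hypothetical.

# Print a Nagios-style status line for a given process count and
# return the matching plugin exit code (0 = OK, 2 = CRITICAL).
bruce_daemon_status() {
    local count=$1
    if (( count == 1 )); then
        printf 'OK: %d Bruce daemon process\n' "$count"
        return 0
    fi
    printf 'CRITICAL: %d Bruce daemon processes (expected exactly 1)\n' "$count"
    return 2
}

# Count processes whose full command line matches the Bruce main class.
# Assumes procps pgrep; -c prints the number of matches (0 if none).
count_bruce_daemons() {
    pgrep -f -c 'com\.netblue\.bruce\.Main'
}

# Example (not run here):
#   bruce_daemon_status "$(count_bruce_daemons)"
```

Running the example line by hand on the monitored host is a quick way to confirm that the pattern matches the daemon before wiring the check into Nagios.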
Monitoring for a database

This example monitors for the presence of a bruce_config database, using the check_pgsql plugin included with the distribution:

  • Set up the check_pgserver command.
# 'check_pgserver' command definition
define command {
       command_name     check_pgserver
       command_line     $USER1$/check_pgsql -H $HOSTADDRESS$ -P $ARG1$ -d $ARG2$
}
  • Set up a service to monitor the database.
# Define a service to monitor for the bruce_config postgresql server and database

define service {
       use                              local-service           ; service template to use
       host_name                        localhost
       service_description              bruce_config DB
; check_command arguments: port!database name
       check_command                    check_pgserver!5432!bruce_config
}
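
The `!`-separated values in check_command become `$ARG1$`, `$ARG2$`, and so on in the command definition. The bash sketch below imitates that substitution so the resulting command line can be inspected. It is only an illustration: `expand_check_command` is a hypothetical helper, not part of Nagios, and it leaves `$USER1$` untouched.

```shell
#!/bin/bash
# Illustration only: imitate how the $ARGn$ and $HOSTADDRESS$ macros in a
# command_line are filled from a check_command such as
# check_pgserver!5432!bruce_config. Hypothetical helper, not part of Nagios.
expand_check_command() {
    local template=$1 check_command=$2 host_address=$3
    local IFS='!'          # split the check_command on '!'
    local -a parts
    read -r -a parts <<< "$check_command"

    # Replace the host macro first; single quotes keep the '$' literal.
    local line=${template//'$HOSTADDRESS$'/$host_address}

    # parts[0] is the command name; parts[1..n] map to $ARG1$..$ARGn$.
    local i
    for ((i = 1; i < ${#parts[@]}; i++)); do
        line=${line//"\$ARG${i}\$"/${parts[i]}}
    done
    printf '%s\n' "$line"
}

# Example:
#   expand_check_command \
#       '$USER1$/check_pgsql -H $HOSTADDRESS$ -P $ARG1$ -d $ARG2$' \
#       'check_pgserver!5432!bruce_config' 127.0.0.1
```

Replacing `$USER1$` in the printed line with the plugin directory gives a command that can be run by hand to test the check outside of Nagios.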
Monitoring for replication lag

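This monitor uses the custom check_bruce_lag plugin. The plugin takes the host, the node type (MASTER or SLAVE), the cluster ID, the port, the database name, and the warning and critical lag thresholds in seconds. To create the plugin, change to the Nagios plugins directory (often /usr/lib/nagios/plugins or /usr/local/nagios/libexec) and run the following commands: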

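# The quoted 'EOF' delimiter keeps $1, $HOST and the backticks literal,
# so they are written into the plugin instead of being expanded now.
cat >check_bruce_lag <<'EOF'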
#!/bin/bash

export HOST=$1
export TYPE=$2 # 'MASTER' or 'SLAVE'
export CLUSTER=$3
export PORT=$4
export DB=$5
declare -i warn crit lag
warn=$6
crit=$7

lag_rc=-1
if [ "$TYPE" == "MASTER" ] ; then
    lag=`psql -t -h $HOST -c "select round(extract(epoch from now()) - extract(epoch from update_time)) from bruce.snapshotlog where current_xaction = (select max(current_xaction) from bruce.snapshotlog)" -p $PORT $DB`
    lag_rc=$?
fi
if [ "$TYPE" == "SLAVE" ] ; then
    lag=`psql -t -h $HOST -c "select round(extract(epoch from now()) - extract(epoch from update_time)) from bruce.slavesnapshotstatus where clusterid = $CLUSTER" -p $PORT $DB`
    lag_rc=$?
fi

if [ "$lag_rc" != "0" ] ; then
   printf "UNKNOWN error $lag_rc\n"
   exit 3
fi

if (($lag>=$crit))
then
    printf "CRIT Bruce Replication Lag %ds\n" $lag
    exit 2
fi

if (($lag>=$warn))
then
    printf "WARN Bruce Replication Lag %ds\n" $lag
    exit 1
fi

printf "OK Bruce Replication Lag %ds\n" $lag
exit 0
EOF
chmod a+x check_bruce_lag
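
The threshold handling in the plugin can be tried without a database. The sketch below copies only the comparison logic into a function, with the lag passed in directly instead of queried through psql. `lag_status` is a hypothetical name used for illustration.

```shell
#!/bin/bash
# Sketch of the threshold tests from check_bruce_lag, without the psql query.
# lag_status is a hypothetical helper: lag, warn and crit are in seconds.
lag_status() {
    local -i lag=$1 warn=$2 crit=$3

    # Critical is tested first so that it wins when both thresholds are met.
    if (( lag >= crit )); then
        printf 'CRIT Bruce Replication Lag %ds\n' "$lag"
        return 2
    fi
    if (( lag >= warn )); then
        printf 'WARN Bruce Replication Lag %ds\n' "$lag"
        return 1
    fi
    printf 'OK Bruce Replication Lag %ds\n' "$lag"
    return 0
}

# Example: lag_status 75 60 300   (prints a WARN line, returns 1)
```

Both comparisons use `>=`, so a lag exactly equal to a threshold already triggers that state.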
  • Set up a command for monitoring.
# 'check_bruce_lag' command definition (warning at 60 seconds of lag, critical at 300)
define command {
       command_name     check_bruce_lag
       command_line     $USER1$/check_bruce_lag $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ 60 300
}
  • Set up a service to monitor a MASTER database.
# Define a service to monitor for snapshot lag on the bruce_master database

define service {
       use                              local-service           ; service template to use
       host_name                        localhost
       service_description              bruce_master lag
; check_command arguments: type!clusterid!port!database name
; (the clusterid is not used by the MASTER query)
       check_command                    check_bruce_lag!MASTER!1!5432!bruce_master
}
  • Set up a service to monitor a SLAVE database.
# Define a service to monitor for snapshot lag on the bruce_slave_02 database

define service {
       use                              local-service           ; service template to use
       host_name                        localhost
       service_description              bruce_slave_02 lag
; check_command arguments: type!clusterid!port!database name
       check_command                    check_bruce_lag!SLAVE!1!5432!bruce_slave_02
}
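
Before relying on these services, it can help to run the plugin by hand and check the exit code that Nagios will see. The bash sketch below wraps any command and prints its exit code together with the standard plugin state name (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Both function names are hypothetical helpers, not part of Nagios.

```shell
#!/bin/bash
# Hypothetical helpers for testing a plugin from the command line.

# Map a plugin exit code to the state name Nagios associates with it.
nagios_state_name() {
    case $1 in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo "UNKNOWN (unexpected exit code $1)" ;;
    esac
}

# Run the given command, let its output through, then report the result.
run_and_report() {
    local rc=0
    "$@" || rc=$?
    printf 'exit=%d state=%s\n' "$rc" "$(nagios_state_name "$rc")"
}

# Example (run from the plugins directory, against a reachable database):
#   run_and_report ./check_bruce_lag localhost SLAVE 1 5432 bruce_slave_02 60 300
```

An UNKNOWN result from the lag plugin usually means the psql call itself failed, or that the type argument was neither MASTER nor SLAVE, since the script exits with 3 in both cases.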