Nagios core High Availability (HA) on CentOS 8

Posted: November 5, 2019 in Linux

Keepalived provides the HA. It is a service that monitors servers or processes to implement high availability on your infrastructure, using VRRP to move a virtual IP address between nodes.

This example implements Active-Passive HA. The Nagios secondary (192.168.0.27) monitors connectivity to the Nagios master (192.168.0.26). When a disruption is detected, the secondary starts the nagios and postfix services and serves requests until the master is available again. Once the connection to the master is restored, the secondary stops nagios and postfix. Both servers are reachable via the keepalived virtual IP 192.168.0.30.

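Install rsync on both servers. check.sh pulls files with rsync over SSH (the host:path syntax), so rsync must also be installed on the Nagios master. Root on the secondary also needs key-based (passwordless) SSH access to the master: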

dnf install rsync
systemctl start rsyncd && systemctl enable rsyncd

On both servers, install keepalived:

dnf install keepalived
systemctl start keepalived && systemctl enable keepalived
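A quick pre-flight check can confirm that the tools this setup relies on are installed on each server. This is a minimal sketch. The function name missing_commands and the list of commands passed to it are assumptions for illustration, not part of the original setup.

```shell
#!/bin/bash
# Hypothetical pre-flight helper: report which of the given commands are
# missing from PATH. Returns 1 if any are missing, 0 otherwise.
missing_commands() {
    local cmd missing=()
    for cmd in "$@"; do
        # command -v succeeds only if the command can be found
        command -v "$cmd" >/dev/null 2>&1 || missing+=("$cmd")
    done
    if [ "${#missing[@]}" -gt 0 ]; then
        echo "missing: ${missing[*]}"
        return 1
    fi
    echo "all present"
}

# Usage on either server (command list is an assumption):
# missing_commands rsync keepalived systemctl
```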

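The retention.dat file holds state information such as scheduled downtime, acknowledgements and comments. Nagios reads it at startup, so when the secondary starts Nagios after a failover, its dashboard shows the same downtime, acknowledgements and comments as the master. This file, along with the .cfg files, is copied regularly from the master to the slave.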
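To confirm that the copy really carries this state, you can compare entry counts on both servers. This is a minimal sketch that assumes Nagios Core's retention.dat format:

- Downtime and comments are stored as hostdowntime/servicedowntime and hostcomment/servicecomment blocks.
- Acknowledgements are stored as problem_has_been_acknowledged=1 inside host and service blocks.

The function name retention_summary is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count downtime, comment and acknowledgement entries
# in a Nagios retention file (block names assumed from Nagios Core's format).
retention_summary() {
    awk '
        /^hostdowntime [{]/    { dt++ }   # host downtime block
        /^servicedowntime [{]/ { dt++ }   # service downtime block
        /^hostcomment [{]/     { c++ }    # host comment block
        /^servicecomment [{]/  { c++ }    # service comment block
        /^[[:space:]]*problem_has_been_acknowledged=1/ { ack++ }
        END { printf "downtimes=%d comments=%d acknowledgements=%d\n", dt, c, ack }
    ' "$1"
}

# Usage on both servers, then compare the two lines:
# retention_summary /usr/local/nagios/var/retention.dat
```

If the counts match shortly after a sync, the file was most likely copied as expected.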

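On the master, edit /usr/local/nagios/etc/nagios.cfg and set retention_update_interval=1. This option sets how often (in minutes) Nagios automatically saves retention data during normal operation; the default is 60 minutes. Because etc/nagios.cfg is not in the exclude list, the sync also copies this setting to the secondary.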
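You can script this edit. The sketch below uses a hypothetical helper named set_nagios_option and assumes GNU sed (for -i). It replaces an existing, possibly commented-out, key=value line, or appends one when the key is missing.

```shell
#!/bin/bash
# Hypothetical helper: set KEY=VALUE in a Nagios main config file.
# Replaces an existing (possibly commented-out) line, else appends one.
# Keep a backup of nagios.cfg before editing it.
set_nagios_option() {
    local file=$1 key=$2 value=$3
    if grep -Eq "^[#[:space:]]*${key}=" "$file"; then
        # GNU sed: edit in place, extended regex
        sed -i -E "s|^[#[:space:]]*${key}=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Usage (paths from this article), then verify the config and restart:
# set_nagios_option /usr/local/nagios/etc/nagios.cfg retention_update_interval 1
# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
# systemctl restart nagios
```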

On the secondary, create /etc/keepalived/exclude-list.txt. It lists the folders and files under /usr/local/nagios that should not be synchronized from the Nagios master:

/etc/keepalived/exclude-list.txt

bin/
etc/cgi.cfg
etc/htpasswd.users
include/
libexec/
sbin/
share/
var/nagios.log
var/objects.cache
var/status.dat
var/archives/
var/rw/
var/spool/
var/spool/checkresults/
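To create the file non-interactively, a quoted heredoc writes the list exactly as shown, with no shell expansion. The function name write_exclude_list and its optional destination argument are assumptions for illustration. rsync's --exclude-from ignores blank lines and lines starting with # or ;, so you could add comments to the file if needed.

```shell
#!/bin/bash
# Hypothetical helper: write the rsync exclude list used by check.sh.
# Destination defaults to the path used in this article.
write_exclude_list() {
    local dest=${1:-/etc/keepalived/exclude-list.txt}
    # Quoted 'EOF' keeps the content literal (no variable expansion)
    cat > "$dest" <<'EOF'
bin/
etc/cgi.cfg
etc/htpasswd.users
include/
libexec/
sbin/
share/
var/nagios.log
var/objects.cache
var/status.dat
var/archives/
var/rw/
var/spool/
var/spool/checkresults/
EOF
}

# Usage on the secondary:
# write_exclude_list
```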

Keepalived config on Nagios master

Set the role to BACKUP, the priority to 9 and the virtual IP to 192.168.0.30 in /etc/keepalived/keepalived.conf:

! Configuration File for keepalived

global_defs {
    enable_script_security 1
    script_user root
}

vrrp_instance VI_1 {
    debug 4
    interface eth0
    state BACKUP
    virtual_router_id 51
    advert_int 1
    priority 9
    virtual_ipaddress {
        192.168.0.30 dev eth0    # the virtual IP
    }
    unicast_src_ip 192.168.0.26 # Local IP
    unicast_peer {
        192.168.0.27 # Peer IP
    }
    authentication {
        auth_type PASS
        auth_pass XXXX
    }
}

Keepalived config on nagios_secondary

Set the role to MASTER and the priority to 10, and define a check script (track_script). The check script is a bash script that copies files from the Nagios master and reports its state: 0 if all is good, 1 on failure. It runs every 15 seconds (interval). Two consecutive failures (fall) mark it as failed, and two consecutive successes (rise) mark it as OK again. While the check is failed, the priority is reduced by 2 (weight -2) to 8, which is below the master's priority of 9. When nagios_secondary becomes MASTER, it stops the nagios and postfix services (notify_master /etc/keepalived/stop_nagios.sh). When it becomes BACKUP, it starts them (notify_backup /etc/keepalived/start_nagios.sh).
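The interplay of interval, fall, rise and weight can be traced with a small model. This is a simplified illustration, not keepalived's implementation. It assumes the script starts in the OK state and ignores keepalived's own startup handling. The values are taken from the configuration in this article, and the function name simulate_priority is hypothetical.

```shell
#!/bin/bash
# Simplified model of how vrrp_script results change the secondary's
# priority (illustration only; keepalived's real logic has more states).
BASE_PRIORITY=10   # priority 10
WEIGHT=-2          # weight -2
FALL=2             # fall 2
RISE=2             # rise 2

# Arguments: check.sh exit codes in order (0 = OK, non-zero = failure).
# Prints the effective priority after each check, space-separated.
simulate_priority() {
    local ok=1 fails=0 passes=0 rc out=()
    for rc in "$@"; do
        if [ "$rc" -eq 0 ]; then
            passes=$((passes + 1)); fails=0
            # RISE consecutive successes mark a failed script as OK again
            if [ "$ok" -eq 0 ] && [ "$passes" -ge "$RISE" ]; then ok=1; fi
        else
            fails=$((fails + 1)); passes=0
            # FALL consecutive failures mark an OK script as failed
            if [ "$ok" -eq 1 ] && [ "$fails" -ge "$FALL" ]; then ok=0; fi
        fi
        if [ "$ok" -eq 1 ]; then
            out+=("$BASE_PRIORITY")
        else
            out+=("$((BASE_PRIORITY + WEIGHT))")
        fi
    done
    echo "${out[*]}"
}

simulate_priority 0 1 1 0 0   # prints: 10 10 8 8 10
```

A single failed check does not change the priority. Only the second consecutive failure drops it to 8, below the master's 9, and it returns to 10 only after two consecutive successes. With interval 15, each step is about 15 seconds.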

/etc/keepalived/keepalived.conf:

! Configuration File for keepalived

global_defs {
    enable_script_security 1
    script_user root
}


vrrp_script chk_service_health {
    script /etc/keepalived/check.sh
    interval 15
    fall 2
    rise 2
    weight -2
}

vrrp_instance VI_1 {
    debug 4

    interface eth0

    state MASTER

    virtual_router_id 51
    advert_int 1
    priority 10

    virtual_ipaddress {
        192.168.0.30 dev eth0    # the virtual IP
    }

    unicast_src_ip 192.168.0.27 # Local IP

    unicast_peer {
        192.168.0.26 # Peer IP (Nagios master)
    }

    authentication {
        auth_type PASS
        auth_pass XXXX
    }

    track_script {
        chk_service_health
    }
    notify_master /etc/keepalived/stop_nagios.sh
    notify_backup /etc/keepalived/start_nagios.sh

}
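Each node's unicast_peer must be the other node's unicast_src_ip, and virtual_router_id must match on both. A small consistency check can therefore catch copy-paste mistakes between the two files. This is a sketch with hypothetical helper names (conf_value, conf_peer, check_peers). It only understands the simple layout used in this article, with the peer address on the line after "unicast_peer {".

```shell
#!/bin/bash
# Hypothetical helpers for checking two keepalived.conf files against each other.

# First value after a keyword, e.g. "unicast_src_ip 192.168.0.27 # Local IP"
conf_value() { awk -v key="$2" '$1 == key {print $2; exit}' "$1"; }

# First non-empty line inside the "unicast_peer {" block
conf_peer() { awk '/unicast_peer/ {inblock=1; next} inblock && NF {print $1; exit}' "$1"; }

# Prints "OK" and returns 0 if the two files point at each other
check_peers() {
    local a=$1 b=$2 rc=0
    if [ "$(conf_peer "$a")" != "$(conf_value "$b" unicast_src_ip)" ]; then
        echo "$a: unicast_peer does not match unicast_src_ip in $b"; rc=1
    fi
    if [ "$(conf_peer "$b")" != "$(conf_value "$a" unicast_src_ip)" ]; then
        echo "$b: unicast_peer does not match unicast_src_ip in $a"; rc=1
    fi
    if [ "$(conf_value "$a" virtual_router_id)" != "$(conf_value "$b" virtual_router_id)" ]; then
        echo "virtual_router_id differs"; rc=1
    fi
    [ "$rc" -eq 0 ] && echo "OK"
    return "$rc"
}

# Usage on the secondary (the /tmp path is an assumption):
# scp root@192.168.0.26:/etc/keepalived/keepalived.conf /tmp/master-keepalived.conf
# check_peers /tmp/master-keepalived.conf /etc/keepalived/keepalived.conf
```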

/etc/keepalived/check.sh:

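#!/bin/bash
# Pull Nagios configuration and retention data from the Nagios master.
# Trailing slashes copy the contents of /usr/local/nagios; without them rsync
# would create /usr/local/nagios/nagios on the secondary.
# BatchMode makes ssh fail instead of waiting at a password prompt.
rsync -amzv --timeout=5 --delete \
      -e "ssh -o BatchMode=yes -o ConnectTimeout=5" \
      --exclude-from=/etc/keepalived/exclude-list.txt \
      192.168.0.26:/usr/local/nagios/ /usr/local/nagios/
rc=$?

# 24 = source files vanished during the transfer (Nagios rewrites files
# constantly); the master was reachable, so do not treat it as a failure.
if [ "$rc" -eq 0 ] || [ "$rc" -eq 24 ]; then
    exit 0 # All good. Nagios master reachable
else
    exit 1 # Failover trigger
fi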
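keepalived does not keep the check script's output, so a failed check says nothing about why rsync failed. One option is to translate rsync's exit code into text and send it to syslog with logger. The sketch below covers a few codes documented under EXIT VALUES in the rsync man page. Code 255 is what rsync typically returns when the ssh connection itself fails. The function name describe_rsync_rc is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: map common rsync exit codes to a short description.
describe_rsync_rc() {
    case "$1" in
        0)   echo "success" ;;
        10)  echo "error in socket I/O" ;;
        12)  echo "error in rsync protocol data stream" ;;
        23)  echo "partial transfer due to error" ;;
        24)  echo "partial transfer due to vanished source files" ;;
        30)  echo "timeout in data send/receive" ;;
        35)  echo "timeout waiting for daemon connection" ;;
        255) echo "remote shell (ssh) failure" ;;
        *)   echo "rsync exit code $1 (see EXIT VALUES in man rsync)" ;;
    esac
}

# Possible use in check.sh, right after the rsync call:
# rc=$?
# logger -t check.sh "rsync: $(describe_rsync_rc "$rc")"
```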

/etc/keepalived/stop_nagios.sh:

#!/bin/bash

logfile=/var/log/stop_nagios.txt
exec >> "$logfile" 2>&1

# systemd services to stop when this node becomes keepalived MASTER
services=("nagios" "postfix")

for p in "${services[@]}"; do
   if systemctl -q is-active "$p"; then
      echo "$p is running, stopping it"
      date
      systemctl stop "$p"
   fi
done
exit 0

/etc/keepalived/start_nagios.sh:

#!/bin/bash

logfile=/var/log/start_nagios.txt
exec >> "$logfile" 2>&1

# systemd services to start when this node becomes keepalived BACKUP
services=("nagios" "postfix")

for p in "${services[@]}"; do
   if systemctl -q is-active "$p"; then
      echo "$p is running"
   else
      date
      echo "Starting $p..."
      systemctl start "$p"
   fi
done
exit 0
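keepalived also supports a single notify script in vrrp_instance, called on every state transition. It receives the instance type (INSTANCE or GROUP), the instance name and the target state (MASTER, BACKUP, FAULT, ...) as arguments. The two scripts above could be folded into one dispatcher that keeps this article's role mapping: MASTER on the secondary stops the services and BACKUP starts them. This is a sketch, not a drop-in replacement. The following are all hypothetical:

- the path /etc/keepalived/nagios_notify.sh, referenced as "notify /etc/keepalived/nagios_notify.sh" instead of the notify_master and notify_backup lines;
- the log file name;
- the SYSTEMCTL override, which exists only so the script can be tested without systemd.

```shell
#!/bin/bash
# Hypothetical single notify script for vrrp_instance VI_1.
# keepalived passes: $1 = INSTANCE|GROUP, $2 = name, $3 = target state.
SYSTEMCTL=${SYSTEMCTL:-systemctl}                 # overridable for testing
LOGFILE=${LOGFILE:-/var/log/nagios_notify.txt}    # hypothetical log path
SERVICES=(nagios postfix)

nagios_notify() {
    local state=$3 action p
    case "$state" in
        MASTER) action=stop ;;    # same mapping as notify_master above
        BACKUP) action=start ;;   # same mapping as notify_backup above
        *) echo "$(date): $2 entered $state, no action"; return 0 ;;
    esac
    for p in "${SERVICES[@]}"; do
        echo "$(date): $2 -> $state, $action $p"
        "$SYSTEMCTL" "$action" "$p"
    done
}

# Run only when keepalived passes arguments (not when sourced for testing)
if [ "$#" -gt 0 ]; then
    nagios_notify "$@" >> "$LOGFILE" 2>&1
fi
```

Unlike start_nagios.sh, this sketch calls systemctl start even if a service is already running. That is harmless for systemd, which leaves an active unit untouched.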
