Starting with Nagios

My home IT has grown quite a bit over the last few years. I am now at a point where a problem can have its cause on any of a bunch of different machines.
This means that I can spend quite some time identifying the service that actually has the problem. For example, if my SQL server goes down, a lot of other services stop working, but I might only notice that my media centre stopped working or that I can’t log into my e-mail account anymore.

Now my network is not that vast and I should be able to find the problem within an acceptable time, but with a properly set up monitoring system I can still save some time on troubleshooting. I have also wanted to look a little more into Nagios, to see to what extent I can tune our Nagios at work, since that one is pretty basic.

For starters I created a new Debian OpenVZ container and installed the Nagios package.
During the install you will be asked to create a password; try to remember it.
Once the install is done, you should have a running Nagios. Check by browsing to “http://machineipaddress/nagios3”; you can log in with the username “nagiosadmin” and the password you created during the installation. If the login works, you can leave the web interface again. There is not much to see yet, and the configuration is not done in the web interface.

In Debian the config files are located in “/etc/nagios3/” and you can put your own configuration files in the conf.d subdirectory. There are already a few files there; they are used for checking the Nagios host and its services. They also serve as example files and explain the options they use.
You can configure everything in one big file, or you can create a separate config file for every single object. Neither extreme is probably the best choice. You need to think about what you want to monitor and how best to structure the configuration for it. I chose to create one config file per host and define the host and all of its services in it. This way my config directory does not get too cluttered and it is still easy to find my way around each config file.
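
As a rough sketch of the one-file-per-host layout: Nagios reads every file ending in .cfg inside a directory named by a cfg_dir directive in the main config file. The cfg_dir line below is what I would expect to find in Debian's nagios.cfg, so check your own copy; the host file names are made up for illustration.

```cfg
# /etc/nagios3/nagios.cfg (main config file)
# Load every *.cfg file found in this directory.
cfg_dir=/etc/nagios3/conf.d

# Hypothetical layout with one file per host:
#   /etc/nagios3/conf.d/webserver.cfg   <- host "webserver" and all of its services
#   /etc/nagios3/conf.d/mailserver.cfg  <- host "mailserver" and all of its services
```

Files without the .cfg extension are skipped, which is handy for parking a definition you are not ready to load yet.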

Now that you have decided on a way to structure your configuration, it is time to create the first host definition. A host definition is made with “define host{}”.

A host definition needs to contain these:
host_name: short name of the host, used to reference the host in other parts of the Nagios configuration. I usually use the FQDN of the host here.
alias: longer name for the host. This name can be used for alarms and in the web interface. I use it for a short description.
address: IP address of the host. You could also use the FQDN, but then the check fails whenever your DNS goes down.
check_command: technically this is not required for the host definition, but if you don’t set it the host won’t be checked. Most of the time this is a simple ping check.
max_check_attempts: how many times a failing check is tried before the host counts as down and an alarm is generated.
check_period: references a timeperiod object. Sets the time period in which the host is monitored.
contact_groups: one or more predefined contact groups that should be alerted on failure. You could also use contacts if you only wish to alert specific people.
notification_interval: the time after which an alarm is resent. One unit is 60 s by default, so putting 30 here means alarms are resent every 30 minutes until the problem is resolved. You can change the 60 s to anything you like with the interval_length option in nagios.cfg.
notification_period: the time period in which alarms are sent.
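
To make the interval arithmetic concrete, here is a small sketch. interval_length lives in nagios.cfg and 60 is its default; the host values in the comments are only illustration.

```cfg
# /etc/nagios3/nagios.cfg
# Number of seconds in one "time unit" (60 is the default).
interval_length=60

# In a host definition this gives:
#   notification_interval 30    -> 30 units x 60 s = 1800 s, a reminder every 30 minutes
#   notification_interval 720   -> 720 units x 60 s = 43200 s, a reminder every 12 hours
```

Changing interval_length changes the meaning of every interval in your object definitions at once, so I would leave it at 60 unless there is a good reason not to.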

There are also a few important optional parameters:
check_interval: the time between regular host checks.
retry_interval: the time between host checks in case of an error. This way you can configure Nagios to check your host every 10 minutes normally, but every minute when there is an error.
parents: comma-separated list of hosts and network equipment between Nagios and the host. If one of the parents fails and the host can no longer be reached, the host is marked as unreachable instead of down. You refer to the parents by their host_name.
notification_options: comma-separated list of the states for which you want to receive notifications.
Possible options are:

  • d: the host is down
  • u: the host is unreachable because a parent is down
  • r: the host has recovered (it is up again)
  • f: flapping, the host keeps changing its state within a short amount of time
  • s: a scheduled downtime for the host starts or ends
  • n: do not send any alerts for this host

Here is an example Host configuration:

# Definition for the host webserver
define host{
        host_name               webserver
        alias                   description of the host
        address                 192.168.0.1
        check_command           check-host-alive
        check_interval          5
        retry_interval          1
        max_check_attempts      5
        check_period            24x7
        contact_groups          admins
        notification_interval   720
        notification_period     24x7
        notification_options    d,u,r
}
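
The example refers to three objects by name that have to be defined somewhere else: the check-host-alive command, the 24x7 time period and the admins contact group. The Debian package ships definitions for these, so normally you do not have to write them yourself. The sketch below is roughly what they look like, written from memory rather than copied from the package; the plugin path, the ping thresholds and the root contact are assumptions.

```cfg
# Ping-based host check. $HOSTADDRESS$ is replaced with the host's "address" value.
define command{
        command_name    check-host-alive
        command_line    /usr/lib/nagios/plugins/check_ping -H '$HOSTADDRESS$' -w 5000,100% -c 5000,100% -p 1
}

# Time period covering every hour of every day.
define timeperiod{
        timeperiod_name 24x7
        alias           24 Hours A Day, 7 Days A Week
        sunday          00:00-24:00
        monday          00:00-24:00
        tuesday         00:00-24:00
        wednesday       00:00-24:00
        thursday        00:00-24:00
        friday          00:00-24:00
        saturday        00:00-24:00
}

# Contact group that receives the notifications.
define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 root    ; assumes a contact named "root" is defined
}
```

If a host references a name that does not exist, Nagios refuses to load the configuration, so it is worth knowing where these objects live.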

Quite a few of those options are probably the same for most of your hosts. You could copy and paste them into all of your host definitions, or you could simply create a template host that you reference in your other host definitions. To create a template, write a host definition with all the default values you want for your hosts. The template also needs the parameter “register 0”, which tells Nagios not to load the object as an actual host. The Debian install of Nagios already comes with a prepared template. Here is the template I use as an example.

# Definition for my default host template
define host{
        name                    default-host
        register                0       ; DON'T REGISTER THIS DEFINITION - IT'S NOT A REAL HOST, JUST A TEMPLATE!
        check_interval          5
        retry_interval          1
        max_check_attempts      3
        check_period            24x7
        contact_groups          admins
        notification_interval   720
        notification_period     24x7
        notification_options    d,u,r
        check_command           check-host-alive
}

Now that you have a template, you need to reference it in your host definition. This is done with the “use” parameter. You can override any parameter of the template by defining it again in the host definition. Here is an example:

define host{
        host_name               webserver
        alias                   Debian Webserver
        use                     default-host
        address                 192.168.80.11
        check_command           check-host-alive
        max_check_attempts      5       ; overwrite the setting in the host template
}
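
To illustrate how the inheritance resolves: assuming the default-host template shown earlier, Nagios ends up treating the webserver host as if it had been written out in full like this. It is only an illustration of the merged result, not something to add to your config.

```cfg
# Effective definition of "webserver" after applying the default-host template
define host{
        host_name               webserver
        alias                   Debian Webserver
        address                 192.168.80.11
        check_command           check-host-alive
        check_interval          5       ; from the template
        retry_interval          1       ; from the template
        max_check_attempts      5       ; the host's value wins over the template's 3
        check_period            24x7    ; from the template
        contact_groups          admins  ; from the template
        notification_interval   720     ; from the template
        notification_period     24x7    ; from the template
        notification_options    d,u,r   ; from the template
}
```

Do not put this next to the short version in conf.d, since Nagios will complain about two hosts with the same host_name.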

Once you have defined a bunch of hosts, you will notice that the Nagios map gets quite messy. This happens if you do not define parent relationships, and it is not the only problem their absence causes. Without defined parents Nagios will mark every host that fails the host check as down. That is not actually what you want. The reason you set up Nagios in the first place is so that it can tell you where in your network the problem is. Once you have taught Nagios about your network structure, it can do that.
Let’s imagine your network looks something like this: the home server running Nagios is plugged into a switch, the workstation and the WLAN router hang off that switch, and the laptop and the mail server on the internet are reached through the router.
Let’s say your router dies. Without parents Nagios would tell you that your mail server, your laptop and your router are down. But if you define the parents correctly, it will tell you that your router is down and that your mail server and laptop are unreachable. Whether an unreachable host generates an alert depends on notification_options: it only stays quiet if u is not in the list. Staying quiet is a good thing, since the error you need to fix is probably not on the unreachable host.
Here is how you define the parents for my example:

define host{
        host_name       homeserver
        alias           homeserver with nagios
        ...
}
define host{
        host_name       switch1
        alias           my switch
        parents         homeserver
        ...
}
define host{
        host_name       workstation
        alias           my computer
        parents         switch1
        ...
}
define host{
        host_name       router1
        alias           my wlan router
        parents         switch1
        ...
}
define host{
        host_name       laptop
        alias           my laptop
        parents         router1
        ...
}
define host{
        host_name       mailserver
        alias           my mail server on the internet
        parents         router1
        ...
}
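
If you do not want to be alerted about hosts that are merely unreachable, one option is to leave u out of notification_options for the hosts behind the router. A minimal sketch, assuming the default-host template from earlier and a made-up address:

```cfg
define host{
        host_name               laptop
        alias                   my laptop
        use                     default-host
        address                 192.168.80.20   ; hypothetical address
        parents                 router1
        notification_options    d,r             ; no "u": no alert while only unreachable
}
```

Nagios still shows the host as unreachable in the web interface; it just does not notify anyone about it.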

This covers the basic setup for host monitoring. The next time I find some time for writing, I will cover the basics of service monitoring with Nagios.
