Nagios/Icinga service for Tor Project infrastructure

RETIRED

NOTE: the Nagios server was retired in 2024.

This documentation is kept for historical reference.

See TPA-RFC-33.

How-to

Getting status updates

  • Using a web browser: https://nagios.torproject.org/cgi-bin/icinga/status.cgi?allunhandledproblems&sortobject=services&sorttype=1&sortoption=2
  • On IRC: /j #tor-nagios
  • Over email: Add your email address to tor-nagios/config/static/objects/contacts.cfg

How to run a nagios check manually on a host (TARGET.tpo)

NCHECKFILE=$(grep -E -A 4 THE-SERVICE-TEXT-FROM-WEB config/nagios-master.cfg | grep -E '^ *nrpe:' | cut -d : -f 2 | tr -d ' |"')
NCMD=$(ssh -t TARGET.tpo grep "$NCHECKFILE" /etc/nagios -r)
# NCMD is the command that is being run. If it looks sane, run it, adding --verbose if you want more output.
ssh -t TARGET.tpo "$NCMD" --verbose
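
To make the first pipeline easier to follow, here is a minimal sketch that runs it against a made-up config fragment. The helper name nrpe_check_name, the YAML layout, the service name, and the check name are all assumptions for illustration; the real config/nagios-master.cfg may be laid out differently.

```shell
# Hypothetical helper: print the NRPE check name for a service.
# $1 = service text as shown in the web interface, $2 = config file
nrpe_check_name() {
    grep -E -A 4 -- "$1" "$2" | grep -E '^ *nrpe:' | cut -d : -f 2 | tr -d ' |"'
}

# Made-up fragment standing in for config/nagios-master.cfg
sample=$(mktemp)
cat > "$sample" <<'EOF'
services:
  -
    name: "example service - status"
    hostgroups: examplegroup
    nrpe: "example_check_status"
EOF

nrpe_check_name "example service - status" "$sample"   # prints: example_check_status
rm -f "$sample"
```

The first grep keeps the matching line plus the four lines after it, the second picks out the nrpe: key, and cut and tr strip the key name, spaces, pipes and quotes.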

Changing the Nagios configuration

Hosts and services are managed in the config/nagios-master.cfg YAML configuration file, kept in the nagiosadm@nagios.torproject.org:/home/nagiosadm/tor-nagios repository. Make changes with a normal text editor, commit and push:

$EDITOR config/nagios-master.cfg
git commit -a
git push

Watch the output of the git push command carefully! If the configuration build reports an error, your changes will not be deployed, even though the commit itself is still accepted.
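
One way to avoid missing such an error is to capture the push output and scan it. This is only a sketch: the helper name push_output_has_error is made up, and the assumption that failures show up as "remote:" lines containing "error" or "fail" is a guess about what the server-side hook prints.

```shell
# Hypothetical helper: read captured `git push` output on stdin and
# succeed (exit 0) if any server-side ("remote:") line mentions an error.
push_output_has_error() {
    grep -Ei '^remote:.*(error|fail)' >/dev/null
}

# Usage sketch (not run here): hook output arrives on stderr, so both
# streams are captured before inspection.
#   git push 2>&1 | tee /tmp/push.log
#   if push_output_has_error < /tmp/push.log; then
#       echo "config build failed on the server; check /tmp/push.log" >&2
#   fi
```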

Forcing a rebuild of the configuration

If the Nagios configuration seems out of sync with the YAML config, a rebuild of the configuration can be forced with this command on the Nagios server:

touch /home/nagiosadm/tor-nagios/config/nagios-master.cfg && sudo -u nagiosadm make -C /home/nagiosadm/tor-nagios/config

Alternatively, changing the .cfg file and pushing a new commit should trigger this as well.
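
The touch in the command above presumably works because make compares modification times: a target is rebuilt when one of its prerequisites is newer. A minimal sketch of that comparison follows, assuming the Makefile lists nagios-master.cfg as a prerequisite of the generated files (not confirmed here). The helper name needs_rebuild and the GENERATED_FILE placeholder are hypothetical.

```shell
# Hypothetical helper: succeed if SOURCE ($1) is newer than TARGET ($2),
# or if TARGET is missing. This mirrors the condition under which make
# considers a target out of date.
needs_rebuild() {
    [ ! -e "$2" ] || [ "$1" -nt "$2" ]
}

# Usage sketch (not run here); GENERATED_FILE stands for any file under
# config/generated/:
#   if needs_rebuild config/nagios-master.cfg config/generated/GENERATED_FILE; then
#       echo "generated config is stale"
#   fi
```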

Batch jobs

You can run batch commands from the web interface, thanks to Icinga's changes to the UI. There is also a command-line client called icli, which can do the same from a shell on the Icinga server.

This, for example, will queue recheck jobs on all problem hosts:

icli -z '!o,!A,!S,!D' -a recheck

This will run the dsa-update-apt-status command on all problem hosts:

cumin "$(ssh hetzner-hel1-01.torproject.org "icli -z'"'!o,!A,!S,!D'"'" | grep ^[a-z] | sed 's/$/.torproject.org or/') false" dsa-update-apt-status

It's kind of an awful hack -- take some time to appreciate the quoting required for those ! -- which might not be necessary with later Icinga releases. Icinga 2 has a REST API and its own command line console, which make icli completely obsolete.
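
The part of that one-liner that turns icli output into a cumin query can be looked at in isolation. The sketch below assumes, as the grep in the original does, that host lines start with a lowercase letter and that everything else is noise. The function name hosts_to_cumin_query is made up, and the lines are joined with spaces only for readability.

```shell
# Hypothetical helper: read host names on stdin (one per line) and print
# a cumin query of the form "a.torproject.org or b.torproject.org or false".
hosts_to_cumin_query() {
    # keep only lines starting with a lowercase letter, append the domain
    # and " or", then join the lines with spaces
    grep '^[a-z]' | sed 's/$/.torproject.org or/' | tr '\n' ' '
    # the trailing "false" closes the dangling "or", as in the original
    printf 'false\n'
}

printf 'gettor-01\n== not a host line ==\n' | hosts_to_cumin_query
# prints: gettor-01.torproject.org or false
```

With no problem hosts at all, the query collapses to just "false".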

Adding a new admin user

When a user needs to be added to the admin group, follow the steps below in the tor-nagios.git repository:

  1. Create a new contact for the user in config/static/objects/contacts.cfg:
define contact{
       contact_name                    <username>
       alias                           <username>
       service_notification_period     24x7
       host_notification_period        24x7
       service_notification_options    w,u,c,r
       host_notification_options       d,r
       service_notification_commands   notify-service-by-email
       host_notification_commands      notify-host-by-email
       email                           <email>+nagios@torproject.org
       }
  2. Add the user to authorized_for_full_command_resolution and authorized_for_configuration_information in config/static/cgi.cfg:
authorized_for_full_command_resolution=user1,foo,bar,<new user>
authorized_for_configuration_information=user1,foo,bar,<new user>
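
The cgi.cfg change can also be scripted. The following is a sketch only: the helper name add_cgi_user is made up, and it assumes that user names contain nothing but letters, digits, "-" and "_", and that each of the two settings sits on a single line.

```shell
# Hypothetical helper: append user $1 to both authorization lists.
# Reads cgi.cfg content on stdin, writes the updated content to stdout.
# All other lines pass through unchanged.
add_cgi_user() {
    sed -E "s/^(authorized_for_(full_command_resolution|configuration_information)=.*)\$/\\1,$1/"
}

# Usage sketch (not run here): write to a new file and review the diff
# before replacing the original and committing.
#   add_cgi_user newuser < config/static/cgi.cfg > cgi.cfg.new
#   diff -u config/static/cgi.cfg cgi.cfg.new
```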

Pager playbook

What is this alert anyways?

Say you receive a mysterious alert and you have no idea what it's about. Take, for example, tpo/tpa/team#40795:

09:35:23 <nsa> tor-nagios: [gettor-01] application service - gettor status is CRITICAL: 2: b[AUTHENTICATIONFAILED] Invalid credentials (Failure)

To figure out what triggered this error, follow this procedure:

  1. log into the Nagios web interface at https://nagios.torproject.org

  2. find the broken service, for example by listing all unhandled problems

  3. click on the actual service name to see details

  4. find the "executed command" field and click on "Command Expander"

  5. this will show you the "Raw commandline" that Nagios runs to do this check; in this case it is an NRPE check that calls tor_application_service on the other end

  6. if it's an NRPE check, log on to the remote host and run the command there; otherwise, the command is run on the Nagios host

In this case, the error can be reproduced with:

root@gettor-01:~# /usr/lib/nagios/plugins/dsa-check-statusfile /srv/gettor.torproject.org/check/status
2: b'[AUTHENTICATIONFAILED] Invalid credentials (Failure)'

In this case, it seems like the status file is under the control of the service administrator, who should be contacted for follow-up.
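
The leading 2 in the output above lines up with the standard Nagios plugin status codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), which is consistent with the CRITICAL alert. Whether dsa-check-statusfile always prints the code this way is inferred from this one example. A small helper, with the made-up name nagios_state_name, can translate a plugin's exit status when running checks by hand:

```shell
# Hypothetical helper: map a Nagios plugin exit status to its state name.
nagios_state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo "INVALID($1)" ;;
    esac
}

# Usage sketch (not run here), on the remote host:
#   /usr/lib/nagios/plugins/dsa-check-statusfile /srv/gettor.torproject.org/check/status
#   nagios_state_name $?
```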

Reference

Design

Config generation

The Nagios/Icinga configuration gets generated from the config/nagios-master.cfg YAML configuration file stored in the tor-nagios.git repository. The generation works like this:

  1. operator pushes changes to the git repository on the Nagios server (in /home/nagiosadm/tor-nagios)

  2. the post-receive hook calls make in the config sub-directory, which calls ./build-nagios to generate the files in ~/tor-nagios/config/generated/

  3. the hook then calls make install, which:

       • deploys the config file (using rsync) in /etc/icinga/from-git...

       • pushes the NRPE config to the Puppet server in nagiospush@pauli.torproject.org:/etc/puppet/modules/nagios/files/tor-nagios/generated/nrpe_tor.cfg

       • reloads Icinga

  4. and finally mirrors the repository to GitLab (https://gitlab.torproject.org/tpo/tpa/tor-nagios)
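
The flow above can be outlined as a dry-run script. This is not the actual hook: the function names, the MIRROR remote name, and the git push --mirror command are assumptions, and only the repository path and the two make invocations come from the description. Chaining the stages with && illustrates why a failed build would stop the deployment.

```shell
# Hypothetical outline of the post-receive flow. With DRY_RUN=1 (the
# default here) each command is printed rather than executed.
DRY_RUN=${DRY_RUN:-1}
REPO=/home/nagiosadm/tor-nagios        # path taken from the text above
MIRROR=${MIRROR:-gitlab}               # assumed name of a git remote

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

post_receive_outline() {
    run make -C "$REPO/config" &&           # build-nagios fills config/generated/
    run make -C "$REPO/config" install &&   # rsync, NRPE push to Puppet, Icinga reload
    run git -C "$REPO" push --mirror "$MIRROR"   # mirror to GitLab (assumed command)
}

post_receive_outline
```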