
A total loss of connectivity will show up in ping output like this:

878 packets transmitted, 0 received, 100% packet loss, time 14031ms

(See tpo/tpa/team#41654 for a discussion and further analysis of that specific issue.)

MTR can help diagnose issues in this case. Vary parameters like IPv6 (-6) or TCP (--tcp). In the above case, the problem could be reproduced with mtr --tcp -6 -c 10 -w maven.mozilla.org.

Tools like curl can also be useful for quick diagnostics, but note that curl implements the Happy Eyeballs standard, so it might hide issues (e.g. with IPv6) that are affecting other clients.
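
To sidestep that fallback, force each address family separately; for example, reusing the host from above:

curl -4 -sv -o /dev/null https://maven.mozilla.org/
curl -6 -sv -o /dev/null https://maven.mozilla.org/

If only the -6 run fails, the IPv6 path is the problem, even though regular curl (and browsers) would transparently fall back to IPv4.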

Unexpected reboot

If a host reboots without manual intervention, there can be several causes. Identifying exactly what happened after the fact can be challenging, or in some cases impossible, since logs might not have captured information about the issue before the reboot.

But in some cases the logs do have some information. Some things that can be investigated:

  • syslog: look particularly for disk errors, OOM kill messages close to the reboot, and kernel oops messages
  • dmesg from previous boots, e.g. journalctl -k -b -1, or see journalctl --list-boots for a list of available boot IDs
  • smartctl -t long and smartctl -A (or nvme device-self-test and nvme self-test-log for NVMe drives) on all devices
  • /proc/mdstat and /proc/drbd: make sure that replication is still all right
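
For example, a quick first pass over the previous boot's logs and disk health might look like this (/dev/sda is an illustrative device name):

# list the boots the journal knows about
journalctl --list-boots
# kernel messages from the previous boot, filtered for common failure signs
journalctl -k -b -1 | grep -iE 'error|oops|panic|out of memory'
# SMART attributes for a suspect drive
smartctl -A /dev/sda
# RAID replication status
cat /proc/mdstat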

Also note that it's possible this is a spurious warning, or that the host simply took longer than expected to reboot. Normally, our Fabric reboot procedures issue a silence so the monitoring system ignores those warnings. It's possible those delays are not appropriate for this host and need to be tweaked upwards.

Network-level attacks

This section should guide you through network availability issues.

Confirming network-level attacks with Grafana

In case of degraded service availability over the network, it's a good idea to start by looking at metrics in Grafana. Denial of service attacks against a service over the network will often cause a noticeable bump in network activity, both in terms of ingress and egress traffic.

The traffic per class dashboard is a good place to start.

Finding traffic source with iftop

Once you have found there is indeed a spike of traffic, you should try to figure out what it consists of exactly.

A useful tool to investigate this is iftop, which displays network activity in realtime via the console. Here are some useful keyboard shortcuts when using it:

  • n toggle DNS resolution
  • D toggle destination port
  • T toggle cumulative totals
  • o freeze current order
  • P pause display

In addition, the -f command-line argument can be used to filter network activity. For example, use iftop -f 'port 443' to only monitor HTTPS network traffic.
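
For example, to watch only HTTPS traffic on a specific interface with DNS resolution disabled (the interface name is illustrative):

iftop -i eth0 -n -f 'port 443'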

Firewall blocking

If you are sure that a specific $IP is mounting a Denial of Service attack on a server, you can block it with:

iptables -I INPUT -s $IP -j DROP

$IP can also be a network in CIDR notation, e.g. the following drops a whole Google /16 from the host:

iptables -I INPUT -s 74.125.0.0/16 -j DROP

Note that the above inserts (-I) a rule at the top of the chain, which puts it before other rules. This is most likely what you want: an existing rule earlier in the chain might otherwise allow the traffic through, making a rule appended (-A) to the end of the chain ineffective.

This only blocks one network or host, and quite brutally, at the network level. From a user's perspective, it will look like an outage. A gentler way is to use -j REJECT instead, which sends an error back to let the user know they're blocked.
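
For example, to reject TCP connections from the offending address with a reset instead of silently dropping packets:

iptables -I INPUT -s $IP -p tcp -j REJECT --reject-with tcp-reset

Note that --reject-with tcp-reset only applies to TCP, hence the -p tcp match; other protocols fall back to the default ICMP port-unreachable error.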

See also our nftables documentation.

Note that those changes are gone after a reboot or firewall reload; for permanent blocking, see below.

Server blocking

An even "gentler" approach is to block clients at the server level. That way the client application can provide feedback to the user that the connection has been denied, more clearly. Typically, this is done with a web server level block list.

We don't have a uniform way to do this right now. In profile::nginx, there's a blocked_hosts list that can be used to add CIDR entries which are passed to the Nginx deny directive. Typically, you would define an entry in Hiera with something like this (example from data/roles/gitlab.yaml):

profile::nginx::blocked_hosts:
  # alibaba, tpo/tpa/team#42152
  - "47.74.0.0/15"

For Apache servers, it's even less standardized. A couple servers (currently donate and crm) have a blocklist.txt file that's used in a RewriteMap to deny individual IP addresses.
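
As a rough sketch of that pattern (the map name, file path, and entries below are illustrative; the actual configuration on those servers differs), a RewriteMap-based block list looks something like this:

# blocklist.txt contains one entry per line, e.g.: 192.0.2.1 1
RewriteEngine on
RewriteMap blocklist "txt:/etc/apache2/blocklist.txt"
# deny the request if the client address is in the map
RewriteCond "${blocklist:%{REMOTE_ADDR}|0}" "!=0"
RewriteRule "^" "-" [F]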

Extracting IP range lists

A command like this will extract the IP addresses from a web server log file and rank them by number of hits:

awk '{print $1}' /var/log/nginx/gitlab_access.log | grep -v '0.0.0.0' | sort | uniq -c | sort -n

This assumes log redaction has been disabled on the virtual host, of course, which can be done in emergencies like this. With sort -n, the most frequent addresses show up at the bottom of the output.
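
A variant that shows only the ten busiest clients, most frequent first:

awk '{print $1}' /var/log/nginx/gitlab_access.log | grep -v '0.0.0.0' | sort | uniq -c | sort -rn | head -10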

You can look up which netblock the relevant IP addresses belong to with a command like ip-info (part of the libnet-abuse-utils-perl Debian package) or asn (part of the asn package). This can also be done by asking the asn.cymru.com service, with, for example:

nc whois.cymru.com 43 <<EOF
begin
verbose
216.90.108.31
192.0.2.1
198.51.100.0/24
203.0.113.42
end
EOF

This can be used to group IP addresses by netblock and AS number, roughly. A much more sophisticated approach is the asncounter project developed by anarcat, which allows AS and CIDR-level counting and can be used to establish a set of networks or entire ASNs to block.

The asncounter(1) manual page has detailed examples for this. That tool has been accepted in Debian unstable as of 2025-05-28 and should slowly make its way down to stable (probably Debian 14 "forky" or later). It's currently installed on gitlab-02 in /root/asncounter but may eventually be deployed site-wide through Puppet.

Filesystem set to readonly

If a filesystem is switched to readonly, no process can write to the affected volume anymore, which can have consequences of varying magnitude depending on which volume is readonly.

If Linux automatically changes a filesystem to readonly, it usually indicates that some serious issues were detected with the disk or filesystem. Those can be:

  • physical drive errors
  • bad sectors or other detected ongoing data corruption
  • hard drive driver errors
  • filesystem corruption

Look out for disk- or filesystem-related errors in:

  • syslog
  • dmesg
  • physical console (e.g. IPMI console)

In some cases with ext4, running fsck can fix issues. However, watch out for files disappearing or being moved to lost+found if the filesystem encounters serious enough inconsistencies.
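
A minimal sketch of such a check, assuming an ext4 filesystem on a device that can be unmounted (the mount point and device are illustrative; for the root filesystem, boot into a rescue environment instead):

# unmount first: never fsck a mounted filesystem
umount /srv
# force a full check even if the filesystem looks clean
fsck.ext4 -f /dev/sdb1
mount /srv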

If the hard disk is showing signs of breakage, that disk will usually get ejected from the RAID array without blocking the filesystem. However, if the disk breakage did impact filesystem consistency and caused it to switch to readonly, migrate the data away from that drive ASAP, for example by moving the instance to its secondary node or by rsyncing it to another machine.

In such a case, you'll also want to review what other instances are currently using the same drive and possibly move all of those instances as well before replacing the drive.

Web server down

Apache web server diagnostics

If you get an alert like ApacheDown, that is:

Apache web server down on test.example.com

It means the apache exporter cannot contact the local web server over its control address http://localhost/server-status/?auto. First, confirm whether this is a problem with the exporter or the entire service, by checking the main service on this host to see if users are affected; if they are, prioritize that.
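
For example, a quick check of the main service from the outside (hostname as in the alert above, and assuming it serves HTTPS):

curl -sI https://test.example.com/

A 2xx or 3xx status line suggests users are fine and the problem is likely limited to the exporter.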

It's possible, for example, that the webserver has crashed for some reason. The best way to figure that out is to check the service status with:

service apache2 status

You should see something like this if the server is running correctly:

● apache2.service - The Apache HTTP Server
     Loaded: loaded (/lib/systemd/system/apache2.service; enabled; preset: enabled)
     Active: active (running) since Tue 2024-09-10 14:56:49 UTC; 1 day 5h ago
       Docs: https://httpd.apache.org/docs/2.4/
    Process: 475367 ExecReload=/usr/sbin/apachectl graceful (code=exited, status=0/SUCCESS)
   Main PID: 338774 (apache2)
      Tasks: 53 (limit: 4653)
     Memory: 28.6M
        CPU: 11min 30.297s
     CGroup: /system.slice/apache2.service
             ├─338774 /usr/sbin/apache2 -k start
             └─475411 /usr/sbin/apache2 -k start

Sep 10 17:51:50 donate-01 systemd[1]: Reloading apache2.service - The Apache HTTP Server...
Sep 10 17:51:50 donate-01 systemd[1]: Reloaded apache2.service - The Apache HTTP Server.
Sep 10 19:53:00 donate-01 systemd[1]: Reloading apache2.service - The Apache HTTP Server...
Sep 10 19:53:00 donate-01 systemd[1]: Reloaded apache2.service - The Apache HTTP Server.
Sep 11 00:00:01 donate-01 systemd[1]: Reloading apache2.service - The Apache HTTP Server...
Sep 11 00:00:01 donate-01 systemd[1]: Reloaded apache2.service - The Apache HTTP Server.
Sep 11 01:29:29 donate-01 systemd[1]: Reloading apache2.service - The Apache HTTP Server...
Sep 11 01:29:29 donate-01 systemd[1]: Reloaded apache2.service - The Apache HTTP Server.
Sep 11 19:50:51 donate-01 systemd[1]: Reloading apache2.service - The Apache HTTP Server...
Sep 11 19:50:51 donate-01 systemd[1]: Reloaded apache2.service - The Apache HTTP Server.

With the first dot (●) in green and the Active line saying active (running). If it isn't, the logs should show why it failed to start.

It's possible you won't see the right logs there if the service is stuck in a restart loop. In that case, use this command instead to see the service logs:

journalctl -b -u apache2

That shows the service's logs since the current boot.

If the main service is online and it's only the exporter having trouble, try to reproduce the issue with curl from the affected server, for example:

root@test.example.com:~# curl http://localhost/server-status/?auto

Normally, this should work, but it's possible Apache is misconfigured and doesn't listen on localhost for some reason. Look at the apache2ctl -S output, and the rest of the Apache configuration in /etc/apache2, particularly the Listen directives (typically in ports.conf).
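
A few commands to confirm what Apache is actually bound to (a sketch; adjust paths as needed):

# virtual host and listener summary
apache2ctl -S
# where the listeners are configured
grep -Rn 'Listen' /etc/apache2/ports.conf
# what the kernel says is actually listening
ss -tlnp | grep -i apache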

See also the Apache exporter scraping failed instructions in the Prometheus documentation, a related alert.

Disk is full or nearly full

When a disk is filled up to 100% of its capacity, some processes may fail to keep working normally. For example, PostgreSQL will purposefully exit when that happens in order to avoid the risk of data corruption. MySQL is not so graceful and can end up with corruption in some of its databases.

The first step is to check how long you have. For this, a good tool is the Grafana disk usage dashboard. Select the affected instance and look at the "change rate" panel; it should show you how much time is left per partition.

To clear up this situation, there are two approaches that can be used in succession:

  • find what's using disk space and clear out some files
  • grow the disk

First, try to identify where disk space is being used and remove some big files. For example, if the root partition is full, this will show you what is taking up space:

ncdu -x /
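
If ncdu isn't available, plain du can do the same job; this lists the twenty largest directories on the root filesystem:

du -xh / 2>/dev/null | sort -rh | head -20

The -x flag keeps both tools on one filesystem, so other mounted partitions don't skew the results.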

Examples

Maybe the syslog grew to ridiculous sizes? Try:

logrotate -f /etc/logrotate.d/syslog-ng

Maybe some users have huge DB dumps lying around in their home directory. After confirming that those files can be deleted:

rm /home/flagada/huge_dump.sql

Maybe the systemd journal has grown too big. This will keep only 500MB:

journalctl --vacuum-size=500M

If in the cleanup phase you can't identify files that can be removed, you'll need to grow the disk. See how to grow disks with ganeti.

Note that it's possible a suddenly growing disk might be a symptom of a larger problem, for example bots crawling a website abusively or an attacker running a denial of service attack. This warrants further (and more complex) investigation, of course, but can be delegated to after the disk usage alert has been handled.


Host clock desynchronized

If a host's clock has drifted and is no longer in sync with the rest of the internet, some really strange things can start happening, like TLS connections failing even though the certificate is still valid.

If a host has time synchronization issues, check that the ntpd service is still running:

systemctl status ntpd.service

You can gather information about which peer servers are drifting:

ntpq -pun

Logs for this service are sent to syslog, so you can take a look there to see if some errors were mentioned.

If restarting the ntpd service does not work, verify that a firewall is not blocking port 123 UDP.
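
A couple of sanity checks along those lines (the pool server name is illustrative, and the ntpdate package may need to be installed):

# query a time server without touching the local clock
ntpdate -q 0.debian.pool.ntp.org
# look for firewall rules mentioning the NTP port
nft list ruleset | grep -n 123

If the ntpdate query times out, packets to UDP port 123 are probably being dropped somewhere on the path.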

Support policies

Please see TPA-RFC-2: support.