Summary: provision test servers that sit idle to monitor infrastructure and stage deployments

Background

In various recent incidents, it became apparent that we don't have a good place to test deployments or observe "normal" behavior on servers.

Examples:

  • While deploying the needrestart package (tpo/tpa/team#41633), we had to deploy on perdulce (AKA people.tpo) and test there. This had no negative impact.

  • While testing a workaround to mini-nag's deprecation (tpo/tpa/team#41734), perdulce was used again, but an operator error destroyed /dev/null, and the operator failed to recreate it. Impact was minor: some errors during a nightly job, which a reboot promptly fixed.

  • While diagnosing a network outage (e.g. tpo/tpa/team#41740), it can be hard to tell if issues are related to a server's exotic configuration or our baseline (in that case, single-stack IPv4 vs IPv6).

  • While diagnosing performance issues in Ganeti clusters, we can sometimes suffer from the "noisy neighbor" syndrome, where another VM in the cluster "pollutes" the server and causes bad performance.

  • Rescue boxes were set up with insufficient disk space, because we actually have no idea what our minimum space requirements are (tpo/tpa/team#41666).

We previously had an ipv6only.torproject.org server, which was retired in TPA-RFC-23 (tpo/tpa/team#40727) because it was undocumented and blocking deployment. It also didn't seem to be under any sort of configuration management.

Proposal

Create a pair of "idle canary servers", one per cluster, named idle-fsn-01 and idle-dal-02.

Optionally deploy an idle-dal-ipv6only-03 and idle-dal-ipv4only-04 pair to test single-stack configuration for eventual dual-stack monitoring (tpo/tpa/team#41714).

Server specifications and usage

  • zero configuration in Puppet, unless specifically required for the role (e.g. an IPv4-only or IPv6-only stack might be an acceptable configuration)
  • some test deployments are allowed, but should be reverted cleanly as much as possible; on total failure, a new host should be reinstalled from scratch instead of letting it drift into unmanaged chaos
  • files in /home and /tmp cleared out automatically on a weekly basis, with the motd clearly stating that fact (see the sketch below)
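
A minimal sketch of how the weekly scrub could look; the script path and name are hypothetical, and in practice this would presumably be managed through Puppet:

#!/bin/sh
# /etc/cron.weekly/idle-scrub (hypothetical): wipe everything under
# /home and /tmp, as announced in the motd
find /home /tmp -mindepth 1 -delete

The motd warning itself would presumably be shipped by the same Puppet role.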

Hardware configuration

| component  | current minimum | proposed spec | note                                       |
|------------|-----------------|---------------|--------------------------------------------|
| CPU count  | 1               | 1             |                                            |
| RAM        | 960MiB          | 512MiB        | covers 25% of current servers              |
| Swap       | 50MiB           | 100MiB        | covers 90% of current servers              |
| Total Disk | 10GiB           | ~5.6GiB       |                                            |
| /          | 3GiB            | 5GiB          | current median used size                   |
| /boot      | 270MiB          | 512MiB        | /boot often filling up on dal-rescue hosts |
| /boot/efi  | 124MiB          | N/A           | no EFI support in Ganeti clusters          |
| /home      | 10GiB           | N/A           | /home on root filesystem                   |
| /srv       | 10GiB           | N/A           | same                                       |
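
For illustration, creating one of those instances with plain Ganeti commands could look roughly like the following; the node names, OS variant and the rounded-up disk size are placeholders, and the usual TPA installation procedure would apply in practice:

gnt-instance add -t drbd -o debootstrap+default \
  -n <primary-node>:<secondary-node> \
  -B memory=512M,vcpus=1 \
  --disk 0:size=6G \
  idle-fsn-01.torproject.org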

Goals

  • identify "noisy neighbors" in each Ganeti cluster
  • keep a long-term "minimum requirements" specification for servers, continuously validated throughout upgrades
  • provide an impact-free testing ground for upgrades, test deployments and environments
  • trace long-term usage trends, for example electric power usage (tpo/tpa/team#40163) or the basic CPU usage cycles of recurring jobs like unattended upgrades (tpo/tpa/team#40934)
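
As a sketch of that last goal, long-term trends on the new hosts could be graphed with queries along these lines (the idle-.* instance pattern is an assumption based on the proposed names):

sum(rate(node_cpu_seconds_total{instance=~"idle-.*",mode!="idle"}[7d])) by (instance)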

Timeline

No fixed timeline. Those servers can be deployed in our precious free time, but it would be nice to actually have them deployed eventually. No rush.

Appendix

Some observations on current usage:

Memory usage

Sample query (25th percentile):

quantile(0.25, node_memory_MemTotal_bytes -
  node_memory_MemFree_bytes - (node_memory_Cached_bytes +
  node_memory_Buffers_bytes))
≈ 486 MiB
  • minimum is currently carinatum, at 228MiB; perdulce and ssh-dal are closer to 300MiB
  • a quarter of servers use less than 512MiB of RAM, median is 1GiB, 90th %ile is 17GB
  • largest memory use is on dal-node-01, at 310GiB (out of 504GiB, or 61.5%)
  • largest used ratio is colchicifolium at 94.2%, followed by gitlab-02 at 68%
  • largest memory size is ci-runner-x86-03 at 1.48TiB, followed by the dal-node cluster at 504GiB per node; median is 8GiB, 90th %ile is 74GB
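
The used ratios above can be derived with a query like the following (a sketch, reusing the same node exporter metrics as above):

sort_desc(
  (node_memory_MemTotal_bytes - node_memory_MemFree_bytes
    - node_memory_Cached_bytes - node_memory_Buffers_bytes)
  / node_memory_MemTotal_bytes
)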

Swap usage

Sample query (median used swap):

quantile(0.5, node_memory_SwapTotal_bytes-node_memory_SwapFree_bytes)
= 0 bytes
  • Median swap usage is zero; in other words, 50% of servers do not touch swap at all
  • median size is 2GiB
  • some servers have large swap space (tb-build-02 and -03 have 300GiB, -06 has 100GiB and gnt-fsn nodes have 64GiB)

| Percentile | Usage  | Size |
|------------|--------|------|
| 50%        | 0      | 2GiB |
| 75%        | 16MiB  | 4GiB |
| 90%        | 100MiB | N/A  |
| 95%        | 400MiB | N/A  |
| 99%        | 1.2GiB | N/A  |
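
The percentiles in this table are presumably obtained by varying the quantile parameter in the query above, for example for the 95th percentile:

quantile(0.95, node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes)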

Disk usage

Sample query (median root partition used space):

quantile(0.5,
  sum(node_filesystem_size_bytes{mountpoint="/"}) by (alias, mountpoint)
  - sum(node_filesystem_avail_bytes{mountpoint="/"}) by (alias,mountpoint)
)
≈ 5GiB
  • 90% of servers fit in 10GiB of disk space for the root filesystem; median usage is around 5GiB
  • median /boot usage is actually much lower than our specification, at 139.4MiB, but the problem is with edge cases: we know we're having trouble at the 2^8MiB (256MiB) boundary, so we're simply doubling that
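
The /boot numbers can be checked with the same pattern as the root filesystem query, for example (a sketch):

quantile(0.5,
  sum(node_filesystem_size_bytes{mountpoint="/boot"}) by (alias, mountpoint)
  - sum(node_filesystem_avail_bytes{mountpoint="/boot"}) by (alias, mountpoint)
)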

CPU usage

Sample query (median percentage with one decimal):

quantile(0.5,
  round(
    sum(
      rate(node_cpu_seconds_total{mode!="idle"}[24h])
    ) by (instance)
    / count(node_cpu_seconds_total{mode="idle"}) by (instance) * 1000)
  /10
)
≈ 2.5%

Servers sorted by CPU usage in the last 7 days:

sort_desc(
  round(
    sum(
       rate(node_cpu_seconds_total{mode!="idle"}[7d])
    ) by (instance)
    / count(node_cpu_seconds_total{mode="idle"}) by (instance) * 1000)
  /10
)
  • Half of the servers used only 2.5% of CPU time over the last 24h.
  • median is, perhaps surprisingly, similar for the last 30 days.
  • metricsdb-01 used 76% of a CPU in the last 24h at the time of writing
  • over the last week, results vary more, with relay-01 using 45%, colchicifolium and check-01 40%, and metricsdb-01 33%...

| Percentile    | last 24h usage ratio |
|---------------|----------------------|
| 50th (median) | 2.5%                 |
| 90th          | 22%                  |
| 95th          | 32%                  |
| 99th          | 45%                  |
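
As with swap, these percentiles follow from varying the quantile in the 24h query above, for example for the 90th percentile:

quantile(0.9,
  round(
    sum(
      rate(node_cpu_seconds_total{mode!="idle"}[24h])
    ) by (instance)
    / count(node_cpu_seconds_total{mode="idle"}) by (instance) * 1000)
  /10
)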