Summary: provision test servers that sit idle to monitor infrastructure and stage deployments

Background

In various recent incidents, it became apparent that we don't have a good place to test deployments or observe "normal" behavior on servers.

Examples:

  • While deploying the needrestart package (tpo/tpa/team#41633), we had to deploy on perdulce (AKA people.tpo) and test there. This had no negative impact.

  • While testing a workaround to mini-nag's deprecation (tpo/tpa/team#41734), perdulce was used again, but an operator error destroyed /dev/null, and the operator failed to recreate it. Impact was minor: some errors during a nightly job, which a reboot promptly fixed.

  • While diagnosing a network outage (e.g. tpo/tpa/team#41740), it can be hard to tell if issues are related to a server's exotic configuration or our baseline (in that case, single-stack IPv4 vs IPv6).

  • While diagnosing performance issues in Ganeti clusters, we can sometimes suffer from the "noisy neighbor" syndrome, where another VM in the cluster "pollutes" the server and causes bad performance.

  • Rescue boxes were set up with insufficient disk space, because we actually have no idea what our minimum space requirements are (tpo/tpa/team#41666).

We previously had an ipv6only.torproject.org server, which was retired in TPA-RFC-23 (tpo/tpa/team#40727) because it was undocumented and blocking deployment. It also didn't seem to be under any sort of configuration management.

Proposal

Create a pair of "idle canary servers", one per cluster, named idle-fsn-01 and idle-dal-02.

Optionally deploy an idle-dal-ipv6only-03 and idle-dal-ipv4only-04 pair to test single-stack configuration for eventual dual-stack monitoring (tpo/tpa/team#41714).

Server specifications and usage

  • zero configuration in Puppet, unless specifically required for the role (e.g. an IPv4-only or IPv6-only stack might be an acceptable configuration)
  • some test deployments are allowed, but should be reverted cleanly as much as possible; on total failure, a new host should be reinstalled from scratch instead of letting it drift into unmanaged chaos
  • files in /home and /tmp cleared out automatically on a weekly basis, with the motd clearly stating that fact (see the sketch below)
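
A minimal sketch of how the weekly scrub could look; the script path and name are hypothetical, and in practice this would presumably be managed through Puppet:

#!/bin/sh
# /etc/cron.weekly/idle-scrub (hypothetical): wipe everything under
# /home and /tmp, as announced in the motd
find /home /tmp -mindepth 1 -delete

The motd warning itself would presumably be shipped by the same Puppet role.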

Hardware configuration

| component  | current minimum | proposed spec | note                                       |
|------------|-----------------|---------------|--------------------------------------------|
| CPU count  | 1               | 1             |                                            |
| RAM        | 960MiB          | 512MiB        | covers 25% of current servers              |
| Swap       | 50MiB           | 100MiB        | covers 90% of current servers              |
| Total Disk | 10GiB           | ~5.6GiB       |                                            |
| /          | 3GiB            | 5GiB          | current median used size                   |
| /boot      | 270MiB          | 512MiB        | /boot often filling up on dal-rescue hosts |
| /boot/efi  | 124MiB          | N/A           | no EFI support in Ganeti clusters          |
| /home      | 10GiB           | N/A           | /home on root filesystem                   |
| /srv       | 10GiB           | N/A           | same                                       |
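
For illustration, creating one of those instances with plain Ganeti commands could look roughly like the following; the node names, OS variant and the rounded-up disk size are placeholders, and the usual TPA installation procedure would apply in practice:

gnt-instance add -t drbd -o debootstrap+default \
  -n <primary-node>:<secondary-node> \
  -B memory=512M,vcpus=1 \
  --disk 0:size=6G \
  idle-fsn-01.torproject.org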

Goals

  • identify "noisy neighbors" in each Ganeti cluster
  • keep a long-term "minimum requirements" specification for servers, continuously validated throughout upgrades
  • provide an impact-free testing ground for upgrades, test deployments and environments
  • trace long-term usage trends, for example electric power usage (tpo/tpa/team#40163) or the basic CPU usage cycles of recurring jobs like unattended upgrades (tpo/tpa/team#40934)
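
As a sketch of that last goal, long-term trends on the new hosts could be graphed with queries along these lines (the idle-.* instance pattern is an assumption based on the proposed names):

sum(rate(node_cpu_seconds_total{instance=~"idle-.*",mode!="idle"}[7d])) by (instance)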

Timeline

No fixed timeline. Those servers can be deployed in our precious free time, but it would be nice to actually have them deployed eventually. No rush.

Appendix

Some observations on current usage:

Memory usage

Sample query (25th percentile):

quantile(0.25, node_memory_MemTotal_bytes -
  node_memory_MemFree_bytes - (node_memory_Cached_bytes +
  node_memory_Buffers_bytes))
≈ 486 MiB
  • minimum is currently carinatum, at 228MiB; perdulce and ssh-dal are closer to 300MiB
  • a quarter of servers use less than 512MiB of RAM, median is 1GiB, 90th %ile is 17GB
  • largest memory use is on dal-node-01, at 310GiB (out of 504GiB, or 61.5%)
  • largest used ratio is colchicifolium at 94.2%, followed by gitlab-02 at 68%
  • largest memory size is ci-runner-x86-03 at 1.48TiB, followed by the dal-node cluster at 504GiB per node; median is 8GiB, 90th %ile is 74GB
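
The used ratios above can be derived with a query like the following (a sketch, reusing the same node exporter metrics as above):

sort_desc(
  (node_memory_MemTotal_bytes - node_memory_MemFree_bytes
    - node_memory_Cached_bytes - node_memory_Buffers_bytes)
  / node_memory_MemTotal_bytes
)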

Swap usage

Sample query (median used swap):

quantile(0.5, node_memory_SwapTotal_bytes-node_memory_SwapFree_bytes)
= 0 bytes
  • Median swap usage is zero; in other words, 50% of servers do not touch swap at all
  • median size is 2GiB
  • some servers have large swap space (tb-build-02 and -03 have 300GiB, -06 has 100GiB and gnt-fsn nodes have 64GiB)

| Percentile | Usage  | Size |
|------------|--------|------|
| 50%        | 0      | 2GiB |
| 75%        | 16MiB  | 4GiB |
| 90%        | 100MiB | N/A  |
| 95%        | 400MiB | N/A  |
| 99%        | 1.2GiB | N/A  |
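
The percentiles in this table are presumably obtained by varying the quantile parameter in the query above, for example for the 95th percentile:

quantile(0.95, node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes)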

Disk usage

Sample query (median root partition used space):

quantile(0.5,
  sum(node_filesystem_size_bytes{mountpoint="/"}) by (alias, mountpoint)
  - sum(node_filesystem_avail_bytes{mountpoint="/"}) by (alias,mountpoint)
)
≈ 5GiB
  • 90% of servers fit in 10GiB of disk space for the root filesystem; median usage is around 5GiB
  • median /boot usage is actually much lower than our specification, at 139.4MiB, but the problem is with edge cases: we know we're having trouble at the 2^8MiB (256MiB) boundary, so we're simply doubling that
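
The /boot numbers can be checked with the same pattern as the root filesystem query, for example (a sketch):

quantile(0.5,
  sum(node_filesystem_size_bytes{mountpoint="/boot"}) by (alias, mountpoint)
  - sum(node_filesystem_avail_bytes{mountpoint="/boot"}) by (alias, mountpoint)
)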

CPU usage

Sample query (median percentage with one decimal):

quantile(0.5,
  round(
    sum(
      rate(node_cpu_seconds_total{mode!="idle"}[24h])
    ) by (instance)
    / count(node_cpu_seconds_total{mode="idle"}) by (instance) * 1000)
  /10
)
≈ 2.5%

Servers sorted by CPU usage in the last 7 days:

sort_desc(
  round(
    sum(
       rate(node_cpu_seconds_total{mode!="idle"}[7d])
    ) by (instance)
    / count(node_cpu_seconds_total{mode="idle"}) by (instance) * 1000)
  /10
)
  • Half of the servers used only 2.5% of CPU time over the last 24h.
  • median is, perhaps surprisingly, similar for the last 30 days.
  • metricsdb-01 used 76% of a CPU in the last 24h at the time of writing
  • over the last week, results vary more, with relay-01 using 45%, colchicifolium and check-01 40%, and metricsdb-01 33%...

| Percentile    | last 24h usage ratio |
|---------------|----------------------|
| 50th (median) | 2.5%                 |
| 90th          | 22%                  |
| 95th          | 32%                  |
| 99th          | 45%                  |
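
As with swap, these percentiles follow from varying the quantile in the 24h query above, for example for the 90th percentile:

quantile(0.9,
  round(
    sum(
      rate(node_cpu_seconds_total{mode!="idle"}[24h])
    ) by (instance)
    / count(node_cpu_seconds_total{mode="idle"}) by (instance) * 1000)
  /10
)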