Summary: switch to barman for PostgreSQL backups, rebuild or resize bungei as needed to cover for metrics needs
Background
TPA currently uses a PostgreSQL backup system based on point-in-time recovery (PITR). This is really nice because it gives us a full, incremental backup history, along with easy "full" restores at periodic intervals.
Unfortunately, it is built on a set of scripts used only by TPA and DSA, which are hard to use and to debug.
We want to consider other alternatives and make a plan for that migration. In tpo/tpa/team#41557, we have set up a new backup server in the secondary point of presence and should use it to back up PostgreSQL servers from the first point of presence, so that we can more easily survive a total site failure as well.
In TPA-RFC-63: Storage server budget, we've already proposed using barman, but didn't mention geographic distribution or a migration plan.
The plan for that server was also to deal with the disk usage
explosion from the network health team, which is causing the current
storage server to run out of space (tpo/tpa/team#41372), but we
didn't realize the largest PostgreSQL server was in the same location
as the new backup server, which means the new server might not
actually solve the problem, as far as databases are concerned. For
this, we might need to replace our existing storage server (bungei),
which is getting past its retirement age anyway, as it was set up in
March 2019 (so it is 5 years old at the time of writing).
Proposal
Switch to barman as our new PostgreSQL backup system. Migrate all servers in the gnt-fsn cluster to the new system on the new backup server, then convert the remaining legacy backups on the old backup server.
If necessary, resize disks on the old backup server to make room for the metrics storage, or replace that aging server with a new rental server.
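To give a rough idea of what a per-server barman setup could look like, here is a minimal sketch that renders one barman server stanza for streaming backups. The configuration keys are standard barman options, but the domain (`example.org`), the role names (`barman`, `streaming_barman`) and the helper `barman_server_config` are hypothetical: the actual values would come out of the testing and puppetization work.

```python
import configparser
import io

# Hypothetical domain: the RFC does not name the servers' FQDNs.
DOMAIN = "example.org"


def barman_server_config(server: str) -> str:
    """Render one barman server stanza using streaming backups."""
    fqdn = f"{server}.{DOMAIN}"
    config = configparser.ConfigParser(interpolation=None)
    config[server] = {
        "description": f"PostgreSQL backups for {server}",
        # Role names below are assumptions, not taken from the RFC.
        "conninfo": f"host={fqdn} user=barman dbname=postgres",
        "streaming_conninfo": f"host={fqdn} user=streaming_barman",
        "backup_method": "postgres",  # base backups through pg_basebackup
        "streaming_archiver": "on",   # WAL streamed through pg_receivewal
        "slot_name": "barman",        # replication slot used for WAL streaming
    }
    out = io.StringIO()
    config.write(out)
    return out.getvalue()


if __name__ == "__main__":
    print(barman_server_config("weather-01"))
```

Generating the stanza from the server name keeps the per-server files identical in shape, which is the kind of thing the Puppet module would likely template.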
Goals
Must have
- geographic redundancy: have database backups in a different provider and geographic location than their primary storage
- solve space issues: we're constantly having issues with the storage server filling up, we need to solve this in the long term
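The geographic redundancy goal can be expressed as a simple rule: a cluster's backups go to a backup host at the other site. The sketch below is only an illustration; the site labels are ours, and placing bungei at Hetzner and backup-storage-01 in Dallas is an assumption inferred from the rest of this document.

```python
# Assumption: gnt-fsn is at Hetzner and gnt-dal in Dallas, as described
# in the appendix; the backup host locations are inferred from the text.
CLUSTER_SITE = {"gnt-fsn": "hetzner", "gnt-dal": "dallas"}
BACKUP_SITE = {"bungei": "hetzner", "backup-storage-01": "dallas"}


def pick_backup_host(cluster: str) -> str:
    """Return a backup host located at a different site than the cluster."""
    site = CLUSTER_SITE[cluster]
    for host, host_site in BACKUP_SITE.items():
        if host_site != site:
            return host
    raise LookupError(f"no off-site backup host for {cluster}")


if __name__ == "__main__":
    for cluster in CLUSTER_SITE:
        print(cluster, "->", pick_backup_host(cluster))
```

Under those assumptions, gnt-fsn servers are backed up to backup-storage-01 and gnt-dal servers to bungei.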
Nice to have
- well-established code base: use more standard backup software, not developed and maintained only by us and debian.org
Non-Goals
- global backup policy review: we're not touching bacula or retention policies
- high availability: we're not setting up extra database servers for high availability, this is only for backups
Migration plan
We're again pressed for time so we need to come up with a procedure that will give us some room on the backup server while simultaneously minimizing the risk to the backup integrity.
To do this, we're going to migrate a mix of small servers (at first) and large ones (sooner than we'd like).
Phase I: alpha testing
Migrate the following backups from bungei to backup-storage-01:
- weather-01 (12.7GiB)
- rude (35.1GiB)
- materculae (151.9GiB)
Phase II: beta testing
After a week, retire the above backups from bungei, then migrate the following servers:
- gitlab-02 (34.9GiB)
- polyanthum (20.3GiB)
- meronense (505.1GiB)
Phase III: production
After another week, migrate the last backups from bungei:
- bacula-director-01 (180.8GiB)
At this point, we should hopefully have enough room on the backup server to survive the holidays.
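As a rough check of how much room the first three phases give back, the sketch below adds up the sizes listed for each phase. It assumes those sizes are a fair proxy for what each backup occupies on bungei, and keep in mind that space is only reclaimed once the legacy copies are retired, a week after each migration.

```python
# Sizes in GiB, copied from the phase lists above.
PHASES = {
    "I": {"weather-01": 12.7, "rude": 35.1, "materculae": 151.9},
    "II": {"gitlab-02": 34.9, "polyanthum": 20.3, "meronense": 505.1},
    "III": {"bacula-director-01": 180.8},
}


def cumulative_gib(phases):
    """Return the running total of migrated data after each phase."""
    total = 0.0
    result = {}
    for name, servers in phases.items():
        total += sum(servers.values())
        result[name] = round(total, 1)
    return result


if __name__ == "__main__":
    for phase, gib in cumulative_gib(PHASES).items():
        print(f"after phase {phase}: {gib} GiB moved off bungei")
```

This comes to about 200 GiB after phase I, 760 GiB after phase II, and 941 GiB once phase III is done.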
Phase IV: retire legacy, bungei replacement
At this point, the only backups using the legacy system are the ones from the gnt-dal cluster (4 servers). Rebuild those with the new service. Do not keep a copy of the legacy system on bungei (to save space, particularly for metricsdb-01) but possibly archive a copy of the legacy backups on backup-storage-01:
- metricsdb-01 (1.6TiB)
- puppetdb-01 (20.2GiB)
- survey-01 (5.7GiB)
- anonticket-01 (3.9GiB)
If we still run out of disk space on bungei, consider replacing the server entirely. The server is now 5 years old, which is getting close to our current amortization time (6 years), and it's a rental server, so it's relatively easy to replace, as we don't need to buy new hardware.
Alternatives considered
See the alternatives considered in our PostgreSQL documentation.
Costs
Staff estimates (3-4 weeks)
| Task | Time | Complexity | Estimate | Days | Note |
|---|---|---|---|---|---|
| pgbarman testing and manual setup | 3 days | high | 1 week | 6 | |
| pgbarman puppetization | 3 days | medium | 1 week | 4.5 | |
| migrate 12 servers | 3 days | high | 1 week | 4.5 | assuming we can migrate 4 servers per day |
| legacy code cleanup | 1 day | low | ~1 day | 1.1 | |
| Sub-total | 2 weeks | ~medium | 3 weeks | 16.1 | |
| bungei replacement | 3 days | low | ~3 days | 3.3 | optional |
| bungei resizing | 1 day | low | ~1 day | 1.1 | optional |
| Total | ~3 weeks | ~medium | ~4 weeks | 20.5 | |
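The totals in the table can be re-derived from the "Days" column. The sketch below simply sums that column with and without the optional tasks; the 5-day work week used to convert days to weeks is our assumption.

```python
# (task, days, optional) copied from the "Days" column of the table above.
TASKS = [
    ("pgbarman testing and manual setup", 6.0, False),
    ("pgbarman puppetization", 4.5, False),
    ("migrate 12 servers", 4.5, False),
    ("legacy code cleanup", 1.1, False),
    ("bungei replacement", 3.3, True),
    ("bungei resizing", 1.1, True),
]

# Assumption: a work week is 5 days; the RFC does not state this.
DAYS_PER_WEEK = 5


def total_days(tasks, include_optional):
    """Sum the day estimates, with or without the optional tasks."""
    days = sum(d for _, d, opt in tasks if include_optional or not opt)
    return round(days, 1)


if __name__ == "__main__":
    for label, optional in (("sub-total", False), ("total", True)):
        days = total_days(TASKS, optional)
        print(f"{label}: {days} days, {days / DAYS_PER_WEEK:.1f} weeks")
```

This gives 16.1 days (a bit over 3 weeks) without the optional bungei work and 20.5 days (about 4 weeks) with it, matching the heading.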
Hosting costs (+70EUR/mth, optional)
bungei is a SX132 server, billed monthly at 175EUR. It has the
following specifications:
- Intel Xeon E5-1650 (12 Core, 3.5GHz)
- RAM: 128GiB DDR4
- Storage: 10x10TB SAS drives (100TB, HGST HUH721010AL)
A likely replacement would be the SX135 server, at 243EUR/mth with a 94EUR setup fee:
- AMD Ryzen 9 3900 (12 core, 3.1GHz)
- RAM: 128GiB
- Storage: 8x22TB SATA drives (176TB)
There's a cheaper server, the SX65 at 124EUR/mth, but it has less disk space (4x22TB, 88TB). That said, it might be enough if we do not need to grow bungei and simply need to retire it.
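To compare the options, the sketch below computes the monthly price difference against bungei and the price per raw TB, using only the prices and capacities quoted above. Raw capacity is used because the RAID layout is not specified here; usable space will be lower.

```python
# (monthly price in EUR, raw capacity in TB), as quoted above.
SERVERS = {
    "SX132 (bungei)": (175, 100),
    "SX135": (243, 176),
    "SX65": (124, 88),
}
CURRENT = "SX132 (bungei)"


def compare(servers, current):
    """Return (monthly delta vs current, EUR per raw TB) for each model."""
    base_price, _ = servers[current]
    return {
        name: (price - base_price, round(price / tb, 2))
        for name, (price, tb) in servers.items()
    }


if __name__ == "__main__":
    for name, (delta, per_tb) in compare(SERVERS, CURRENT).items():
        print(f"{name}: {delta:+d} EUR/mth, {per_tb} EUR/TB/mth")
```

The SX135 costs 68EUR/mth more than bungei (hence the roughly +70EUR/mth in the heading, before the one-time 94EUR setup fee), while the SX65 would be 51EUR/mth cheaper.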
References
Appendix
Backups inventory
Here's the list of current PostgreSQL databases on the storage server and their locations:
| server | location | size | note |
|---|---|---|---|
| anonticket-01 | gnt-dal | 3.9GiB | |
| bacula-director-01 | gnt-fsn | 180.8GiB | |
| gitlab-02 | gnt-fsn | 34.9GiB | move to gnt-dal considered, #41431 |
| materculae | gnt-fsn | 151.9GiB | |
| meronense | gnt-fsn | 505.1GiB | |
| metricsdb-01 | gnt-dal | 1.6TiB | huge! |
| polyanthum | gnt-fsn | 20.3GiB | |
| puppetdb-01 | gnt-dal | 20.2GiB | |
| rude | gnt-fsn | 35.1GiB | |
| survey-01 | gnt-dal | 5.7GiB | |
| weather-01 | gnt-fsn | 12.7GiB | |
gnt-fsn servers
Same, but only for the servers at Hetzner, sorted by size:
| server | size |
|---|---|
| meronense | 505.1GiB |
| bacula-director-01 | 180.8GiB |
| materculae | 151.9GiB |
| rude | 35.1GiB |
| gitlab-02 | 34.9GiB |
| polyanthum | 20.3GiB |
| weather-01 | 12.7GiB |
gnt-dal
Same for Dallas:
| server | size |
|---|---|
| metricsdb-01 | 1.6TiB |
| puppetdb-01 | 20.2GiB |
| survey-01 | 5.7GiB |
| anonticket-01 | 3.9GiB |