Summary: switch to barman for PostgreSQL backups, rebuild or resize bungei as needed to cover for metrics needs

Background

TPA currently backs up PostgreSQL with a system based on point-in-time recovery (PITR). This works well: it gives us a full, incremental backup history, along with easy "full" restores at periodic intervals.

Unfortunately, it is built on a set of scripts used only by TPA and DSA, which are hard to use and to debug.

We want to consider other alternatives and make a plan for that migration. In tpo/tpa/team#41557, we set up a new backup server in the secondary point of presence. We should use it to back up PostgreSQL servers from the first point of presence, so that we could more easily survive a total site failure as well.

In TPA-RFC-63: Storage server budget, we've already proposed using barman, but didn't mention geographic distribution or a migration plan.

The plan for that server was also to deal with the disk usage explosion on the network health team, which is causing the current storage server to run out of space (tpo/tpa/team#41372). But we didn't realize the largest PostgreSQL server was in the same location as the new backup server, which means the new server might not actually solve the problem as far as databases are concerned. For this, we might need to replace our existing storage server (bungei), which is getting past its retirement age anyway: it was set up in March 2019, so it is 5 years old at the time of writing.

Proposal

Switch to barman as our new PostgreSQL backup system. Migrate all servers in the gnt-fsn cluster to the new system on the new backup server, then convert the legacy backups that remain on the old backup server.

If necessary, resize disks on the old backup server to make room for the metrics storage, or replace that aging server with a new rental server.
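To give a rough idea of what a per-server barman definition could look like, here is a minimal Python sketch that renders an INI-style server section. It assumes streaming backups (`backup_method = postgres` with a replication slot), which this proposal does not specify. The host name, the `barman` and `streaming_barman` users and the slot name are hypothetical placeholders; the real configuration would presumably come out of the puppetization work.

```python
import configparser
import io


def barman_server_config(name: str, pg_host: str) -> str:
    """Render a barman server section as INI text.

    Assumption: streaming backups; user and slot names are hypothetical.
    """
    cfg = configparser.ConfigParser(interpolation=None)
    cfg[name] = {
        "description": f"PostgreSQL backups for {name}",
        # Regular connection, used for checks and backup control.
        "conninfo": f"host={pg_host} user=barman dbname=postgres",
        # Replication connection, used for base backups and WAL streaming.
        "streaming_conninfo": f"host={pg_host} user=streaming_barman",
        "backup_method": "postgres",
        "streaming_archiver": "on",
        "slot_name": "barman",
        "create_slot": "auto",
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


# Hypothetical host name, for illustration only.
print(barman_server_config("weather-01", "weather-01.example.org"))
```

Generating the section from a function keeps the per-server differences down to a name and a host, which is roughly the shape a configuration-management template would take.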

Goals

Must have

  • geographic redundancy: have database backups in a different provider and geographic location than their primary storage

  • solve space issues: we're constantly having issues with the storage server filling up, we need to solve this in the long term

Nice to have

  • well-established code base: use a more standard backup software not developed and maintained only by us and debian.org

Non-Goals

  • global backup policy review: we're not touching bacula or retention policies

  • high availability: we're not setting up extra database servers for high availability, this is only for backups

Migration plan

We're again pressed for time so we need to come up with a procedure that will give us some room on the backup server while simultaneously minimizing the risk to the backup integrity.

To do this, we're going to migrate a mix of small database servers (at first) and large ones (more quickly than we'd like).

Phase I: alpha testing

Migrate the following backups from bungei to backup-storage-01:

  • weather-01 (12.7GiB)
  • rude (35.1GiB)
  • materculae (151.9GiB)

Phase II: beta testing

After a week, retire the above backups from bungei, then migrate the following servers:

  • gitlab-02 (34.9GiB)
  • polyanthum (20.3GiB)
  • meronense (505.1GiB)

Phase III: production

After another week, migrate the last backups from bungei:

  • bacula-director-01 (180.8GiB)

At this point, we should hopefully have enough room on the backup server to survive the holidays.

Phase IV: retire legacy, bungei replacement

At this point, the only backups using the legacy system are the ones from the gnt-dal cluster (4 servers). Rebuild those with the new service. Do not keep a copy of the legacy system on bungei (to save space, particularly for metricsdb-01) but possibly archive a copy of the legacy backups on backup-storage-01:

  • metricsdb-01 (1.6TiB)
  • puppetdb-01 (20.2GiB)
  • survey-01 (5.7GiB)
  • anonticket-01 (3.9GiB)

If we still run out of disk space on bungei, consider replacing the server entirely. The server is now 5 years old, which is getting close to our current amortization time (6 years). It's also a rental server, so it's relatively easy to replace, as we don't need to buy new hardware.
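The amount of data each phase moves can be tallied from the sizes listed above. The following is only a sketch: it uses the sizes as quoted in this document (with 1.6TiB converted at 1024GiB per TiB, so the last figures are approximate), and the actual on-disk footprint of the backups may differ from these numbers.

```python
# Sizes in GiB, copied from the phase lists above.
GIB_PER_TIB = 1024

PHASES = {
    "I": {"weather-01": 12.7, "rude": 35.1, "materculae": 151.9},
    "II": {"gitlab-02": 34.9, "polyanthum": 20.3, "meronense": 505.1},
    "III": {"bacula-director-01": 180.8},
    "IV": {
        "metricsdb-01": 1.6 * GIB_PER_TIB,
        "puppetdb-01": 20.2,
        "survey-01": 5.7,
        "anonticket-01": 3.9,
    },
}


def phase_totals(phases):
    """Return {phase: (phase total, running total)} in GiB, rounded to 0.1."""
    totals = {}
    running = 0.0
    for name, servers in phases.items():
        size = sum(servers.values())
        running += size
        totals[name] = (round(size, 1), round(running, 1))
    return totals


for name, (size, running) in phase_totals(PHASES).items():
    print(f"Phase {name}: {size:7.1f} GiB (cumulative {running:7.1f} GiB)")
```

By this tally, phases I to III cover about 940.8GiB, and phase IV alone accounts for about 1668.2GiB, almost all of it metricsdb-01.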

Alternatives considered

See the alternatives considered in our PostgreSQL documentation.

Costs

Staff estimates (3-4 weeks)

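| Task | Time | Complexity | Estimate | Days | Note |
|------|------|------------|----------|------|------|
| pgbarman testing and manual setup | 3 days | high | 1 week | 6 | |
| pgbarman puppetization | 3 days | medium | 1 week | 4.5 | |
| migrate 12 servers | 3 days | high | 1 week | 4.5 | assuming we can migrate 4 servers per day |
| legacy code cleanup | 1 day | low | ~1 day | 1.1 | |
| Sub-total | 2 weeks | ~medium | 3 weeks | 16.1 | |
| bungei replacement | 3 days | low | ~3 days | 3.3 | optional |
| bungei resizing | 1 day | low | ~1 day | 1.1 | optional |
| Total | ~3 weeks | ~medium | ~4 weeks | 20.5 | |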
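The sub-total and total in the Days column can be checked with a few lines of Python. This sketch only sums the per-task figures from the table; the 5-day work week used to convert days into weeks is an assumption, not something stated here.

```python
# (task, estimated days, optional?) taken from the Days column above.
TASKS = [
    ("pgbarman testing and manual setup", 6.0, False),
    ("pgbarman puppetization", 4.5, False),
    ("migrate 12 servers", 4.5, False),
    ("legacy code cleanup", 1.1, False),
    ("bungei replacement", 3.3, True),
    ("bungei resizing", 1.1, True),
]

WORK_DAYS_PER_WEEK = 5  # assumption


def total_days(tasks, include_optional):
    """Sum estimated days, rounded to 0.1 to hide float noise."""
    days = sum(d for _, d, optional in tasks if include_optional or not optional)
    return round(days, 1)


subtotal = total_days(TASKS, include_optional=False)
total = total_days(TASKS, include_optional=True)
print(f"sub-total: {subtotal} days (~{subtotal / WORK_DAYS_PER_WEEK:.1f} weeks)")
print(f"total:     {total} days (~{total / WORK_DAYS_PER_WEEK:.1f} weeks)")
```

Under that assumption, the figures come out to roughly 3.2 and 4.1 weeks, in line with the "3-4 weeks" in the heading.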

Hosting costs (+70EUR/mth, optional)

bungei is a SX132 server, billed monthly at 175EUR. It has the following specifications:

  • Intel Xeon E5-1650 (12 Core, 3.5GHz)
  • RAM: 128GiB DDR4
  • Storage: 10x10TB SAS drives (100TB, HGST HUH721010AL)

A likely replacement would be the SX135 server, at 243EUR and a 94EUR setup fee:

  • AMD Ryzen 9 3900 (12 core, 3.1GHz)
  • RAM: 128GiB
  • Storage: 8x22TB SATA drives (176TB)

There's a cheaper server, the SX65 at 124EUR/mth, but it has less disk space (4x22TB, 88TB). It might be enough, that said, if we do not need to grow bungei and simply need to retire it.
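To compare the three offers, here is a small Python sketch computing the monthly price per TB and the extra first-year cost of moving to the SX135. It uses only the prices and drive counts quoted above and looks at raw capacity, not the usable space after RAID. No setup fee is quoted for the SX65, so none is assumed.

```python
# (monthly price in EUR, raw storage in TB), as quoted in this section.
OFFERS = {
    "SX132": (175, 10 * 10),  # bungei today
    "SX135": (243, 8 * 22),
    "SX65": (124, 4 * 22),
}
SX135_SETUP_FEE = 94  # EUR, one-time


def eur_per_tb(offer):
    """Monthly price per raw TB, rounded to cents."""
    price, capacity = OFFERS[offer]
    return round(price / capacity, 2)


def first_year_extra(new, old="SX132", setup_fee=0):
    """Extra cost over 12 months of switching from `old` to `new`."""
    return (OFFERS[new][0] - OFFERS[old][0]) * 12 + setup_fee


for name in OFFERS:
    print(f"{name}: {eur_per_tb(name):.2f} EUR/TB/month")
print("SX135 first-year extra:", first_year_extra("SX135", setup_fee=SX135_SETUP_FEE), "EUR")
```

The monthly difference between the SX135 and bungei is 68EUR, which is where the "+70EUR/mth" in the heading comes from once rounded.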

References

Appendix

Backups inventory

Here's the list of current PostgreSQL databases on the storage server and their locations:

| server | location | size | note |
|--------|----------|------|------|
| anonticket-01 | gnt-dal | 3.9GiB | |
| bacula-director-01 | gnt-fsn | 180.8GiB | |
| gitlab-02 | gnt-fsn | 34.9GiB | move to gnt-dal considered, #41431 |
| materculae | gnt-fsn | 151.9GiB | |
| meronense | gnt-fsn | 505.1GiB | |
| metricsdb-01 | gnt-dal | 1.6TiB | huge! |
| polyanthum | gnt-fsn | 20.3GiB | |
| puppetdb-01 | gnt-dal | 20.2GiB | |
| rude | gnt-fsn | 35.1GiB | |
| survey-01 | gnt-dal | 5.7GiB | |
| weather-01 | gnt-fsn | 12.7GiB | |

gnt-fsn servers

Same, but only for the servers at Hetzner, sorted by size:

| server | size |
|--------|------|
| meronense | 505.1GiB |
| bacula-director-01 | 180.8GiB |
| materculae | 151.9GiB |
| rude | 35.1GiB |
| gitlab-02 | 34.9GiB |
| polyanthum | 20.3GiB |
| weather-01 | 12.7GiB |

gnt-dal

Same for Dallas:

| server | size |
|--------|------|
| metricsdb-01 | 1.6TiB |
| puppetdb-01 | 20.2GiB |
| survey-01 | 5.7GiB |
| anonticket-01 | 3.9GiB |
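The inventory does not give per-location totals, so here is a short Python sketch that derives them, along with the share taken by metricsdb-01. It relies only on the sizes listed in this appendix; 1.6TiB is converted at 1024GiB per TiB and is itself a rounded figure, so the results are approximate.

```python
from collections import defaultdict

# (server, location, size in GiB), copied from the inventory above.
INVENTORY = [
    ("anonticket-01", "gnt-dal", 3.9),
    ("bacula-director-01", "gnt-fsn", 180.8),
    ("gitlab-02", "gnt-fsn", 34.9),
    ("materculae", "gnt-fsn", 151.9),
    ("meronense", "gnt-fsn", 505.1),
    ("metricsdb-01", "gnt-dal", 1.6 * 1024),
    ("polyanthum", "gnt-fsn", 20.3),
    ("puppetdb-01", "gnt-dal", 20.2),
    ("rude", "gnt-fsn", 35.1),
    ("survey-01", "gnt-dal", 5.7),
    ("weather-01", "gnt-fsn", 12.7),
]


def totals_by_location(inventory):
    """Sum sizes per location, in GiB, rounded to 0.1."""
    totals = defaultdict(float)
    for _, location, size in inventory:
        totals[location] += size
    return {location: round(size, 1) for location, size in totals.items()}


def share_percent(inventory, server):
    """Percentage of the overall total taken by one server."""
    overall = sum(size for _, _, size in inventory)
    own = sum(size for name, _, size in inventory if name == server)
    return round(100 * own / overall, 1)


print(totals_by_location(INVENTORY))
print("metricsdb-01 share:", share_percent(INVENTORY, "metricsdb-01"), "%")
```

On these figures, metricsdb-01 alone makes up well over half of the total, which is why it dominates the space planning.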