Summary: migration of the remaining Cymru services in the coming week, help needed to test new servers.

What?

TPA will be migrating a little over a dozen virtual machines (VMs) off the old Cymru cluster in Chicago to a shiny new cluster in Dallas. This is the list of affected VMs:

btcpayserver-02
ci-runner-x86-01
dangerzone-01
gitlab-dev-01
metrics-psqlts-01
onionbalance-02
probetelemetry-01
rdsys-frontend-01
static-gitlab-shim
survey-01
tb-pkgstage-01
tb-tester-01
telegram-bot-01
tpa-bootstrap-01

Members of the anticensorship and metrics teams are particularly affected, but services like BTCpayserver, dangerzone, onionbalance, and static site deployments from GitLab (but not GitLab itself) will also be affected.

When?

We hope to start migrating the VMs on Monday 2023-03-20, but this is likely to continue during the rest of the week, as we may stop the migration process if we encounter problems.

How?

Each VM is migrated one by one, following roughly this process:

1. a snapshot is taken on the source cluster, then copied to the target
2. the VM is shut down on the source
3. the target VM is renumbered so it's networked, but DNS still points to the old VM
4. the service is tested
5. if it works, then DNS records are changed to point to the new VM
6. after a week, the old VMs are destroyed

The TTL ("Time To Live") in DNS is currently an hour so the outage will last at least that long, for each VM. Depending on the size of the VM, the transfer could actually take much longer as well. So far a 20GB VM is transferred in about 10 minutes.
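To get a rough idea of how long a given VM might take to copy, the single data point above (20GB in about 10 minutes) can be extrapolated. The sketch below assumes the transfer time scales linearly with disk size, which may well not hold in practice; the VM names and sizes in the example are hypothetical and not taken from the actual cluster.

```python
# Assumption: transfer time grows linearly with disk size, based on the one
# observation quoted above (a 20GB VM transferred in about 10 minutes).
OBSERVED_GB = 20
OBSERVED_MINUTES = 10
DNS_TTL_MINUTES = 60  # the current DNS TTL is one hour


def estimate_transfer_minutes(disk_gb: float) -> float:
    """Return a rough transfer time in minutes for a disk of disk_gb."""
    return disk_gb * OBSERVED_MINUTES / OBSERVED_GB


# Hypothetical disk sizes, for illustration only.
example_vms = {"example-small": 20, "example-large": 200}

for name, size_gb in example_vms.items():
    minutes = estimate_transfer_minutes(size_gb)
    print(f"{name}: {size_gb}GB, roughly {minutes:.0f} minutes to transfer "
          f"(outage of at least {DNS_TTL_MINUTES} minutes because of the TTL)")
```

The numbers are only an order of magnitude: network conditions and disk activity will change the real figures.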

Affected team members are encouraged to coordinate with us over chat (#tor-admin on irc.OFTC.net or #tor-admin:matrix.org) during the maintenance window to test the new service (step 4 above).

You may also ask for a longer delay before the destruction of the old VM in step 6.
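When testing a new VM before DNS is changed, the service name still resolves to the old VM, so a test has to talk to the new IP address directly. Here is a minimal sketch of one way to do that for a plain HTTP service, by connecting to the IP and setting the Host header by hand. The function name, the IP address and the hostname are all hypothetical, and HTTPS services would also need the right TLS server name, which is not covered here.

```python
import http.client


def check_http_on_new_vm(new_ip: str, hostname: str, port: int = 80,
                         path: str = "/", timeout: float = 10.0) -> int:
    """Send a GET to new_ip while presenting hostname in the Host header.

    Returns the HTTP status code. DNS is not consulted, so the request
    reaches the new VM even while the records still point to the old one.
    """
    conn = http.client.HTTPConnection(new_ip, port, timeout=timeout)
    try:
        # Passing our own Host header replaces the one http.client would
        # otherwise derive from the address we connected to.
        conn.request("GET", path, headers={"Host": hostname})
        return conn.getresponse().status
    finally:
        conn.close()


# Example call (hypothetical address and name, not real TPA values):
#   status = check_http_on_new_vm("192.0.2.10", "service.example.org")
```

A status code alone does not prove the service works, so this is best treated as a first smoke test before a proper check by the affected team.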

Why?

The details of that move are discussed briefly in this past proposal:

https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-40-cymru-migration

The migration took longer than expected partly because I hit a snag in the VM migration routines, which required some serious debugging and patching.

Now we finally have an automated job to batch-migrate VMs between Ganeti clusters. This means that not only will we be evacuating the Cymru cluster very soon, but we also have a clean mechanism to do this again, much faster, the next time we're in such a situation.

References

Comments welcome in tpo/tpa/team#40972, see also:

TPA-RFC-40: Cymru migration budget pre-approval
TPA-RFC-43: Cymru migration plan
