Summary: migration of the remaining Cymru services in the coming week, help needed to test new servers.
What?
TPA will be migrating a little over a dozen virtual machines (VMs) off the old Cymru cluster in Chicago to a shiny new cluster in Dallas. This is the list of affected VMs:
- btcpayserver-02
- ci-runner-x86-01
- dangerzone-01
- gitlab-dev-01
- metrics-psqlts-01
- onionbalance-02
- probetelemetry-01
- rdsys-frontend-01
- static-gitlab-shim
- survey-01
- tb-pkgstage-01
- tb-tester-01
- telegram-bot-01
- tpa-bootstrap-01
Members of the anticensorship and metrics teams are particularly affected, but services like BTCpayserver, dangerzone, onionbalance, and static site deployments from GitLab (but not GitLab itself) will also be affected.
When?
We hope to start migrating the VMs on Monday 2023-03-20, but this is likely to continue during the rest of the week, as we may stop the migration process if we encounter problems.
How?
Each VM is migrated one by one, following roughly this process:
1. a snapshot is taken on the source cluster, then copied to the target
2. the VM is shut down on the source
3. the target VM is renumbered so it's networked, but DNS still points to the old VM
4. the service is tested
5. if it works, then DNS records are changed to point to the new VM
6. after a week, the old VM is destroyed
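As a rough illustration of the "one by one" flow above, here is a minimal Python sketch. It is not the actual Ganeti migration job: the function `migrate_batch` and the `test_service` callback are hypothetical names made up for this example, and the real snapshot, renumbering and DNS steps are only represented as comments. It shows one thing only: the batch stops at the first VM whose service test fails, leaving the remaining VMs untouched.

```python
from typing import Callable, Iterable, List, Tuple


def migrate_batch(
    vms: Iterable[str],
    test_service: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Walk the VM list in order; stop at the first failed service test.

    Hypothetical sketch: returns (migrated, not_migrated).
    """
    pending = list(vms)
    migrated: List[str] = []
    while pending:
        vm = pending[0]
        # Steps 1-3 (snapshot and copy, shutdown, renumbering) would happen here.
        if not test_service(vm):
            # Step 4 failed: DNS is left pointing at the old VM and the batch stops.
            break
        # Step 5: DNS would be switched to the new VM here.
        migrated.append(vm)
        pending.pop(0)
    return migrated, pending


if __name__ == "__main__":
    # Pretend the test of survey-01 fails; names are taken from the list above.
    done, left = migrate_batch(
        ["dangerzone-01", "survey-01", "tb-tester-01"],
        lambda vm: vm != "survey-01",
    )
    print("migrated:", done)
    print("not migrated:", left)
```

In this sketch a failed test leaves DNS unchanged for that VM, which matches the process above where DNS is only switched once the service works.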
The TTL ("Time To Live") in DNS is currently one hour, so the outage will last at least that long for each VM. Depending on the size of the VM, the transfer could take much longer than that. So far, a 20GB VM has been transferred in about 10 minutes.
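To get a feel for transfer times, the single data point above (20GB in about 10 minutes) can be extrapolated. The sketch below assumes transfer time scales linearly with disk size, which is only an assumption: real transfers depend on network load and how full the disks are. The function name `estimate_transfer_minutes` and the example sizes are made up for illustration.

```python
# Observed so far, per the announcement: a 20GB VM transfers in about 10 minutes.
OBSERVED_SIZE_GB = 20
OBSERVED_MINUTES = 10


def estimate_transfer_minutes(size_gb: float) -> float:
    """Linear extrapolation from the one observed transfer (an assumption)."""
    return size_gb * OBSERVED_MINUTES / OBSERVED_SIZE_GB


if __name__ == "__main__":
    # Example sizes only; these are not the actual disk sizes of any VM.
    for size in (20, 100, 500):
        print(f"{size}GB: about {estimate_transfer_minutes(size):.0f} minutes")
```

The one-hour DNS TTL remains a separate floor on the outage regardless of how fast the transfer goes.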
Affected team members are encouraged to coordinate with us over chat (#tor-admin on irc.OFTC.net or #tor-admin:matrix.org) during the maintenance window to test the new service (step 4 above).
You may also ask for a longer delay before the destruction of the old VM in step 6.
Why?
The details of that move are discussed briefly in this past proposal:
https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-40-cymru-migration
The migration took longer than expected partly because I hit a snag in the VM migration routines, which required some serious debugging and patching.
Now we finally have an automated job to batch-migrate VMs between Ganeti clusters. This means that not only will we be evacuating the Cymru cluster very soon, but we also have a clean mechanism to do this again, much faster, the next time we're in such a situation.
References
Comments welcome in tpo/tpa/team#40972, see also:
- TPA-RFC-40: Cymru migration budget pre-approval
- TPA-RFC-43: Cymru migration plan