title: "TPA-RFC-80: Debian 13 ("trixie") upgrade schedule" costs: staff, 4+ weeks approval: TPA, service admins affected users: TPA, service admins deadline: 2 weeks, 2025-04-01 status: standard discussion: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41990

Summary: start upgrading servers during the Debian 13 ("trixie") freeze and, if that goes well, complete most of the fleet upgrade around June 2025, with full completion by the end of 2025 and a 2026 entirely free of major upgrades. Improve automation and clean up old code.

Background

Debian 13 ("trixie"), currently "testing", is going into freeze soon, which means we should have a new Debian stable release in 2025. It has been a long-standing tradition at TPA to collaborate in the Debian development process and part of that process is to upgrade our servers during the freeze. Upgrading during the freeze makes it easier for us to fix bugs as we find them and contribute them to the community.

The freeze dates announced by the debian.org release team are:

2025-03-15      - Milestone 1 - Transition and toolchain freeze
2025-04-15      - Milestone 2 - Soft Freeze
2025-05-15      - Milestone 3 - Hard Freeze - for key packages and
                                packages without autopkgtests
To be announced - Milestone 4 - Full Freeze

We have entered the "transition and toolchain freeze", which locks changes on packages like compilers and interpreters unless an exception is granted. See the Debian freeze policy for an explanation of each step.

Even though we've just completed the Debian 11 ("bullseye") and 12 ("bookworm") upgrades in late 2024, we feel it's a good idea to start and complete the Debian 13 upgrades in 2025. That way, we can hope to have a year or two (2026-2027?) without any major upgrades.

This proposal is part of the Debian 13 trixie upgrade milestone, itself part of the 2025 TPA roadmap.

Proposal

As usual, we perform the upgrades in three batches, in increasing order of complexity, starting in 2025Q2, hoping to finish by the end of 2025.

Note that, this year, this proposal also includes upgrading the Tails infrastructure. To help with merging rotations between the two teams, TPA staff will upgrade Tails machines, with the Tails folks' assistance, and vice versa.

Affected users

All service admins are affected by this change. If you have shell access on any TPA server, you should read this announcement.

In the past, TPA has typically kept a page detailing notable changes, and a proposal like this one would link to the upstream release notes. Unfortunately, at the time of writing, upstream hasn't yet produced release notes (as we're still in testing).

We're hoping that documentation will have been fleshed out by the time we're ready to coordinate the second batch of upgrades, around May 2025, when we will send reminders to the affected teams.

We do expect the Debian 13 upgrade to be less disruptive than the bookworm one, mainly because Python 2 is already retired.

Notable changes

For now, here are some known changes that are already in Debian 13:

| Package            | 12 (bookworm) | 13 (trixie) |
|--------------------|---------------|-------------|
| Ansible            | 7.7           | 11.2        |
| Apache             | 2.4.62        | 2.4.63      |
| Bash                | 5.2.15        | 5.2.37      |
| Emacs              | 28.2          | 30.1        |
| Fish               | 3.6           | 4.0         |
| Git                | 2.39          | 2.45        |
| GCC                | 12.2          | 14.2        |
| Golang             | 1.19          | 1.24        |
| Linux kernel image | 6.1 series    | 6.12 series |
| LLVM               | 14            | 19          |
| MariaDB            | 10.11         | 11.4        |
| Nginx              | 1.22          | 1.26        |
| OpenJDK            | 17            | 21          |
| OpenLDAP           | 2.5.13        | 2.6.9       |
| OpenSSL            | 3.0           | 3.4         |
| PHP                | 8.2           | 8.4         |
| Podman             | 4.3           | 5.4         |
| PostgreSQL         | 15            | 17          |
| Prometheus         | 2.42          | 2.53        |
| Puppet             | 7             | 8           |
| Python             | 3.11          | 3.13        |
| Rustc              | 1.63          | 1.85        |
| Vim                | 9.0           | 9.1         |

Most of those, except "toolchains" (e.g. LLVM, GCC), can still change, as we're not in the full freeze yet.

Upgrade schedule

The upgrade is split into multiple batches:

  • automation and installer changes

  • low complexity: mostly TPA services and less critical Tails servers

  • moderate complexity: TPA "service admins" machines and remaining Tails physical servers and VMs running services from the official Debian repositories only

  • high complexity: Tails VMs running services not from the official Debian repositories

  • cleanup

The free time between the first two batches will also allow us to cover contingencies: upgrades that drag on and other work that will inevitably need to be performed.

The objective is to do the batches in collective "upgrade parties" that should be "fun" for the team. This approach has proven effective in previous upgrades and we are eager to repeat it.

Upgrade automation and installer changes

First, we tweak the installers to deploy Debian 13 by default, to avoid installing further "old" systems. This includes the bare-metal installers but also, and especially, the virtual machine installers and default container images.

Concretely, we're planning on changing the stable container image tag to point to trixie in early April. We will be working on a retirement policy for container images later, as we do not want to bury that important (and new) policy here. For now, you should assume that bullseye images are going to go away soon (tpo/tpa/base-images#19), but a separate announcement will be issued for this (tpo/tpa/base-images#24).
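For teams consuming those images, the usual mitigation is to pin an explicit codename tag instead of the moving stable tag. Below is a minimal sketch of that idea; the registry path and tag names are assumptions to be checked against the base-images project, not a statement of the actual image names.

```
# Illustration only: pin a codename tag rather than the moving "stable" tag so
# the April switch to trixie doesn't silently change your builds. The registry
# path and tag names here are assumptions; check the tpo/tpa/base-images
# project for the real image names.
podman pull containers.torproject.org/tpo/tpa/base-images/debian:bookworm

# or, in a Containerfile/Dockerfile:
# FROM containers.torproject.org/tpo/tpa/base-images/debian:bookworm
```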

New idle canary servers will be set up with Debian 13 to test integration with the rest of the infrastructure, and future new machine installs will be done with Debian 13.

We also want to work on automating the upgrade procedure further. In particular, we've had catastrophic errors in the PostgreSQL upgrade procedure in the past, and the whole procedure is now considered ripe for automation; see tpo/tpa/team#41485 for details.
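As a rough illustration of the kind of step worth automating, here is a minimal sketch of a major PostgreSQL version bump using the standard postgresql-common tools shipped with Debian. The version numbers are examples (15 on bookworm, 17 on trixie), and the real procedure involves more checks and backups; the automation work itself is tracked in tpo/tpa/team#41485.

```
#!/bin/sh
# Minimal sketch of a major PostgreSQL version bump with Debian's
# postgresql-common tools; version numbers are examples and the real
# procedure adds backups and verification steps.
set -eu

pg_lsclusters                    # show existing clusters, e.g. 15/main and 17/main
pg_dropcluster --stop 17 main    # drop the empty cluster created by the new package
pg_upgradecluster 15 main        # migrate 15/main to a new 17/main cluster
# pg_dropcluster 15 main         # remove the old cluster once the new one is verified
```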

Batch 1: low complexity

This is scheduled over two weeks: TPA boxes will be upgraded in the last week of April, and Tails machines in the first week of May.

The idea is to start the upgrade long enough before the vacations to give us plenty of time to recover, and some room to start the second batch.

In April, Debian should also be in "soft freeze", not quite a fully "stable" environment, but that should be good enough for simple setups.

36 TPA machines:

- [ ] archive-01.torproject.org
- [ ] cdn-backend-sunet-02.torproject.org
- [ ] chives.torproject.org
- [ ] dal-rescue-01.torproject.org
- [ ] dal-rescue-02.torproject.org
- [ ] gayi.torproject.org
- [ ] hetzner-hel1-02.torproject.org
- [ ] hetzner-hel1-03.torproject.org
- [ ] hetzner-nbg1-01.torproject.org
- [ ] hetzner-nbg1-02.torproject.org
- [ ] idle-dal-02.torproject.org
- [ ] idle-fsn-01.torproject.org
- [ ] lists-01.torproject.org
- [ ] loghost01.torproject.org
- [ ] mandos-01.torproject.org
- [ ] media-01.torproject.org
- [ ] metricsdb-01.torproject.org
- [ ] minio-01.torproject.org
- [ ] mta-dal-01.torproject.org
- [ ] mx-dal-01.torproject.org
- [ ] neriniflorum.torproject.org
- [ ] ns3.torproject.org
- [ ] ns5.torproject.org
- [ ] palmeri.torproject.org
- [ ] perdulce.torproject.org
- [ ] srs-dal-01.torproject.org
- [ ] ssh-dal-01.torproject.org
- [ ] static-gitlab-shim.torproject.org
- [ ] staticiforme.torproject.org
- [ ] static-master-fsn.torproject.org
- [ ] submit-01.torproject.org
- [ ] vault-01.torproject.org
- [ ] web-dal-07.torproject.org
- [ ] web-dal-08.torproject.org
- [ ] web-fsn-01.torproject.org
- [ ] web-fsn-02.torproject.org

4 Tails machines:

ecours.tails.net
puppet.lizard
skink.tails.net
stone.tails.net

In the first batch of bookworm upgrades, we ended up taking about 20 minutes per machine, all done in a single day, although the second batch took longer.

It's probably safe to estimate 20 hours for this work (30 minutes for each of the 40 machines), spread over a single week.
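To put that estimate in perspective, the core of each upgrade is mostly mechanical. The sketch below is a deliberately simplified illustration, not the actual TPA procedure, which adds backups, checklists and service-specific verifications.

```
# Deliberately simplified illustration of the per-machine core of the upgrade;
# the documented TPA procedure has many more steps.
sed -i 's/bookworm/trixie/g' /etc/apt/sources.list /etc/apt/sources.list.d/*.list
apt update
apt full-upgrade
reboot
```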

Feedback and coordination of this batch happens in issue batch 1.

Batch 2: moderate complexity

This is scheduled for the last week of May for TPA machines, and the first week of June for Tails.

At this point, Debian testing should be in "hard freeze", which should be more stable.

39 TPA machines:

- [ ] anonticket-01.torproject.org
- [ ] backup-storage-01.torproject.org
- [ ] bacula-director-01.torproject.org
- [ ] btcpayserver-02.torproject.org
- [ ] bungei.torproject.org
- [ ] carinatum.torproject.org
- [ ] check-01.torproject.org
- [ ] ci-runner-x86-02.torproject.org
- [ ] ci-runner-x86-03.torproject.org
- [ ] colchicifolium.torproject.org
- [ ] collector-02.torproject.org
- [ ] crm-int-01.torproject.org
- [ ] dangerzone-01.torproject.org
- [ ] donate-01.torproject.org
- [ ] donate-review-01.torproject.org
- [ ] forum-01.torproject.org
- [ ] gitlab-02.torproject.org
- [ ] henryi.torproject.org
- [ ] materculae.torproject.org
- [ ] meronense.torproject.org
- [ ] metricsdb-02.torproject.org
- [ ] metrics-store-01.torproject.org
- [ ] onionbalance-02.torproject.org
- [ ] onionoo-backend-03.torproject.org
- [ ] polyanthum.torproject.org
- [ ] probetelemetry-01.torproject.org
- [ ] rdsys-frontend-01.torproject.org
- [ ] rdsys-test-01.torproject.org
- [ ] relay-01.torproject.org
- [ ] rude.torproject.org
- [ ] survey-01.torproject.org
- [ ] tbb-nightlies-master.torproject.org
- [ ] tb-build-02.torproject.org
- [ ] tb-build-03.torproject.org
- [ ] tb-build-06.torproject.org
- [ ] tb-pkgstage-01.torproject.org
- [ ] tb-tester-01.torproject.org
- [ ] telegram-bot-01.torproject.org
- [ ] weather-01.torproject.org

17 Tails machines:

apt-proxy.lizard
apt.lizard
bitcoin.lizard
bittorrent.lizard
bridge.lizard
dns.lizard
dragon.tails.net
gitlab-runner.iguana
iguana.tails.net
lizard.tails.net
mail.lizard
misc.lizard
puppet-git.lizard
rsync.lizard
teels.tails.net
whisperback.lizard
www.lizard

The second batch of bookworm upgrades took 33 hours for 31 machines, so about one hour per box. Here we have 56 machines, so it will likely take us 60 hours (or two weeks) to complete the upgrade.

Feedback and coordination of this batch happens in issue batch 2.

Batch 3: high complexity

Those machines are harder to upgrade, or more critical. In the case of TPA machines, this batch typically groups the Ganeti servers and all the "snowflake" servers that are not properly Puppetized and carry a lot of legacy, namely the LDAP, DNS, and Puppet servers.

That said, we waited a long time to upgrade the Ganeti cluster for bookworm, and it turned out to be trivial, so perhaps those could eventually be made part of the second batch.

15 TPA machines:

- [ ] alberti.torproject.org
- [ ] dal-node-01.torproject.org
- [ ] dal-node-02.torproject.org
- [ ] dal-node-03.torproject.org
- [ ] fsn-node-01.torproject.org
- [ ] fsn-node-02.torproject.org
- [ ] fsn-node-03.torproject.org
- [ ] fsn-node-04.torproject.org
- [ ] fsn-node-05.torproject.org
- [ ] fsn-node-06.torproject.org
- [ ] fsn-node-07.torproject.org
- [ ] fsn-node-08.torproject.org
- [ ] nevii.torproject.org
- [ ] pauli.torproject.org
- [ ] puppetdb-01.torproject.org

It seems like the bookworm Ganeti upgrade took roughly 10h of work. We ballpark the rest of the batch at another 10h of work, so possibly 20h in total.

11 Tails machines:

- [ ] isoworker1.dragon
- [ ] isoworker2.dragon
- [ ] isoworker3.dragon
- [ ] isoworker4.dragon
- [ ] isoworker5.dragon
- [ ] isoworker6.iguana
- [ ] isoworker7.iguana
- [ ] isoworker8.iguana
- [ ] jenkins.dragon
- [ ] survey.lizard
- [ ] translate.lizard

The challenge with Tails upgrades is the coordination with the Tails team, in particular for the Jenkins upgrades.

Feedback and coordination of this batch happens in issue batch 3.

Cleanup work

Once the upgrade is completed and the entire fleet is again running a single OS, it's time for cleanup. This involves updating configuration files to the new versions, removing old compatibility code in Puppet, removing old container images, and generally wrapping things up.

This process has been historically neglected, but we're hoping to wrap this up, worst case in 2026.
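As a concrete example of what that compatibility cleanup looks like, once the whole fleet runs trixie we can hunt down branches keyed on older codenames in the Puppet code base. The command below is an illustration only, and the path is an assumption about where the modules are checked out.

```
# Illustration only: find leftover bullseye/bookworm compatibility branches in
# the Puppet code base (the path is an assumption).
grep -rn --include='*.pp' --include='*.epp' -E 'bullseye|bookworm' modules/
```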

Timeline

  • 2025-Q2
    • W14 (first week of April): installer defaults changed and first tests in production
    • W19 (first week of May): Batch 1 upgrades, TPA machines
    • W20 (second week of May): Batch 1 upgrades, Tails machines
    • W23 (first week of June): Batch 2 upgrades, TPA machines
    • W24 (second week of June): Batch 2 upgrades, Tails machines
  • 2025-Q3 to Q4: Batch 3 upgrades
  • 2026+: cleanup

Deadline

The community has until the beginning of the above timeline to raise concerns or objections.

Two weeks before performing the upgrades of each batch, a new announcement will be sent with details of the changes and impacted services.

Alternatives considered

Retirements or rebuilds

We do not plan any major retirements or rebuilds as part of the third batch this time.

In the future, we hope to decouple those as much as possible, as the Icinga retirement and the Mailman 3 migration became blockers that significantly slowed down the bookworm upgrade. In both cases, however, the work was challenging and had to be performed one way or another, so it's unclear if we can optimize this any further.

We are clear, however, that we will not postpone an upgrade for a server retirement. Dangerzone, for example, is scheduled for retirement (TPA-RFC-78) but its upgrade is still planned as normal above.

Costs

| Task              | Estimate | Uncertainty | Worst case |
|-------------------|----------|-------------|------------|
| Automation        | 20h      | extreme     | 100h       |
| Installer changes | 4h       | low         | 4.4h       |
| Batch 1           | 20h      | low         | 22h        |
| Batch 2           | 60h      | medium      | 90h        |
| Batch 3           | 20h      | high        | 40h        |
| Cleanup           | 20h      | medium      | 30h        |
| Total             | 144h     | ~high       | ~286h      |

The work here should add up to over 140 hours, or 18 days, or about 4 weeks full time. The worst case doubles that.

The above is done in "hours" because that's how we estimated batches in the past, but here's an estimate that's based on the Kaplan-Moss estimation technique.

| Task              | Estimate | Uncertainty | Worst case |
|-------------------|----------|-------------|------------|
| Automation        | 3d       | extreme     | 15d        |
| Installer changes | 1d       | low         | 1.1d       |
| Batch 1           | 3d       | low         | 3.3d       |
| Batch 2           | 10d      | medium      | 20d        |
| Batch 3           | 3d       | high        | 6d         |
| Cleanup           | 3d       | medium      | 4.5d       |
| Total             | 23d      | ~high       | ~50d       |

This is roughly equivalent, if a little higher (23 days instead of 18).
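For reference, the worst-case column in the hours table is simply each estimate multiplied by an uncertainty factor; assuming the usual Kaplan-Moss multipliers (low ×1.1, medium ×1.5, high ×2, extreme ×5), the total checks out:

```
# Sanity check of the worst-case total in the hours table, assuming the usual
# Kaplan-Moss uncertainty multipliers (low x1.1, medium x1.5, high x2, extreme x5):
echo '20*5 + 4*1.1 + 20*1.1 + 60*1.5 + 20*2 + 20*1.5' | bc   # -> 286.4, i.e. ~286h
```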

It should be noted that automation is not expected to drastically reduce the total time spent on batches (currently 16 days, or 100 hours). The main goal of automation is to reduce the likelihood of catastrophic errors and to make it easier to share our upgrade procedure with the world. We're still hoping to reduce the time spent on batches, hopefully by 10-20%, which would bring the total across batches from 16 days down to around 14, or from 100 hours to 80.

Approvals required

This proposal needs approval from TPA team members, but service admins can request additional delay if they are worried about their service being affected by the upgrade.

Comments or feedback can be provided in issues linked above, or the general process can be commented on in issue tpo/tpa/team#41990.

References