title: "TPA-RFC-80: Debian 13 ("trixie") upgrade schedule" costs: staff, 4+ weeks approval: TPA, service admins affected users: TPA, service admins deadline: 2 weeks, 2025-04-01 status: standard discussion: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41990

Summary: start upgrading servers during the Debian 13 ("trixie") freeze and, if that goes well, complete most of the fleet upgrade around June 2025, with full completion by the end of 2025 and a 2026 entirely free of major upgrades. Improve automation and clean up old code.

Background

Debian 13 ("trixie"), currently "testing", is going into freeze soon, which means we should have a new Debian stable release in 2025. It has been a long-standing tradition at TPA to collaborate in the Debian development process and part of that process is to upgrade our servers during the freeze. Upgrading during the freeze makes it easier for us to fix bugs as we find them and contribute them to the community.

The freeze dates announced by the debian.org release team are:

2025-03-15      - Milestone 1 - Transition and toolchain freeze
2025-04-15      - Milestone 2 - Soft Freeze
2025-05-15      - Milestone 3 - Hard Freeze - for key packages and
                                packages without autopkgtests
To be announced - Milestone 4 - Full Freeze

We have entered the "transition and toolchain freeze", which locks changes on packages like compilers and interpreters unless an exception is granted. See the Debian freeze policy for an explanation of each step.

Even though we've just completed the Debian 11 ("bullseye") and 12 ("bookworm") upgrades in late 2024, we feel it's a good idea to start and complete the Debian 13 upgrades in 2025. That way, we can hope to have a year or two (2026-2027?) without any major upgrades.

This proposal is part of the Debian 13 trixie upgrade milestone, itself part of the 2025 TPA roadmap.

Proposal

As usual, we perform the upgrades in three batches, in increasing order of complexity, starting in 2025Q2, hoping to finish by the end of 2025.

Note that, this year, this proposal also includes upgrading the Tails infrastructure. To help with merging rotations between the two teams, TPA staff will upgrade Tails machines, with the Tails folks' assistance, and vice versa.

Affected users

All service admins are affected by this change. If you have shell access on any TPA server, you should read this announcement.

In the past, TPA has typically kept a page detailing notable changes, and a proposal like this one would link to the upstream release notes. Unfortunately, at the time of writing, upstream hasn't yet produced release notes (as we're still in testing).

We're hoping that documentation will have been fleshed out by the time we're ready to coordinate the second batch of upgrades, around May 2025, when we will send reminders to the affected teams.

We do expect the Debian 13 upgrade to be less disruptive than the bookworm one, mainly because Python 2 is already retired.

Notable changes

For now, here are some known changes that are already in Debian 13:

| Package            | 12 (bookworm) | 13 (trixie) |
|--------------------|---------------|-------------|
| Ansible            | 7.7           | 11.2        |
| Apache             | 2.4.62        | 2.4.63      |
| Bash                | 5.2.15        | 5.2.37      |
| Emacs              | 28.2          | 30.1        |
| Fish               | 3.6           | 4.0         |
| Git                | 2.39          | 2.45        |
| GCC                | 12.2          | 14.2        |
| Golang             | 1.19          | 1.24        |
| Linux kernel image | 6.1 series    | 6.12 series |
| LLVM               | 14            | 19          |
| MariaDB            | 10.11         | 11.4        |
| Nginx              | 1.22          | 1.26        |
| OpenJDK            | 17            | 21          |
| OpenLDAP           | 2.5.13        | 2.6.9       |
| OpenSSL            | 3.0           | 3.4         |
| PHP                | 8.2           | 8.4         |
| Podman             | 4.3           | 5.4         |
| PostgreSQL         | 15            | 17          |
| Prometheus         | 2.42          | 2.53        |
| Puppet             | 7             | 8           |
| Python             | 3.11          | 3.13        |
| Rustc              | 1.63          | 1.85        |
| Vim                | 9.0           | 9.1         |

Most of those, except "toolchains" (e.g. LLVM, GCC), can still change, as we're not in the full freeze yet.

Upgrade schedule

The upgrade is split into multiple batches:

  • automation and installer changes

  • low complexity: mostly TPA services and less critical Tails servers

  • moderate complexity: TPA "service admins" machines and remaining Tails physical servers and VMs running services from the official Debian repositories only

  • high complexity: Tails VMs running services not from the official Debian repositories

  • cleanup

The free time between the first two batches will also allow us to cover contingencies: upgrades that drag on and other work that will inevitably need to be performed.

The objective is to do the batches in collective "upgrade parties" that should be "fun" for the team. This approach has proven effective in previous upgrades and we are eager to repeat it.

Upgrade automation and installer changes

First, we tweak the installers to deploy Debian 13 by default, to avoid installing further "old" systems. This includes the bare-metal installers but also, and especially, the virtual machine installers and default container images.

Concretely, we're planning on changing the stable container image tag to point to trixie in early April. We will be working on a retirement policy for container images later, as we do not want to bury that important (and new) policy here. For now, you should assume that bullseye images are going to go away soon (tpo/tpa/base-images#19), but a separate announcement will be issued for this (tpo/tpa/base-images#24).
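For teams consuming those images, the usual mitigation is to pin an explicit codename tag instead of the moving stable tag. Below is a minimal sketch of that idea; the registry path and tag names are assumptions to be checked against the base-images project, not a statement of the actual image names.

```
# Illustration only: pin a codename tag rather than the moving "stable" tag so
# the April switch to trixie doesn't silently change your builds. The registry
# path and tag names here are assumptions; check the tpo/tpa/base-images
# project for the real image names.
podman pull containers.torproject.org/tpo/tpa/base-images/debian:bookworm

# or, in a Containerfile/Dockerfile:
# FROM containers.torproject.org/tpo/tpa/base-images/debian:bookworm
```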

New idle canary servers will be set up with Debian 13 to test integration with the rest of the infrastructure, and future new machine installs will be done with Debian 13.

We also want to work on automating the upgrade procedure further. In particular, we've had catastrophic errors in the PostgreSQL upgrade procedure in the past, and the whole procedure is now considered ripe for automation; see tpo/tpa/team#41485 for details.
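As a rough illustration of the kind of step worth automating, here is a minimal sketch of a major PostgreSQL version bump using the standard postgresql-common tools shipped with Debian. The version numbers are examples (15 on bookworm, 17 on trixie), and the real procedure involves more checks and backups; the automation work itself is tracked in tpo/tpa/team#41485.

```
#!/bin/sh
# Minimal sketch of a major PostgreSQL version bump with Debian's
# postgresql-common tools; version numbers are examples and the real
# procedure adds backups and verification steps.
set -eu

pg_lsclusters                    # show existing clusters, e.g. 15/main and 17/main
pg_dropcluster --stop 17 main    # drop the empty cluster created by the new package
pg_upgradecluster 15 main        # migrate 15/main to a new 17/main cluster
# pg_dropcluster 15 main         # remove the old cluster once the new one is verified
```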

Batch 1: low complexity

This is scheduled over two weeks: TPA boxes will be upgraded in the last week of April, and Tails machines in the first week of May.

The idea is to start the upgrade long enough before the vacations to give us plenty of time to recover, and some room to start the second batch.

In April, Debian should also be in "soft freeze", not quite a fully "stable" environment, but that should be good enough for simple setups.

36 TPA machines:

- [ ] archive-01.torproject.org
- [ ] cdn-backend-sunet-02.torproject.org
- [ ] chives.torproject.org
- [ ] dal-rescue-01.torproject.org
- [ ] dal-rescue-02.torproject.org
- [ ] gayi.torproject.org
- [ ] hetzner-hel1-02.torproject.org
- [ ] hetzner-hel1-03.torproject.org
- [ ] hetzner-nbg1-01.torproject.org
- [ ] hetzner-nbg1-02.torproject.org
- [ ] idle-dal-02.torproject.org
- [ ] idle-fsn-01.torproject.org
- [ ] lists-01.torproject.org
- [ ] loghost01.torproject.org
- [ ] mandos-01.torproject.org
- [ ] media-01.torproject.org
- [ ] metricsdb-01.torproject.org
- [ ] minio-01.torproject.org
- [ ] mta-dal-01.torproject.org
- [ ] mx-dal-01.torproject.org
- [ ] neriniflorum.torproject.org
- [ ] ns3.torproject.org
- [ ] ns5.torproject.org
- [ ] palmeri.torproject.org
- [ ] perdulce.torproject.org
- [ ] srs-dal-01.torproject.org
- [ ] ssh-dal-01.torproject.org
- [ ] static-gitlab-shim.torproject.org
- [ ] staticiforme.torproject.org
- [ ] static-master-fsn.torproject.org
- [ ] submit-01.torproject.org
- [ ] vault-01.torproject.org
- [ ] web-dal-07.torproject.org
- [ ] web-dal-08.torproject.org
- [ ] web-fsn-01.torproject.org
- [ ] web-fsn-02.torproject.org

4 Tails machines:

ecours.tails.net
puppet.lizard
skink.tails.net
stone.tails.net

In the first batch of bookworm upgrades, we ended up taking about 20 minutes per machine, all done in a single day, although the second batch took longer.

It's probably safe to estimate 20 hours for this work (30 minutes for each of the 40 machines), spread over a single week.
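To put that estimate in perspective, the core of each upgrade is mostly mechanical. The sketch below is a deliberately simplified illustration, not the actual TPA procedure, which adds backups, checklists and service-specific verifications.

```
# Deliberately simplified illustration of the per-machine core of the upgrade;
# the documented TPA procedure has many more steps.
sed -i 's/bookworm/trixie/g' /etc/apt/sources.list /etc/apt/sources.list.d/*.list
apt update
apt full-upgrade
reboot
```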

Feedback and coordination of this batch happens in issue batch 1.

Batch 2: moderate complexity

This is scheduled for the last week of May for TPA machines, and the first week of June for Tails.

At this point, Debian testing should be in "hard freeze", which should be more stable.

39 TPA machines:

- [ ] anonticket-01.torproject.org
- [ ] backup-storage-01.torproject.org
- [ ] bacula-director-01.torproject.org
- [ ] btcpayserver-02.torproject.org
- [ ] bungei.torproject.org
- [ ] carinatum.torproject.org
- [ ] check-01.torproject.org
- [ ] ci-runner-x86-02.torproject.org
- [ ] ci-runner-x86-03.torproject.org
- [ ] colchicifolium.torproject.org
- [ ] collector-02.torproject.org
- [ ] crm-int-01.torproject.org
- [ ] dangerzone-01.torproject.org
- [ ] donate-01.torproject.org
- [ ] donate-review-01.torproject.org
- [ ] forum-01.torproject.org
- [ ] gitlab-02.torproject.org
- [ ] henryi.torproject.org
- [ ] materculae.torproject.org
- [ ] meronense.torproject.org
- [ ] metricsdb-02.torproject.org
- [ ] metrics-store-01.torproject.org
- [ ] onionbalance-02.torproject.org
- [ ] onionoo-backend-03.torproject.org
- [ ] polyanthum.torproject.org
- [ ] probetelemetry-01.torproject.org
- [ ] rdsys-frontend-01.torproject.org
- [ ] rdsys-test-01.torproject.org
- [ ] relay-01.torproject.org
- [ ] rude.torproject.org
- [ ] survey-01.torproject.org
- [ ] tbb-nightlies-master.torproject.org
- [ ] tb-build-02.torproject.org
- [ ] tb-build-03.torproject.org
- [ ] tb-build-06.torproject.org
- [ ] tb-pkgstage-01.torproject.org
- [ ] tb-tester-01.torproject.org
- [ ] telegram-bot-01.torproject.org
- [ ] weather-01.torproject.org

17 Tails machines:

apt-proxy.lizard
apt.lizard
bitcoin.lizard
bittorrent.lizard
bridge.lizard
dns.lizard
dragon.tails.net
gitlab-runner.iguana
iguana.tails.net
lizard.tails.net
mail.lizard
misc.lizard
puppet-git.lizard
rsync.lizard
teels.tails.net
whisperback.lizard
www.lizard

The second batch of bookworm upgrades took 33 hours for 31 machines, so about one hour per box. Here we have 56 machines, so it will likely take us 60 hours (or two weeks) to complete the upgrade.

Feedback and coordination of this batch happens in issue batch 2.

Batch 3: high complexity

Those machines are harder to upgrade, or more critical. In the case of TPA machines, this batch typically groups the Ganeti servers and all the "snowflake" servers that are not properly Puppetized and carry a lot of legacy, namely the LDAP, DNS, and Puppet servers.

That said, we waited a long time to upgrade the Ganeti cluster for bookworm, and it turned out to be trivial, so perhaps those could eventually be made part of the second batch.

15 TPA machines:

- [ ] alberti.torproject.org
- [ ] dal-node-01.torproject.org
- [ ] dal-node-02.torproject.org
- [ ] dal-node-03.torproject.org
- [ ] fsn-node-01.torproject.org
- [ ] fsn-node-02.torproject.org
- [ ] fsn-node-03.torproject.org
- [ ] fsn-node-04.torproject.org
- [ ] fsn-node-05.torproject.org
- [ ] fsn-node-06.torproject.org
- [ ] fsn-node-07.torproject.org
- [ ] fsn-node-08.torproject.org
- [ ] nevii.torproject.org
- [ ] pauli.torproject.org
- [ ] puppetdb-01.torproject.org

It seems like the bookworm Ganeti upgrade took roughly 10h of work. We ballpark the rest of the batch at another 10h of work, so possibly 20h in total.

11 Tails machines:

- [ ] isoworker1.dragon
- [ ] isoworker2.dragon
- [ ] isoworker3.dragon
- [ ] isoworker4.dragon
- [ ] isoworker5.dragon
- [ ] isoworker6.iguana
- [ ] isoworker7.iguana
- [ ] isoworker8.iguana
- [ ] jenkins.dragon
- [ ] survey.lizard
- [ ] translate.lizard

The challenge with Tails upgrades is the coordination with the Tails team, in particular for the Jenkins upgrades.

Feedback and coordination of this batch happens in issue batch 3.

Cleanup work

Once the upgrade is completed and the entire fleet is again running a single OS, it's time for cleanup. This involves updating configuration files to the new versions, removing old compatibility code in Puppet, removing old container images, and generally wrapping things up.

This process has been historically neglected, but we're hoping to wrap this up, worst case in 2026.
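As a concrete example of what that compatibility cleanup looks like, once the whole fleet runs trixie we can hunt down branches keyed on older codenames in the Puppet code base. The command below is an illustration only, and the path is an assumption about where the modules are checked out.

```
# Illustration only: find leftover bullseye/bookworm compatibility branches in
# the Puppet code base (the path is an assumption).
grep -rn --include='*.pp' --include='*.epp' -E 'bullseye|bookworm' modules/
```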

Timeline

  • 2025-Q2
    • W14 (first week of April): installer defaults changed and first tests in production
    • W19 (first week of May): Batch 1 upgrades, TPA machines
    • W20 (second week of May): Batch 1 upgrades, Tails machines
    • W23 (first week of June): Batch 2 upgrades, TPA machines
    • W24 (second week of June): Batch 2 upgrades, Tails machines
  • 2025-Q3 to Q4: Batch 3 upgrades
  • 2026+: cleanup

Deadline

The community has until the beginning of the above timeline to raise concerns or objections.

Two weeks before performing the upgrades of each batch, a new announcement will be sent with details of the changes and impacted services.

Alternatives considered

Retirements or rebuilds

We do not plan any major retirements or rebuilds as part of the third batch this time.

In the future, we hope to decouple those as much as possible, as the Icinga retirement and the Mailman 3 migration became blockers that significantly slowed down the bookworm upgrade. In both cases, however, the work was challenging and had to be performed one way or another, so it's unclear if we can optimize this any further.

We are clear, however, that we will not postpone an upgrade for a server retirement. Dangerzone, for example, is scheduled for retirement (TPA-RFC-78) but its upgrade is still planned as normal above.

Costs

| Task              | Estimate | Uncertainty | Worst case |
|-------------------|----------|-------------|------------|
| Automation        | 20h      | extreme     | 100h       |
| Installer changes | 4h       | low         | 4.4h       |
| Batch 1           | 20h      | low         | 22h        |
| Batch 2           | 60h      | medium      | 90h        |
| Batch 3           | 20h      | high        | 40h        |
| Cleanup           | 20h      | medium      | 30h        |
| Total             | 144h     | ~high       | ~286h      |

The work here should add up to over 140 hours, or 18 days, or about 4 weeks full time. The worst case doubles that.

The above is done in "hours" because that's how we estimated batches in the past, but here's an estimate that's based on the Kaplan-Moss estimation technique.

| Task              | Estimate | Uncertainty | Worst case |
|-------------------|----------|-------------|------------|
| Automation        | 3d       | extreme     | 15d        |
| Installer changes | 1d       | low         | 1.1d       |
| Batch 1           | 3d       | low         | 3.3d       |
| Batch 2           | 10d      | medium      | 20d        |
| Batch 3           | 3d       | high        | 6d         |
| Cleanup           | 3d       | medium      | 4.5d       |
| Total             | 23d      | ~high       | ~50d       |

This is roughly equivalent, if a little higher (23 days instead of 18).
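For reference, the worst-case column in the hours table is simply each estimate multiplied by an uncertainty factor; assuming the usual Kaplan-Moss multipliers (low ×1.1, medium ×1.5, high ×2, extreme ×5), the total checks out:

```
# Sanity check of the worst-case total in the hours table, assuming the usual
# Kaplan-Moss uncertainty multipliers (low x1.1, medium x1.5, high x2, extreme x5):
echo '20*5 + 4*1.1 + 20*1.1 + 60*1.5 + 20*2 + 20*1.5' | bc   # -> 286.4, i.e. ~286h
```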

It should be noted that automation is not expected to drastically reduce the total time spent on batches (currently 16 days, or 100 hours). The main goal of automation is to reduce the likelihood of catastrophic errors and to make it easier to share our upgrade procedure with the world. We're still hoping to reduce the time spent on batches, hopefully by 10-20%, which would bring the total across batches from 16 days down to around 14, or from 100 hours to 80.

Approvals required

This proposal needs approval from TPA team members, but service admins can request additional delay if they are worried about their service being affected by the upgrade.

Comments or feedback can be provided in issues linked above, or the general process can be commented on in issue tpo/tpa/team#41990.

References