This page documents the upgrade procedure, known problems and upgrade progress of the fleet. Progress is mainly tracked in the %Debian 13 trixie upgrade milestone, but a section at the end of this document tracks actual numbers over time.
- Procedure
- Service-specific upgrade procedures
- Issues
- Notable changes
- Troubleshooting
- References
- Fleet-wide changes
- Per host progress
Procedure
This procedure is designed to be applied, in batch, on multiple servers. Do NOT follow this procedure unless you are familiar with the command line and the Debian upgrade process. It has been crafted by and for experienced system administrators that have dozens if not hundreds of servers to upgrade.
In particular, it runs almost completely unattended: configuration file changes are not prompted for during the upgrade and are simply not applied, which will break services in many cases. We use the clean_conflicts script to handle them all in one shot and shorten the upgrade process (without it, configuration file prompts stop the upgrade at more or less random times). Then those changes get applied after a reboot. And yes, that's even more dangerous.
See the "conflicts resolution" section below for how to handle
clean_conflicts output.
Preparation
- Ensure that there are up-to-date backups for the host. This means you should
manually run:
- a system-wide backup for the host
- any other relevant backups such as, for example, a PostgreSQL backup
- Check the release notes for the services running on the host
- Check whether there are Debian bugs or relevant notes in the
README.Debian file for important packages that are specific to the host
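As a rough aid for the last check, the sketch below lists the README.Debian files shipped by a set of packages. It assumes the usual Debian layout under /usr/share/doc/<package>/; the function name and the DOC_ROOT override are hypothetical and only there to make the example testable.

```shell
#!/bin/sh
# List README.Debian files (plain or gzipped) shipped by the given packages.
# DOC_ROOT is a hypothetical override; on a Debian host the documentation
# normally lives under /usr/share/doc/<package>/.
list_readme_debian() {
    doc_root="${DOC_ROOT:-/usr/share/doc}"
    for pkg in "$@"; do
        for f in "$doc_root/$pkg/README.Debian" "$doc_root/$pkg/README.Debian.gz"; do
            if [ -e "$f" ]; then
                echo "$f"
            fi
        done
    done
}
```

The paths it prints can then be read with zless, for example after running `list_readme_debian postgresql-common`.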
Automated procedure
Starting with trixie, TPA scripted the whole upgrade procedure,
which now lives in Fabric, under the upgrade.major task, and is
still being tested.
In general, you should be able to run this from your workstation:
cd fabric-tasks
ttyrec -a -e tmux major-upgrade.log
fab -H test-01.torproject.org upgrade.major
If a step fails, you can resume from that step with:
fab -H test-01.torproject.org upgrade.major --start=4
By default, the script will be more careful: it will run upgrades in
two stages, and prompt for NEWS items (but not config file diffs). You
can skip those (and have the NEWS items logged instead) by using the
--reckless flag. The --autopurge flag also cleans up stale
packages at the end automatically.
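Since the procedure is meant to be applied in batch, it can help to generate the commands for a list of hosts and review them before running anything. The following is a minimal dry-run sketch: it only prints commands. The function name is hypothetical, as is the second host name used in the usage note, and combining --reckless with --autopurge in one call is an assumption.

```shell
#!/bin/sh
# Print (not run) one upgrade command per host read from standard input.
# Extra arguments (for example --reckless or --autopurge) are appended as-is.
print_upgrade_commands() {
    while read -r host; do
        # skip blank lines and comments in the host list
        case "$host" in
            ''|'#'*) continue ;;
        esac
        # trim the trailing space left when no extra argument is given
        echo "fab -H $host upgrade.major $*" | sed 's/ *$//'
    done
}
```

For example, `printf 'test-01.torproject.org\ntest-02.torproject.org\n' | print_upgrade_commands --reckless` prints two fab command lines that can be pasted or piped to a shell once reviewed.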
Legacy procedure
IMPORTANT NOTE: This procedure is currently being rewritten as a Fabric job, see above.
1. Preparation:

echo reset to the default locale &&
export LC_ALL=C.UTF-8 &&
echo install some dependencies &&
sudo apt install ttyrec screen debconf-utils &&
echo create ttyrec file with adequate permissions &&
sudo touch /var/log/upgrade-trixie.ttyrec &&
sudo chmod 600 /var/log/upgrade-trixie.ttyrec &&
sudo ttyrec -a -e screen /var/log/upgrade-trixie.ttyrec
2. Backups and checks:

( umask 0077 &&
  tar cfz /var/backups/pre-trixie-backup.tgz /etc /var/lib/dpkg /var/lib/apt/extended_states /var/cache/debconf $( [ -e /var/lib/aptitude/pkgstates ] && echo /var/lib/aptitude/pkgstates ) &&
  dpkg --get-selections "*" > /var/backups/dpkg-selections-pre-trixie.txt &&
  debconf-get-selections > /var/backups/debconf-selections-pre-trixie.txt
) &&
: lock down puppet-managed postgresql version &&
(
  if jq -re '.resources[] | select(.type=="Class" and .title=="Profile::Postgresql") | .title' < /var/lib/puppet/client_data/catalog/$(hostname -f).json; then
    echo "tpa_preupgrade_pg_version_lock: '$(ls /var/lib/postgresql | grep '[0-9][0-9]*' | sort -n | tail -1)'" > /etc/facter/facts.d/tpa_preupgrade_pg_version_lock.yaml
  fi
) &&
: pre-upgrade puppet run &&
( puppet agent --test || true ) &&
apt-mark showhold &&
dpkg --audit &&
echo look for dkms packages and make sure they are relevant, if not, purge. &&
( dpkg -l '*dkms' || true ) &&
echo look for leftover config files &&
/usr/local/sbin/clean_conflicts &&
echo make sure backups are up to date in Bacula &&
printf "End of Step 2\a\n"
3. Enable module loading (for Ferm), disable Puppet and test reboots:

systemctl disable modules_disabled.timer &&
puppet agent --disable "running major upgrade" &&
shutdown -r +1 "trixie upgrade step 3: rebooting with module loading enabled"

To put the server in maintenance here, you need to silence the alerts related to that host, for example with this Fabric task, run locally:

fab silence.create -m 'alias=idle-fsn-01.torproject.org' --comment "performing major upgrade"

You can do all of this with the reboot job:

fab -H test-01.torproject.org fleet.reboot-host \
    --delay-shutdown-minutes=1 \
    --reason="trixie upgrade step 3: rebooting with module loading enabled" \
    --force \
    --silence-ends-at="in 1 hour"
4. Perform any pending upgrade and clear out old pins:

export LC_ALL=C.UTF-8 &&
sudo ttyrec -a -e screen /var/log/upgrade-trixie.ttyrec

Then, inside the recorded session:

apt update && apt -y upgrade &&
echo Check for pinned, on hold, packages, and possibly disable &&
rm -f /etc/apt/preferences /etc/apt/preferences.d/* &&
rm -f /etc/apt/sources.list.d/backports.debian.org.list &&
rm -f /etc/apt/sources.list.d/backports.list &&
rm -f /etc/apt/sources.list.d/trixie.list &&
rm -f /etc/apt/sources.list.d/bookworm.list &&
rm -f /etc/apt/sources.list.d/*-backports.list &&
rm -f /etc/apt/sources.list.d/experimental.list &&
rm -f /etc/apt/sources.list.d/incoming.list &&
rm -f /etc/apt/sources.list.d/proposed-updates.list &&
rm -f /etc/apt/sources.list.d/sid.list &&
rm -f /etc/apt/sources.list.d/testing.list &&
echo purge removed packages &&
apt purge $(dpkg -l | awk '/^rc/ { print $2 }') &&
echo purge obsolete packages &&
apt purge '?obsolete' &&
echo autoremove packages &&
apt autoremove -y --purge &&
echo possibly clean up old kernels &&
dpkg -l 'linux-image-*' &&
echo look for packages from backports, other suites or archives &&
echo if possible, switch to official packages by disabling third-party repositories &&
apt list "?narrow(?installed, ?not(?codename($(lsb_release -c -s | tail -1))))" &&
printf "End of Step 4\a\n"
5. Check free space (see this guide to free up space), disable auto-upgrades, and download packages:

systemctl stop apt-daily.timer &&
sed -i 's#bookworm-security#trixie-security#' $(ls /etc/apt/sources.list /etc/apt/sources.list.d/*) &&
sed -i 's/bookworm/trixie/g' $(ls /etc/apt/sources.list /etc/apt/sources.list.d/*) &&
apt update &&
apt -y -d full-upgrade &&
apt -y -d upgrade &&
apt -y -d dist-upgrade &&
df -h &&
printf "End of Step 5\a\n"
6. Actual upgrade step.

Optional, minimal upgrade run (avoids new installs or removals):

sudo touch /etc/nologin &&
env DEBIAN_FRONTEND=noninteractive APT_LISTCHANGES_FRONTEND=log APT_LISTBUGS_FRONTEND=none UCF_FORCE_CONFFOLD=y \
    apt upgrade --without-new-pkgs -y -o Dpkg::Options::='--force-confdef' -o Dpkg::Options::='--force-confold'

Full upgrade:

sudo touch /etc/nologin &&
env DEBIAN_FRONTEND=noninteractive APT_LISTCHANGES_FRONTEND=log APT_LISTBUGS_FRONTEND=none UCF_FORCE_CONFFOLD=y \
    apt full-upgrade -y -o Dpkg::Options::='--force-confdef' -o Dpkg::Options::='--force-confold' &&
printf "End of Step 6\a\n"

If this is a sensitive server, consider APT_LISTCHANGES_FRONTEND=pager and reviewing the NEWS files before continuing.
7. Post-upgrade procedures:

: review the NEWS items &&
if [ -f /var/log/apt/listchanges.log ] ; then less /var/log/apt/listchanges.log; fi &&
apt-get update --allow-releaseinfo-change &&
puppet agent --enable &&
puppet agent -t --noop &&
printf "Press enter to continue, Ctrl-C to abort." &&
read -r _ &&
(puppet agent -t || true) &&
echo deploy upgrades after possible Puppet sources.list changes &&
apt update && apt upgrade -y &&
rm -f \
    /etc/ssh/ssh_config.dpkg-dist \
    /etc/syslog-ng/syslog-ng.conf.dpkg-dist \
    /etc/ca-certificates.conf.dpkg-old \
    /etc/cron.daily/bsdmainutils.dpkg-remove \
    /etc/systemd/system/fstrim.timer \
    /etc/apt/apt.conf.d/50unattended-upgrades.ucf-dist \
    /etc/bacula/bacula-fd.conf.ucf-dist &&
printf "\a" &&
/usr/local/sbin/clean_conflicts &&
systemctl start apt-daily.timer &&
rm /etc/nologin &&
printf "End of Step 7\a\n"

Reboot the host from Fabric:

fab -H test-01.torproject.org fleet.reboot-host \
    --delay-shutdown-minutes=1 \
    --reason="major upgrade: removing old kernel image" \
    --force \
    --silence-ends-at="in 1 hour"
8. Service-specific upgrade procedures

If the server is hosting a more complex service, follow the relevant service-specific upgrade procedures.

IMPORTANT: make sure you test the services at this point, or at least notify the admins responsible for the service so they do so. This way, problems introduced by the upgrade are found early.
9. Post-upgrade cleanup:

export LC_ALL=C.UTF-8 &&
sudo ttyrec -a -e screen /var/log/upgrade-trixie.ttyrec

echo consider apt-mark minimize-manual

apt-mark manual bind9-dnsutils &&
apt purge apt-forktracer &&
echo purging removed packages &&
apt purge '~c' && apt autopurge &&
echo trying a deborphan replacement &&
apt-mark auto '~i !~M (~slibs|~soldlibs|~sintrospection)' &&
apt-mark auto $(apt search 'transition(|n)($|ing|al|ary| package| purposes)' | grep '^[^ ].*\[installed' | sed 's,/.*,,') &&
apt-mark auto $(apt search dummy | grep '^[^ ].*\[installed' | sed 's,/.*,,') &&
apt autopurge &&
echo review obsolete and odd packages &&
apt purge '?obsolete' && apt autopurge &&
apt list "?narrow(?installed, ?not(?codename($(lsb_release -c -s | tail -1))))" &&
apt clean &&
echo review installed kernels: &&
dpkg -l 'linux-image*' | less &&
printf "End of Step 9\a\n"

One last reboot, with Fabric:

fab -H test-01.torproject.org fleet.reboot-host \
    --delay-shutdown-minutes=1 \
    --reason="last major upgrade step: testing reboots one final time" \
    --force \
    --silence-ends-at="in 1 hour"

On PostgreSQL servers that have the apt.postgresql.org sources.list, you also need to downgrade to the trixie versions:

apt install \
    postgresql-17=17.4-2 \
    postgresql-client-17=17.4-2 \
    postgresql=17+277 \
    postgresql-client-common=277 \
    postgresql-common=277 \
    postgresql-common-dev=277 \
    libpq5=17.4-2 \
    pgbackrest=2.54.2-1 \
    pgtop=4.1.1-1 \
    postgresql-client=17+277 \
    python3-psycopg2=2.9.10-1+b1

Note that this would be better done with pins (which is what the Fabric task does).
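As an illustration of the pin-based approach, here is a hedged sketch of a preferences stanza that lets APT downgrade the PostgreSQL packages to the versions in the Debian trixie suite. It is not taken from the Fabric task: the function name, the package globs and the priority are assumptions, relying only on the documented APT behaviour that a priority above 1000 allows downgrades.

```shell
#!/bin/sh
# Write an APT preferences stanza preferring the Debian trixie versions of
# the PostgreSQL-related packages, with a priority above 1000 so that APT
# is allowed to downgrade them. The package list here is an assumption.
# The target path is an argument so the file can be reviewed before it is
# copied under /etc/apt/preferences.d/.
write_trixie_pg_pin() {
    cat > "$1" <<'EOF'
Package: postgresql* libpq5 pgbackrest pgtop python3-psycopg2
Pin: release o=Debian,n=trixie
Pin-Priority: 1001
EOF
}
```

After installing such a file, `apt policy postgresql-17` shows which version APT would pick, which is a cheap way to check the pin before upgrading. Note that the stanza matches only the trixie code name, so packages from a separate security suite would need their own stanza.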
Conflicts resolution
When the clean_conflicts script gets run, it asks you to check each
configuration file that was modified locally but that the Debian
package upgrade wants to overwrite. You need to make a decision on
each file. This section aims to provide guidance on how to handle
those prompts.
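To get an overview before answering the prompts, the leftover files can be listed by hand. The sketch below is not the clean_conflicts script itself, only a read-only helper with a hypothetical name; it looks for the standard suffixes left behind by dpkg and ucf.

```shell
#!/bin/sh
# List configuration file leftovers from dpkg and ucf under a directory
# (default: /etc). This only lists files; it does not resolve anything.
list_conffile_leftovers() {
    find "${1:-/etc}" -type f \( \
        -name '*.dpkg-dist' -o -name '*.dpkg-old' -o -name '*.dpkg-new' \
        -o -name '*.ucf-dist' -o -name '*.ucf-old' -o -name '*.ucf-new' \
    \) | sort
}
```

Each listed file can then be compared with the live configuration, for example with `diff -u /etc/default/grub /etc/default/grub.dpkg-dist`.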
Those config files should be manually checked on each host:
/etc/default/grub.dpkg-dist
/etc/initramfs-tools/initramfs.conf.dpkg-dist
The grub config file, in particular, should be restored to the
upstream default and host-specific configuration moved to the grub.d
directory.
All of the following files can be kept as current (choose "N" when asked) because they are all managed by Puppet:
/etc/puppet/puppet.conf
/etc/default/puppet
/etc/default/bacula-fd
/etc/ssh/sshd_config
/etc/syslog-ng/syslog-ng.conf
/etc/ldap/ldap.conf
/etc/ntpsec/ntp.conf
/etc/default/ntpsec
/etc/ssh/ssh_config
/etc/bacula/bacula-fd.conf
/etc/apt/apt.conf.d/50unattended-upgrades
The following files should be replaced by the upstream version (choose "Y" when asked):
/etc/ca-certificates.conf
If other files come up, they should be added to the above decision
list, or to an operation in step 2 or 7 of the above procedure, before
the clean_conflicts call.
Files that should be updated in Puppet are mentioned in the Issues section below as well.
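The decision list above can also be expressed as a small lookup, which is handy when going through many hosts. This is a sketch with a hypothetical function name; it only encodes the lists from this section and answers "unknown" for anything else.

```shell
#!/bin/sh
# Map a configuration file path (with or without a dpkg/ucf suffix) to the
# decision documented in this section: "N" (keep current, Puppet-managed),
# "Y" (take the package version), "manual" (check by hand) or "unknown".
conflict_decision() {
    path=$1
    for suffix in .dpkg-dist .dpkg-old .dpkg-new .ucf-dist .ucf-old .ucf-new; do
        path=${path%"$suffix"}
    done
    case "$path" in
        /etc/default/grub | /etc/initramfs-tools/initramfs.conf) echo manual ;;
        /etc/puppet/puppet.conf | /etc/default/puppet) echo N ;;
        /etc/default/bacula-fd | /etc/bacula/bacula-fd.conf) echo N ;;
        /etc/ssh/sshd_config | /etc/ssh/ssh_config) echo N ;;
        /etc/syslog-ng/syslog-ng.conf | /etc/ldap/ldap.conf) echo N ;;
        /etc/ntpsec/ntp.conf | /etc/default/ntpsec) echo N ;;
        /etc/apt/apt.conf.d/50unattended-upgrades) echo N ;;
        /etc/ca-certificates.conf) echo Y ;;
        *) echo unknown ;;
    esac
}
```

For example, `conflict_decision /etc/default/grub.dpkg-dist` prints `manual`. Files reported as `unknown` are the ones to add to the lists above.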
Service-specific upgrade procedures
In general, each service MAY require special considerations when upgrading. Each service page should have an "upgrades" section that documents such procedures.
Those were previously documented here, in the major upgrade procedures, but in the future should be in the service pages.
Here is a list of particularly well known procedures:
- Ganeti
- PostgreSQL
- Puppet (see bookworm, to be moved in service page)
- RT
Issues
See the list of issues in the milestone and also the official list of known issues. We used to document issues here, but now create issues in GitLab instead.
Resolved
needrestart failure
The following error may pop up during execution of apt but will get resolved later on:
Error: Problem executing scripts DPkg::Post-Invoke 'test -x /usr/sbin/needrestart && /usr/sbin/needrestart -o -klw | sponge /var/lib/prometheus/node-exporter/needrestart.prom'
Error: Sub-process returned an error code
Notable changes
Here is a list of notable changes from a system administration perspective:
- TODO
See also the wiki page about trixie for another list.
New packages
TODO
Updated packages
This table summarizes package changes that could be interesting for our project.
| Package | 12 (bookworm) | 13 (trixie) |
|---|---|---|
| Ansible | 7.7 | 11.2 |
| Apache | 2.4.62 | 2.4.63 |
| Bash | 5.2.15 | 5.2.37 |
| Bind | 9.18 | 9.20 |
| Emacs | 28.2 | 30.1 |
| Firefox | 115 | 128 |
| Fish | 3.6 | 4.0 |
| Git | 2.39 | 2.45 |
| GCC | 12.2 | 14.2 |
| Golang | 1.19 | 1.24 |
| Linux kernel | 6.1 | 6.12 |
| LLVM | 14 | 19 |
| MariaDB | 10.11 | 11.4 |
| Nginx | 1.22 | 1.26 |
| OpenJDK | 17 | 21 |
| OpenLDAP | 2.5.13 | 2.6.9 |
| OpenSSL | 3.0 | 3.4 |
| OpenSSH | 9.2 | 9.9 |
| PHP | 8.2 | 8.4 |
| Podman | 4.3 | 5.4 |
| PostgreSQL | 15 | 17 |
| Prometheus | 2.42 | 2.53 |
| Puppet | 7 | 8 |
| Python | 3.11 | 3.13 |
| Rustc | 1.63 | 1.85 |
| Vim | 9.0 | 9.1 |
See the official release notes for the full list from Debian.
Removed packages
- deborphan was removed (1065310), which led to changes in our upgrade procedure, but it's incomplete, see anarcat's notes
See also the noteworthy obsolete packages list.
Deprecation notices
TODO
Troubleshooting
Upgrade failures
Instructions on errors during upgrades can be found in the release notes troubleshooting section.
Reboot failures
If there's any trouble during reboots, you should use some recovery system. The release notes actually have good documentation on that, on top of "use a live filesystem".
References
- Official guide (TODO: review)
- Release notes (TODO: review)
- DSA guide (WIP, last checked 2025-04-16)
- anarcat guide (last sync 2025-04-16)
- Solution proposal to automate this
Fleet-wide changes
The following changes need to be performed once for the entire fleet, generally at the beginning of the upgrade process.
installer changes
The installer needs to be changed to support the new release. This includes:
- the Ganeti installers (add a gnt-instance-debootstrap variant, modules/profile/manifests/ganeti.pp in tor-puppet.git, see commit 4d38be42 for an example)
- the wiki documentation:
  - create a new page like this one documenting the process, linked from howto/upgrades
  - make an entry in the data.csv to start tracking progress (see below), copy the Makefile as well, changing the suite name
  - change the Ganeti procedure so that the new suite is used by default
  - change the Hetzner robot install procedure
- fabric-tasks and the fabric installer
Debian archive changes
The Debian archive on db.torproject.org (currently alberti) needs to
have a new suite added. This can be (partly) done by editing files in
/srv/db.torproject.org/ftp-archive/. Specifically, the following two
files need to be changed:
- apt-ftparchive.config: a new stanza for the suite, basically copy-pasting from a previous entry and changing the suite
- Makefile: add the new suite to the for loop
But that is not enough: the directory structure needs to be crafted by hand as well. A simple way to do so is to replicate the structure of a previous release:
cd /srv/db.torproject.org/ftp-archive
rsync -a --include='*/' --exclude='*' archive/dists/bookworm/ archive/dists/trixie/
Then you also need to modify the Release file to point at the new
release code name (in this case trixie).
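The Release file edit can be done with a plain substitution. The sketch below assumes the file mentions the old code name literally (for example in its Suite and Codename fields); the function name is hypothetical and the path to the Release file is left as an argument because it is not specified here.

```shell
#!/bin/sh
# Replace every occurrence of the old code name with the new one in a
# Release file. Arguments: file, old code name, new code name.
# A temporary file is used instead of "sed -i" to stay portable.
retarget_release_file() {
    sed "s/$2/$3/g" "$1" > "$1.new" && mv "$1.new" "$1"
}
```

It would be called as `retarget_release_file <path to Release> bookworm trixie`; reviewing the result with a diff against the previous release's file is a reasonable sanity check.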
Those were completed as of 2025-04-16.
Per host progress
Note that per-host upgrade policy is in howto/upgrades.
When a critical mass of servers have been upgraded and only "hard" ones remain, they can be turned into tickets and tracked in GitLab. In the meantime...
A list of servers to upgrade can be obtained with:
curl -s -G http://localhost:8080/pdb/query/v4 --data-urlencode 'query=nodes { facts { name = "lsbdistcodename" and value != "trixie" }}' | jq '.[].certname' | sort
Or in Prometheus:
count(node_os_info{version_id!="13"}) by (alias)
Or, by codename, including the codename in the output:
count(node_os_info{version_codename!="trixie"}) by (alias,version_codename)
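Because the target code name changes with every release, the PuppetDB query can be built from a parameter instead of being edited by hand. This is a small sketch with a hypothetical function name; it only prints the query string and does not contact PuppetDB.

```shell
#!/bin/sh
# Print the PuppetDB query selecting nodes whose lsbdistcodename fact is
# not the given target code name, i.e. the hosts still to be upgraded.
pending_hosts_query() {
    printf 'nodes { facts { name = "lsbdistcodename" and value != "%s" }}\n' "$1"
}
```

It plugs into the curl call as `curl -s -G http://localhost:8080/pdb/query/v4 --data-urlencode "query=$(pending_hosts_query trixie)" | jq -r '.[].certname' | sort`.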
The above graphic shows the progress of the migration between major releases. It can be regenerated with the predict-os script, which pulls information from Puppet to update a CSV file that keeps track of progress over time.