Summary: merge the Tails sysadmin shifts with TPA's "star of the week" rotation into a single role, and merge Tails' and TPA's support policies.

Background

The Tails and Tor merge process created a situation in which there are now two separate infrastructures as well as two separate support processes and policies. The full infrastructure merge is expected to take 5 years to complete, but we want to prioritize merging the teams into a single entity.

Proposal

As much as reasonably possible, every team member should be able to handle issues on both the TPA and Tails infrastructures. Reducing specialization will let us share the support workload more evenly among all team members and space out each person's shifts.

Goals

Must have

A list of tasks to be handled during rotations, covering triage, routine tasks and interruption handling, and comprising all expectations of both the TPA "star of the week" and the Tails "sysadmin on shift"
A process to make sure every TPA member is able to support both infrastructures
Guidelines for directing users to the correct place or process to get support

Non-Goals

Merging the following is not a goal of this policy:

Tools used by each team
Mailing lists
Technical workflows

The goal is simply to make everyone comfortable working on both sides of the infrastructure, and to merge rotation shifts.

Support tasks

TPA-RFC-2: Support defines different support levels, but in the context of this proposal we use the tasks that are the responsibility of the "star of the week" as a basis for the merge of rotation shifts:

Triage of new issues
Routine tasks
Keep an eye on the monitoring system (karma and #tor-alerts on IRC)
Organise incident response

Tails processes are merged into each of the items above, though on different timelines.

Triage of new issues

For triage of new issues, we abolish the previous processes used by Tails, and users of Tails services should now:

Stop creating new issues in the tpo/tpa/tails-sysadmin> project, and instead start using the tpo/tpa/team> project or dedicated projects when available (eg. tpo/tpa/puppet-weblate>).
Stop using the ~"To Do" label, and start using per-service labels, when available, or the generic ~"Tails" label when the relevant Tails service doesn't have a specific label.

Tails issues will be triaged using the same process as other TPA issues. Apart from the changes listed above, the process stays the same for any user requesting support.

Routine tasks

The following routine tasks are expected from the Tails Sysadmin on shift:

update ACLs upon request (eg. Gitolite, GitLab, etc)
major upgrades of operating systems
manual upgrades (such as Jenkins, Weblate, etc)
reboot and restart systems for security issues or faults
interface with providers
update GitLab configuration (using gitlab-config)
process abuse reports in Tails' GitLab

Most of these were already listed among TPA's "routine" tasks, and those that were not have now been explicitly added there. Note that, until the infrastructure merge is complete, these tasks will have to be performed on both infrastructures.

The following processes were explicitly listed as expectations of Tails Sysadmins (not necessarily on shift). They are either superseded by the processes TPA currently uses to organize its work or made obsolete:

task	action
avoid work duplication	superseded by TPA's triage process and check-ins
support the sysadmin on shift	superseded by TPA's triage process and check-ins
cover for the sysadmin on shift after 48h of MIA	obsolete
self-evaluation of work	obsolete
shift schedule	eventually replaced by TPA rotations ("star of the week")
Jenkins upgrade (including plugins)	absorbed by TPA as a new task
LimeSurvey upgrade	absorbed by TPA with the LimeSurvey merge
Weblate upgrade	absorbed by TPA as a new task

Monitoring system

As per TPA-RFC-73, the plan is to ditch Tails' Icinga2 in favor of Tor's Prometheus. That migration is blocked by a significant part of the Puppet merge.

Asking the TPA crew to get used to Tails' Icinga2 in the meantime is not a good option because:

Tor has recently ditched Icinga, and asking the team to adopt something like it again would be demotivating.
The system will eventually be replaced anyway, so spending people's time learning it would not be a good investment of resources.

Because of the above, we will delay merging the tasks that depend on the monitoring system until Puppet is merged and the Tails infra has been migrated to Prometheus. We estimate that work on migrating the monitoring system could start in November 2025, so we should not count on it being finished before the end of 2025.

This decision affects some of the routine tasks (eg. examining disk usage, checking whether servers need rebooting) and "keeping an eye on the monitoring system" in general. In the meantime, we can merge triage, the routine tasks that don't depend on the monitoring system, and the organization of incident response.

Incident response

Tails doesn't have a formal incident response process, so the TPA process is simply adopted as is.

Support merge process

The merge process is incremental:

Phase 0: Separate shifts (this is what happens now)
Phase 1: Triage and organization of incident response
Phase 2: Routine tasks
Phase 3: Merged support

Phase 0 - Separate shifts

This phase corresponds to the current situation: two separate support teams, each essentially supporting its own infrastructure.

Phase 1 - Triage and organization of incident response

During this period, the TPA star of the week works together with the Tails Sysadmin on shift on triaging new issues and, when needed, organising incident response.

Each week, two people will be watching the relevant dashboards, and they should communicate to resolve any questions about triage. Similarly, if an incident occurs, they will coordinate to organize the response together.

Phase 2 - Routine tasks

Once Tails monitoring has been migrated to Prometheus, the TPA star of the week and the Tails Sysadmin on shift can start collaborating on routine tasks and, when possible, start working on issues related to "each other's infra".

In this phase we still maintain two separate support calendars, and the Tails+Tor support pairs change every week according to those calendars.

Note that there are many more support requests on the TPA side, and far fewer sysadmin hours on the Tails side, so this work should be shared proportionately. The goal is a smooth onboarding of both teams onto both infras, so team members should support each other and make sure questions get answered and blockers get removed.

Some routine tasks that are not related to monitoring may start before the date set for Phase 2 in the timeline below. Upgrades to Debian Trixie are one example of an activity that will help both teams get comfortable with each other's infra: "To help with merging rotations in the two teams, TPA staff will upgrade Tails machines, with Tails folks assistance, and vice-versa."

Phase 3 - Merged support

Every TPA member is now able to conduct all routine tasks and handle triage and interrupts in both infrastructures. We abolish the "Tails Sysadmin Shifts" calendar and incorporate all TPA members in the "Star of the week" rotation calendar.

Scope

Affected users

This policy mainly affects TPA members and any user of Tails services who needs to make a support request. The most impacted users are members of the Tails Team, as they are the main users of Tails services. Members of the Community and Fundraising teams may eventually be affected too, since they are likely users of some Tails services such as the Tails website and Weblate.

Timeline

Phase	Timeline
Phase 0 - Separate shifts	now - mid-April 2025
Phase 1 - Triage and organization of incident response	mid-April - December 2025
Phase 2 - Routine tasks	January 2026
Phase 3 - Merged support	April 2026
