Summary: a roadmap for 2024
Proposal
Priorities for 2024
Must have
- Debian 12 bookworm upgrade completion (50% done) before July 2024 (so Q1-Q2 2024), which includes:
- puppet server 7 upgrade: Q2 2024? (tpo/tpa/team#41321)
- mailman 3 and schleuder upgrade (probably on a new mail server), hopefully Q2 2024 (tpo/tpa/team#40471)
- Icinga retirement / migration to Prometheus Q3-Q4 2024? (tpo/tpa/team#40755)
- old services retirement
- SVN retirement (or not): proposal in Q2, execution Q3-Q4? (tpo/tpa/team#40260). Nextcloud will not work after all because of major issues with collaborative editing, so we need to go back to the drawing board.
- legacy Git infrastructure retirement (TPA-RFC-36), which includes:
- 12 TPA repos to migrate, some complicated (tpo/tpa/team#41219)
- archiving all other repositories (tpo/tpa/team#41215)
- lockdown scheduled for Q2 2024 (tpo/tpa/team#41213)
- email services? includes:
- draft TPA-RFC-45, which may include:
- mailbox hosting in HA
- minio clustering and backups
- make a decision on gitlab ultimate (tpo/team#202)
Nice to have
- Puppet CI
- review TPA-RFC process (tpo/tpa/team#41428)
- tiered gitlab runners (tpo/tpa/team#41436)
- improve upgrade (tpo/tpa/team#41485) and install (tpo/tpa/team#31239) automation
- disaster recovery planning (tpo/tpa/team#40628)
- monitor technical debt (tpo/tpa/team#41456)
- review team function and scope (TPA? web? SRE?)
Black swans
A black swan event is "an event that comes as a surprise, has a major effect, and is often inappropriately rationalized after the fact with the benefit of hindsight" (Wikipedia). In our case, it's typically an unexpected and unplanned emergency that derails the above plans.
Here are possible changes that are technically not black swans (because they are listed here!) but that could serve as placeholders for the actual events we'll have this year:
- Hetzner evacuation (plan and estimates) (tpo/tpa/team#41448)
- outages, capacity scaling (tpo/tpa/team#41448)
- in general, disaster recovery plans
- possible future changes for internal chat (IRC onboarding?) or sudden requirement to self-host another service currently hosted externally
- some guy named Jerry, who knows!
THE WEB - how we organize it this year
This still needs to be discussed and reviewed with isa.
- call for a "web team meeting"
- discuss priorities with that team
- discuss how we are going to organize ourselves
- announce the hiring this year of a web dev
Reviews
This section documents what happened in 2024. It was written (too) late in 2024, but aims to outline the major events of the year:
- legacy Git infrastructure retirement (TPA-RFC-36): repositories have been massively migrated to GitLab's legacy/gitolite namespace
- Debian 12 bookworm upgrade: currently incomplete (12 hosts left or about 13% of the fleet), but hoping to complete before the end of 2024
- work started on upgrading the legacy mail server and improving deliverability of mail forwards (TPA-RFC-71: emergency email deployments, phase B)
- includes a major upgrade from Mailman 2 to Mailman 3
- Improved build performance on Lektor websites with i18n by 66% (https://gitlab.torproject.org/tpo/web/lego/-/issues/30)
- Retired Nagios in favor of a Prometheus-based alerting setup, with less noise, faster detection, and better coverage (TPA-RFC-33-A: emergency Icinga retirement)
- new donate site!! + dashboard
Other notable RFCs:
- TPA-RFC-60: GitLab 2-factor authentication enforcement: enable 2-factor authentication (2fa) enforcement on the GitLab tpo group and subgroups
- TPA-RFC-62: TPA password manager: switch from pwstore to password-store for (and only for) TPA passwords
- TPA-RFC-63: buy a new backup storage server: a new 80TB (4 drives, expandable to 8) backup server in the secondary location for disaster recovery and the new metrics storage service
- TPA-RFC-64: Puppet TLS certificates: Move from letsencrypt-domains.git to Puppet to manage TLS certificates
- TPA-RFC-68: Idle canary servers: provision test servers that sit idle to monitor infrastructure and stage deployments
- TPA-RFC-67: Retire mini-nag, a legacy extra monitoring system that became unnecessary thanks to "happy eyeballs" implementations, see tpo/tpa/team#41766 for details
Next steps:
- 2025 roadmap still in progress, input welcome, likely going to include putting MinIO in production and figuring out what to do with SVN, alongside cleaning up and publishing our Puppet codebase
- Started merge with Tails! Some services were retired or merged already, but we're mostly at the planning stage, see https://gitlab.torproject.org/tpo/tpa/team/-/issues/41721
- bookworm upgrade completion, considering trixie upgrades in 2025
References
The previous roadmap was established in TPA-RFC-42 and is in roadmap/2023.
Discussion about this proposal is in tpo/tpa/team#41436.
See also the week-by-week planning spreadsheet.