Summary: migrate all Git storage to the new gitaly-01 back-end in the coming week; each Git repository will be read-only during its own migration.

Proposal

Progressively move all Git repositories to the new Gitaly server during week 29. It will be impossible to push new commits to a repository while it is being migrated.

This should be a series of short (seconds to minutes), scoped outages, as each repository is marked read-only, one at a time, while it is migrated. See "Impact" below for what that means more precisely.

The Gitaly migration procedure seems well tested and robust, as each repository is checksummed before and after migration.

We hope this will improve overall performance on the GitLab server. It is also part of the design that upstream GitLab suggests for scaling an installation of our size.
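The before-and-after comparison can be pictured roughly as follows. This is a minimal sketch of the idea only, not Gitaly's actual checksum implementation: the `refs_checksum` and `verify_move` helpers are hypothetical names, and the sketch assumes a repository can be summarized as a mapping of ref names to object IDs.

```python
import hashlib


def refs_checksum(refs):
    """Return an order-independent digest of a {ref_name: object_id} mapping.

    Hypothetical helper, for illustration only; Gitaly computes its own
    checksum internally.
    """
    combined = 0
    for name, oid in refs.items():
        digest = hashlib.sha256(f"{name} {oid}".encode("utf-8")).digest()
        # XOR keeps the result independent of the iteration order.
        combined ^= int.from_bytes(digest, "big")
    return f"{combined:064x}"


def verify_move(refs_before, refs_after):
    """Return True only if both sides of the move hold the same refs."""
    return refs_checksum(refs_before) == refs_checksum(refs_after)
```

A move would be considered safe only when `verify_move` returns true for the ref listings taken on the old and new storage.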

Affected projects

We plan on migrating the following namespaces, in order:

alpha phase, day one (2025-07-14)

This is mostly dogfooding and automation:

  1. anarcat (already done)
  2. tpo/tpa
  3. tpo/web

beta phase, day two (2025-07-15)

This phase brings in testers outside of TPA, on projects that are less mission-critical and could survive some issues with their Git repositories.

  1. tpo/community
  2. tpo/onion-services
  3. tpo/anti-censorship
  4. tpo/network-health

production phase, day two or three (2025-07-15+)

This is essentially all remaining projects:

  1. tpo/core (includes c-tor and Arti!)
  2. tpo/applications (includes Tor Browser and Mullvad Browser)
  3. all remaining projects
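The phased ordering above could be scripted along these lines. This is only a sketch under stated assumptions, not the procedure TPA actually uses: the helper names are hypothetical, the storage name passed as `destination` is an assumption (the name configured in GitLab may differ from the host name), and the request builder targets GitLab's project repository storage moves API as we understand it from the upstream documentation. Nothing is sent over the network here.

```python
# Namespaces per phase, copied from the plan above.
PHASES = [
    ("alpha", ["anarcat", "tpo/tpa", "tpo/web"]),
    ("beta", ["tpo/community", "tpo/onion-services",
              "tpo/anti-censorship", "tpo/network-health"]),
    ("production", ["tpo/core", "tpo/applications"]),
]


def rank(project_path):
    """Return (phase index, namespace index); unlisted projects sort last."""
    for i, (_phase, namespaces) in enumerate(PHASES):
        for j, ns in enumerate(namespaces):
            if project_path == ns or project_path.startswith(ns + "/"):
                return (i, j)
    return (len(PHASES) - 1, len(PHASES[-1][1]))


def phase_of(project_path):
    """Return the phase name in which a project would be migrated."""
    return PHASES[rank(project_path)[0]][0]


def migration_order(project_paths):
    """Sort project paths in the order the plan would migrate them."""
    return sorted(project_paths, key=rank)


def storage_move_request(base_url, project_id, destination):
    """Build (method, url, payload) for one repository storage move."""
    url = (f"{base_url.rstrip('/')}/api/v4/projects/"
           f"{project_id}/repository_storage_moves")
    return ("POST", url, {"destination_storage_name": destination})
```

Keeping the ordering separate from the request makes it easy to review the queue, or move a project up in it, before anything is migrated.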

Objections and exceptions

If you do not want any such disruption in your project, please let us know before the deadline (2025-07-15) so we can skip your project. But we would rather migrate all projects off of the server to simplify the architecture and better understand the impact of the change.

We would like, in particular, to migrate all of the tpo/applications repositories in the coming week.

Conversely, if you want your project to be prioritized (it might mean a performance improvement!), let us know and you can jump the queue!

Impact

Projects read-only during migration

While a project is being migrated, it is "read-only": no changes can be made to the Git repository.

We believe that other features in projects (like issues and comments) should still work, but the upstream documentation on this is not exactly clear:

"To ensure data integrity, projects are put in a temporary read-only state for the duration of the move. During this time, users receive a 'The repository is temporarily read-only. Please try again later.' message if they try to push new commits."

So far our test migrations have been so fast (a couple of seconds per project) that we have not really been able to test this properly.

In practice, we don't expect users to notice this migration. In our tests, a 120MB repository was migrated in a couple of seconds, so apart from very large repositories, most read-only periods should last less than a minute.

We estimate that our largest repositories (the Firefox forks) will take 5 to 10 minutes each to migrate, and that the entire migration would take less than 2 hours in total to shift between the two servers if it were performed in one shot.

Additional complexity for TPA

TPA will need to get familiar with this new service. Installation documentation is available and all the code developed to deploy the service is visible in an internal merge request.

I understand this is a big change right before going on vacation, so any TPA member can veto this and switch to the alternative, a partial or on-demand migration.

Timeline

We plan on starting this work on July 15th, the coming Tuesday.

Hardware

Like the current Git repositories on gitlab-02, the Git repositories on gitaly-01 will be hosted on NVMe disks.

Background

GitLab has been having performance problems for a long time now. For almost as long, we've had the project to "scale GitLab to 2,000 users" (tpo/tpa/team#40479). While we believe bots (and now, in particular, Large Language Model (LLM) botnets) are responsible for a lot of that load, our last performance incident concluded by observing that there seems to be a correlation between real usage and performance issues.

Indeed, during the July break, GitLab's performance was stellar and, on Monday, as soon as Europe woke up from the break, GitLab's performance collapsed again. And while it's possible that bots are driven by the same schedule as Tor people, we now feel it's simply time to scale the resources associated with one of our most important services.

Gitaly is GitLab's implementation of a Git server. It's basically a web interface that translates gRPC requests into Git operations. It's currently running on the same server as the main GitLab app, but a new server has been built. More servers could be built as needed as well.

Anarcat performed benchmarks showing equivalent or better performance of the new Gitaly server, even when influenced by the load of the current GitLab server. We expect the new server to reduce the load on the main GitLab server, but it's not clear by how much just yet.

We're hoping this new architecture will give us more flexibility to deploy more such back-ends in the future and to isolate performance issues, which should improve diagnostics. It's part of the normal roadmap in scaling a large GitLab installation such as ours.

Alternatives considered

Full read-only backups

We have considered performing a full backup of all Git repositories before the migration. Unfortunately, this would require setting a read-only mode on all of GitLab for the duration of the backup. According to our tests, that could take anywhere from 20 to 60 minutes, which seemed like an unacceptable downtime.

Note that we have nightly backups of the GitLab server, of course, which is also backed by RAID-10 disk arrays on two different servers. We're only talking about a fully consistent Git backup here. Our normal backups (which, rarely, can be inconsistent and require manual work to reconnect some refs) are typically sufficient anyway. See tpo/tpa/team#40518 for a discussion on GitLab backups.

Partial or on-demand migration

We have also considered taking a more piecemeal approach and migrating only some repositories. We worry that this approach would lead to confusion about the real impact of the migration.

Still, if any TPA member feels strongly enough about this to veto this proposal, we can take this path and migrate only a few repositories instead.

We could, for example, migrate only the "alpha" targets and a few key repositories in the tpo/applications and tpo/core groups (since they're prime crawler targets), and leave the mass migration to a later time, with a longer test period.

References and discussions

See the discussion issue for comments and more background.