Summary: deploy a 5TiB MinIO server on gnt-fsn, plan for possible future
expansion in gnt-dal, and enforce MinIO bucket quota sizes.
Background
Back in 2023, we drafted TPA-RFC-56 to deploy a 1TiB SSD object
storage server running MinIO in the gnt-dal cluster.
Storage capacity limitations
Since then, the server filled up almost as soon as the network-health team started using it seriously (incident #42077). In the post-mortem of that incident, we realized we needed much more storage than the MinIO server could provide, likely on the order of 5TiB, with yearly growth on top.
Reading back the TPA-RFC-56 background, we note that we had already identified that metrics was using at least 3.6TiB of storage, but we were assuming we could expand the storage capacity of the cluster to cover future growth. This has turned out to be too optimistic: a deteriorating global economic climate has led to a price hike we are unable to follow.
Lack of backups
In parallel, we've found that we want to use MinIO for more production workloads, as the service is working well. This includes services that will require backups. The current service offers no backups whatsoever, so we need to figure out a backup strategy.
Storage use and capacity analysis
As of 2025-03-26, we have about 30TiB available for allocation in
physical volumes on gnt-fsn, aggregated across all servers, but the
minimum available on any single server is closer to 4TiB, with two
servers having more available (5TiB and 7TiB).
gnt-dal has 9TiB available, including 4TiB in costly NVMe
storage. Individual capacity varies wildly: the smallest is 300GiB,
the largest is 783GiB for SSD, 1.5TiB for NVMe.
The new backup-storage-01 server at the gnt-dal point of presence
(PoP) has 34TiB available for allocation and 1TiB used, currently only
for PostgreSQL backups. The old backup server (bungei) at the
gnt-fsn PoP has an emergency 620GiB allocation capacity, with 50TiB
used out of 67TiB in the Bacula backups partition.
In theory, some of that space should be reserved for normal backups, but considering a large part of the backup service is used by the network-health team in the first place, we might be able to allocate at least a third to a half of that capacity (10-16TiB) for object storage, as a rough estimate.
MinIO bucket disk usage
As of 2025-03-26, this is the per-bucket disk usage on the MinIO server:
root@minio-01:~# mc du --depth=2 admin
225GiB 1539 objects gitlab-ci-runner-cache
5.5GiB 142 objects gitlab-dependency-proxy
78GiB 29043 objects gitlab-registry
0B 0 objects network-health
309GiB 30724 objects
During the outage on 2025-03-11, it was:
gitlab-ci-runner-cache 216.6 GiB
gitlab-dependency-proxy 59.7 MiB
gitlab-registry 442.8 GiB
network-health 255.0 GiB
That is:
- the CI runner cache is essentially unchanged
- the dependency proxy is almost 100 times larger, though still tiny in absolute terms
- the GitLab registry was about 5 times larger; it has been cleaned up in tpo/tpa/team#42078, from 440GiB to 40GiB, and has doubled since then, but is getting regularly cleaned up
- the network-health bucket was wiped, but could likely have grown to 1TiB if not 5TiB (see above)
Proposal
The proposal is to set up two new MinIO services backed by hard drives to provide extra storage space. Backups would be covered by MinIO's native bucket versioning, with optional extraction into the standard Bacula backups for more sensitive workloads.
"Warm" hard disk storage
MinIO clusters support a tiered approach which they also call lifecycle management, where objects can be automatically moved between "tiers" of storage. The idea would be to add new servers with "colder" storage. We'd have two tiers:
- "hot": the current
minio-01server, backed by SSD drives, 1TiB - "warm": a new
minio-fsn-02server, backed by HDD drives. 4TiB
The second tier would be a little "tight" in the gnt-fsn
cluster. It's possible we might have to split it up into smaller 2TiB
chunks or use a third tier altogether; see below.
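As a rough illustration of how that could be wired up with the mc client, here is a sketch; the tier name, alias names, credentials and bucket below are placeholders, and the exact flags should be double-checked against the mc version we end up deploying:

    # register the HDD-backed server as a "WARM" remote tier on the hot server
    mc ilm tier add minio minio-01 WARM \
        --endpoint https://minio-fsn-02.torproject.org:9000 \
        --access-key TIER_ACCESS_KEY --secret-key TIER_SECRET_KEY \
        --bucket warm-tier

    # transition objects older than 30 days in network-health to the WARM tier
    mc ilm rule add minio-01/network-health \
        --transition-days 30 \
        --transition-tier WARM

One nice property of this approach is that object metadata stays on the hot server, so clients keep using the same endpoint and bucket names regardless of where the data physically lives.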
MinIO native backups with possible exceptions
We will also explore the possibility of using a third tier for
archival/backups and geographical failover. Because the only HDD
storage we have in gnt-dal is on backup-storage-01, that would have
to be a MinIO service running on that server (possibly labeled
minio-dal-03). Unfortunately, that approach would widen the attack
surface on that server, so we're not sure we're going to take that
direction.
In any case, the proposal is to use MinIO's native server-side bucket replication. The risk with that approach is a catastrophic application logic failure in MinIO, which could then propagate data loss across the cluster.
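For illustration, and assuming the minio-01 and minio-dal-03 aliases from above plus placeholder credentials, setting up such a replication with the mc client could look like this sketch (syntax to be confirmed against the deployed mc version):

    # versioning is a prerequisite for bucket replication, on both ends
    mc version enable minio-01/network-health
    mc version enable minio-dal-03/network-health

    # replicate the bucket to the remote server
    mc replicate add minio-01/network-health \
        --remote-bucket https://REPLICATION_USER:REPLICATION_SECRET@minio-dal-03.torproject.org:9000/network-health \
        --priority 1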
To mitigate that risk, we would offer, on demand, the option to pull more sensitive data into Bacula, possibly through a tool like s3fs-fuse. We'd like to hear from other teams whether this would be a requirement for your workloads, so we can evaluate whether we need to research this topic any further.
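As a minimal sketch of what that extraction could look like, a bucket could be exposed to the existing file-based Bacula jobs through an s3fs-fuse mount; the mount point, credentials file and endpoint below are hypothetical:

    # mount the bucket read-only so the nightly Bacula job can pick it up
    echo 'ACCESS_KEY:SECRET_KEY' > /etc/passwd-s3fs
    chmod 600 /etc/passwd-s3fs
    s3fs network-health /srv/backups/minio/network-health \
        -o url=https://minio-01.torproject.org:9000 \
        -o use_path_request_style \
        -o ro \
        -o passwd_file=/etc/passwd-s3fs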
As mentioned above, a MinIO service on the backup server could allow for an extra 10-16TiB storage for backups.
This part is what will require the most research and experimentation: we need to review and test the upstream deployment architecture, the distributed design, and the tiered approach/lifecycle management described above.
Quotas
We're considering setting up bucket quotas to set expectations on bucket sizes. The goal is to reduce the scope of outages caused by runaway disk usage.
The idea would be for bucket users to commit to a certain size. The total size of quotas across all buckets may be larger than the global allocated capacity for the MinIO cluster, but each individual quota size would need to be smaller than the global capacity, of course.
A good rule of thumb could be that, when a bucket is created, its quota is smaller than half of the current capacity of the cluster. When that quota is hit, half of the leftover capacity could be allocated on top. This is just a heuristic, however: exceptions will have to be made in some cases.
For example, if a hungry new bucket is created and we have 10TiB of capacity in the cluster, its quota would be 5TiB. When it hits that quota, half of the leftover capacity (say 5TiB is left if no other allocation happened) is granted (2.5TiB, bringing the new quota to 7.5TiB).
We would also like to hear from other teams about this. We are proposing the following quotas on existing buckets:
- gitlab-ci-runner-cache: 500GiB (double current size)
- gitlab-dependency-proxy: 10GiB (double current size)
- gitlab-registry: 200GiB (roughly double current size)
- network-health: 5TiB (previously discussed number)
- total quota allocation: ~5.7TiB
This assumes a global capacity of 6TiB: 5TiB in gnt-fsn and 1TiB in
gnt-dal.
And yes, this violates the above rule of thumb, because
network-health is so big. Eventually, we want to develop the
capacity for expansion here, but we need to start somewhere and do not
have the capacity to respect the actual quota policy for
starters. We're also hoping the current network-health quota will be
sufficient: if it isn't, we'll need to grow the cluster capacity
anyway.
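Once the numbers are agreed on, applying them should only take a handful of mc commands, along the lines of the following sketch (the minio-01 alias is a placeholder, and the exact quota subcommand and size syntax should be checked against the mc version we deploy):

    # proposed hard quotas on the existing buckets
    mc quota set minio-01/gitlab-ci-runner-cache  --size 500GiB
    mc quota set minio-01/gitlab-dependency-proxy --size 10GiB
    mc quota set minio-01/gitlab-registry         --size 200GiB
    mc quota set minio-01/network-health          --size 5TiB

With a hard quota, MinIO rejects new writes once the limit is reached, which is exactly the contained-failure behaviour we're after here.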
Affected users
This proposal mainly affects TPA and the Network Health team.
The Network Health team's future use of the object storage server is particularly affected by this proposal, and we're looking for feedback from the team regarding their future disk usage.
GitLab users may also be indirectly affected by expanded use of the object storage mechanisms. Artifacts, in particular, could be moved to object storage, which could improve GitLab Continuous Integration (CI) latency by allowing runners to push their artifacts there directly.
Timeline
- April 2025:
- setup minio-fsn-02 HDD storage server
- disk quotas deployment and research
- clustering research:
- bucket versioning experiments
- disaster recovery and backup/restore instructions
- replication setup and research
- May 2025 or later: (optional) setup secondary storage server in gnt-dal cluster (on gnt-dal or backup-storage-01)
Cost estimates
| Task | Complexity | Uncertainty | Estimate |
|---|---|---|---|
| minio-fsn-02 setup | small (1d) | low | 1.1d |
| disk quotas | small (1d) | low | 1.1d |
| clustering research | large (5d) | high | 10d |
| minio-dal-03 setup | medium (3d) | moderate | 4.5d |
| Total | extra-large (10d) | ~moderate | 16.7d |
Alternatives considered
This appendix documents a few options we have discussed in our research but ended up discarding for various reasons.
Storage expansion
Back in TPA-RFC-56, we discussed the possibility of expanding the
gnt-dal cluster with extra storage. Back then (July 2023), we
estimated the capital expenditures to be around 1800$USD for 20TiB of
storage. This was based on the cost of the Intel® SSD D3-S4510
Series being around 210$USD for 1.92TB and 255$USD for 3.84TB.
As we found out while researching this possibility again in 2025 (issue 41987), the price of the 3.84TB drive has doubled to 520$USD on Amazon. The 1.92TB price increase was more modest, but it's still more expensive, at 277$USD. This could be related to an availability issue with those specific drives, however: the similar D3-S4520 is 235$USD for 1.92TB and 490$USD for 3.84TB.
Still, we're talking about at least double the original budget for this
expansion, so at least 4000$USD for a 10TiB expansion (after RAID),
which is considered too expensive for now. We might still want to
consider getting a couple of 3.84TB drives to give us some breathing
room in the gnt-dal cluster, but this proposal doesn't rely on that
to resolve the primary issues it sets out to address.
Inline filesystem backups
We looked into other solutions for backups. We considered using LVM, BTRFS or ZFS snapshots, but the MinIO folks are pretty adamant that you shouldn't use snapshots underneath MinIO, both for performance reasons and because they consider MinIO itself to be the "business continuity" tool.
In other words, you're not supposed to need backups with a proper MinIO deployment: you're supposed to use replication, along with versioning on the remote server, see the How to Backup and Restore 100PB of Data with Zero RPO and RTO post.
The main problem with such setups (which also affects, e.g., filesystem-based backups like ZFS snapshots) is what happens when a software failure propagates across the snapshot boundary. In this case, MinIO says:
These are reasonable things to plan for - and you can.
Some customers upgrade independently, leaving one side untouched until they are comfortable with the changes. Others just have two sites and one of them is the DR site with one way propagation.
Resiliency is a choice and a tradeoff between budget and SLAs. Customers have a range of choices that protect against accidents.
That is disconcertingly vague. Stating "a range of options" without clearly spelling them out sounds like a cop-out to us. One of the options proposed ("two sites and one of them is the DR site with one way propagation") doesn't address the problem at all. The other option proposed ("upgrade independently") is actually incompatible with the site replication requirement of "same server version" which explicitly states:
All sites must have a matching and consistent MinIO Server version. Configuring replication between sites with mismatched MinIO Server versions may result in unexpected or undesired replication behavior.
We've asked the MinIO people to clarify this by email. They responded pretty quickly with an offer for a real-time call, but we failed to schedule one, and they failed to follow up by email.
We've also looked at LVM-less snapshots with fsfreeze(1) and dmsetup(8), but that requires the filesystem to be unmounted first. That, in turn, could actually be interesting, as it would allow for minimal-downtime backups of a secondary MinIO cluster, for example.
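For reference, the quiesce part of such a backup could be as small as the following sketch; the mount point is hypothetical, and the snapshot step itself is left out since it is the part that requires the remount/unmount dance mentioned above:

    # freeze the filesystem backing a secondary MinIO node just long enough
    # to take a consistent block-level copy or snapshot underneath it
    fsfreeze --freeze /srv/minio
    # ... take the snapshot / block-level copy here ...
    fsfreeze --unfreeze /srv/minio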
We've also considered bcachefs as a supposedly performant BTRFS replacement, but performance results from Phoronix were disappointing and showed usability issues, and another reviewer had data loss problems, so it's clearly not ready for production either.
Other object storage implementation
We have considered whether setting up a second object storage cluster with different software (e.g. Garage) could help avoid certain classes of faults.
This was rejected because it adds a fairly sizeable load on the team, to maintain not just one but two clusters with different setups and different administrative commands.
We acknowledge that, with our proposed setup, a catastrophic server failure implies complete data loss.
Do note, however, that other implementations would not prevent catastrophic operator errors from destroying all data either.