Summary: TODO

Background

Just like for the monitoring system (see TPA-RFC-33), we are now faced with the main mail server becoming unsupported by Debian LTS in June 2024. So we are in need of an urgent operation to upgrade that server.

But, more broadly, we still have all sorts of email delivery problems, mainly due to new provider requirements for deliverability. Email forwarding, the primary mechanism by which we provide email services @torproject.org right now, is particularly unreliable as we fail to deliver email from Gmail to email accounts forwarding to Gmail, for example (tpo/tpa/team#40632, tpo/tpa/team#41524).

We need a plan for email.

History

This is not the first time we have looked at this problem.

In late 2021, TPA adopted OKRs to improve mail services. At first, we took the approach of fixing the mail infrastructure with an ambitious, long term plan (TPA-RFC-15) to deploy new email standards like SPF, DKIM, and DMARC. The proposal was then rejected as requiring too much time and labour.

So, in TPA-RFC-31, we proposed the option of outsourcing email services as much as possible, including retiring Schleuder (TPA-RFC-41) and migrating from Mailman to Discourse to avoid the possibly painful Mailman upgrade. Those proposals were rejected as well (see tpo/tpa/team#40798) as we had too many services to self-host to have a real benefit in outsourcing.

Shortly after this, we had to implement emergency changes (TPA-RFC-44) to make sure we could still deliver email at all. This split the original TPA-RFC-15 proposal in two, a set of emergency changes and a long term plan. The emergency changes were adopted (and mostly implemented) but the long term plan was postponed to a future proposal.

This is that proposal.

Proposal

Requirements

Those are the requirements that TPA has identified for the mail services architecture.

Must have

  • Debian upgrades: we must upgrade our entire fleet to a supported Debian release urgently

  • Email storage: we currently do not offer actual mailboxes for people, which is confusing for new users and impractical for operations

  • Improved email delivery: we have a large number of concerns with email delivery, which often fails, in part due to our legacy forwarding infrastructure and in part due to new provider requirements for deliverability (see Background)

  • Heterogeneous environment: our infrastructure is messy, made of dozens of intermingled services that each have their own complex requirements (e.g. CiviCRM sends lots of emails, BridgeDB needs to authenticate senders), and we cannot retire or alter those services enough to provide us with a simpler architecture. Our email services therefore need to be flexible enough to cover all the current use cases

Nice to have

  • Minimal user disruption: we want to avoid disrupting users' workflows too much, but we want to stress that our users' workflows are currently so diverse that it's hard to imagine providing a unified, reliable service without significant changes for a significant part of the user base

  • "Zero-knowledge" email storage: TPA and TPI currently do not have access to emails at rest, and it would be nice to keep it that way, possibly with mailboxes encrypted with a user-controlled secret, for example

  • Cleaner architecture: our mail systems are some of the oldest parts of the infrastructure and we should use this opportunity to rebuild things cleanly, or at least not worsen the situation

  • Improved monitoring: we should be able to tell when we start failing to deliver mail, before our users do

Non-Goals

  • Authentication improvements: a major challenge in onboarding users right now is that our authentication system is an arcane LDAP server that is hard to use. This proposal doesn't aim to change this, as it seems we've been able to overcome this challenge for the submission server so far. We acknowledge this is a serious limitation, however, and do hope to eventually solve this.

    We should also mention that we've been working on improving userdir-ldap so it can parse emails sent by Thunderbird. In our experience, this has been a terrible onboarding challenge for new users as they simply couldn't operate the email gateway with their email client. The LDAP server remains a significant usability problem, however.

Scope

This proposal affects all inbound and outbound email services hosted under torproject.org. Services hosted under torproject.net are not affected.

It also does not directly address phishing and scamming attacks (issue 40596), but it is hoped that stricter enforcement of email standards will reduce those to a certain extent.

Affected users

This affects all users who interact with torproject.org and its subdomains over email. It particularly affects all "tor-internal" users, users with LDAP accounts, or forwards under @torproject.org.

It especially affects users who send email from their own provider or a provider other than the submission service. Those users will eventually be unable to send mail with a torproject.org email address.

Users on other providers will also be affected, as email they currently receive as forwards will change.

See the Personas section for details.

Emergency changes

Some changes we cannot live without. We strongly recommend prioritizing this work so that we have basic mail services supported by Debian security.

We would normally just do this work, but considering we lack a long term plan, we prefer to fit this in the larger picture, with the understanding that some of this work is wasted as (for example) eugeni is planned to be retired.

Mailman 3 upgrade

Build a new mailing list server to host the upgraded Mailman 3 service. Move old lists over and convert them, keeping the old archives available for posterity.

This includes lots of URL changes and user-visible disruption, and little can be done to work around that necessary change. We'll do our best to come up with redirections and rewrite rules, but ultimately this is a disruptive change.

We are hoping to hook the authentication system with the existing email authentication password, but this is a "nice to have". The priority is to complete the upgrade in a timely manner.

Eugeni in-place upgrade

Once Mailman has been safely moved aside and is shown to be working correctly, upgrade Eugeni using the normal procedures. This should be a less disruptive upgrade, but is still risky because it's such an old box with lots of legacy.

Medium term changes

Those are changes that should absolutely be done, but that can be done after the LTS deadline.

Deploy a new, sender-rewriting, mail exchanger

This step is carried over from TPA-RFC-44, mostly unchanged.

Configure new "mail exchanger" (MX) server(s) for incoming mail, with TLS certificates signed by a public CA (most likely Let's Encrypt), replacing that part of eugeni (tpo/tpa/team#40987). This will hopefully resolve issues with state.gov (tpo/tpa/team#41073, tpo/tpa/team#41287, tpo/tpa/team#40202) and possibly others (tpo/tpa/team#33413).

This would take care of forwarding mail to other services (e.g. mailing lists) but also end-users.

To work around reputation problems with forwards (tpo/tpa/team#40632, tpo/tpa/team#41524), deploy a Sender Rewriting Scheme (SRS) with postsrsd (packaged in Debian, but not in the best shape) and postforward (not packaged in Debian, but a zero-dependency Golang program). It's possible that deploying ARC headers with OpenARC, Fastmail's authentication milter (which apparently works better), or rspamd's arc module might be sufficient as well; this remains to be tested.

Having it on a separate mail exchanger will make it easier to swap in and out of the infrastructure if problems occur.

The mail exchangers should also sign outgoing mail with DKIM.
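To illustrate what sender rewriting does to the envelope sender of a forwarded message, here is a minimal Python sketch of the SRS0 address format. It is a simplified illustration under stated assumptions, not the postsrsd or postforward implementation: the secret, the example addresses, and the helper names are hypothetical, the hash truncation is chosen for brevity, and timestamp expiry checks are omitted.

```python
import base64
import hashlib
import hmac

_BASE32 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"


def _encode_day(day: int) -> str:
    """Encode a day counter (days since the epoch) as two base32 characters."""
    day %= 1024
    return _BASE32[day >> 5] + _BASE32[day & 31]


def _srs_hash(secret: bytes, timestamp: str, domain: str, local: str) -> str:
    """Short keyed hash binding the timestamp, original domain and local part."""
    message = "".join((timestamp, domain, local)).lower().encode()
    digest = hmac.new(secret, message, hashlib.sha1).digest()
    return base64.b64encode(digest).decode()[:4]


def srs_forward(sender: str, forwarder_domain: str, secret: bytes, day: int) -> str:
    """Rewrite an envelope sender into an SRS0 address at forwarder_domain."""
    local, _, domain = sender.rpartition("@")
    timestamp = _encode_day(day)
    tag = _srs_hash(secret, timestamp, domain, local)
    return f"SRS0={tag}={timestamp}={domain}={local}@{forwarder_domain}"


def srs_reverse(address: str, secret: bytes) -> str:
    """Recover the original sender from an SRS0 address, checking the hash."""
    local_part, _, _ = address.rpartition("@")
    prefix, tag, timestamp, domain, local = local_part.split("=", 4)
    if prefix != "SRS0":
        raise ValueError("not an SRS0 address")
    expected = _srs_hash(secret, timestamp, domain, local)
    if not hmac.compare_digest(tag, expected):
        raise ValueError("SRS hash mismatch")
    return f"{local}@{domain}"
```

With this rewriting, a message from alice@example.com forwarded through torproject.org leaves with an envelope sender in the torproject.org domain, so the receiving provider checks our SPF record instead of the original sender's. Bounces come back to us and can be mapped to the original sender with `srs_reverse`.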

Long term changes

Those changes are not purely mandatory, but will make our lives easier in lots of ways. In particular, it will give TPA the capacity to actually provide email services to people we onboard, something which is currently left to the user. It should also make it easier to deliver emails for users, especially internally, as we will control both ends of the mail delivery system.

We might still have trouble delivering email to the outside world, but that should normally improve as well. That is because we will not be forwarding mail to the outside, which basically makes us masquerade as other mail servers, triggering all sorts of issues.

Controlling our users' mailboxes will also allow us to implement stricter storage policies like on-disk encryption and stop leaking confidential data to third parties. It will also allow us to deal with situations like laptop seizures or security intrusions better as we will be able to lock down access to a compromised or vulnerable user, something which is not possible right now.

Mailboxes

We are currently already using Dovecot in a limited way on some servers, but in this project we would deploy actual mailboxes for users.

We should be able to reuse some of our existing Puppet code for this deployment. The hard part is to provide high availability for this service.

High availability mailboxes

In a second phase, we'll take extra care to provide a higher quality of service for mailboxes than our usual service level agreements (SLA). In particular, the mailbox server should be replicated, in near-realtime, to a secondary cluster in an entirely different location. We'll experiment with the best approach for this, but here are the current possibilities:

  • DRBD replication (real-time, possibly large performance impact)
  • ZFS snapshot replication (periodic sync, less performance impact)
  • periodic sync job (doveadm sync or other mailbox sync clients, low frequency periodic sync, moderate performance impact)

The goal is to provide near zero-downtime service (tpo/tpa/team#40604), with rotation procedures that make server rotation a routine part of reboots, so that a total cluster failure is recovered easily.

Three replicas (two in-cluster, one outside) could allow for IP-based redundancy with near-zero downtimes, while DNS would provide cross-cluster migrations with a few minutes downtime.

Mailbox encryption

We should provide at-rest mailbox encryption, so that TPA cannot access people's emails. This could be implemented in Dovecot with the trees plugin written by a core Tor contributor (dgoulet). Alternatively, Stalwart supports OpenPGP-based encryption as well.

Webmail

The webmail will likely be deployed with Roundcube, alongside the IMAP server. Alternatives like Snappymail could be considered.

Webmail HA

Like the main mail server, the webmail server (which should be separate) will be replicated in a "hot-spare" configuration, although that will be done with PostgreSQL replication instead of disk-based configuration.

An active-active configuration might be considered.

Incoming mail filtering

Deploy a tool for inspection of incoming mail for SPF, DKIM, DMARC records, affecting either "reputation" (e.g. add a marker in mail headers) or just downright rejection (e.g. rejecting mail before queue).

We currently use Spamassassin for this purpose (only on RT), and we could consider collaborating with the Debian listmasters for the Spamassassin rules.

However, rspamd should also be evaluated as part of this work to see if it is a viable alternative. It has been used to deploy the new mail filtering service at koumbit.org recently, and seems to be gaining a lot of popularity as the new gold standard. It is particularly interesting that it could serve as a policy daemon in other places that do not actually need to filter incoming mail for delivery, instead signing outgoing mail with ARC/DMARC headers.
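The two outcomes described above (a reputation marker in the headers, or outright rejection before queue) can be sketched as a small policy function. This is only an illustration of the decision, not how Spamassassin or rspamd implement it: the `AuthResults` type, the `decide` function, and the `X-Example-*` header names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class AuthResults:
    """Outcome of the sender checks for one message (values like "pass" or "fail")."""
    spf: str
    dkim: str
    dmarc: str


def decide(results: AuthResults, reject_on_dmarc_fail: bool = False) -> Tuple[str, Dict[str, str]]:
    """Return an action ("accept", "tag" or "reject") and the headers to add."""
    summary = f"spf={results.spf}; dkim={results.dkim}; dmarc={results.dmarc}"
    if results.dmarc == "fail" and reject_on_dmarc_fail:
        # Hard policy: refuse the message before it enters the queue.
        return "reject", {}
    if results.dmarc == "pass":
        return "accept", {"X-Example-Auth": summary}
    # Soft policy: deliver, but leave a marker that later filters can score.
    return "tag", {"X-Example-Auth": summary, "X-Example-Suspect": "yes"}
```

Keeping the rejection behind a flag would let us start in "tag only" mode and switch to rejection once monitoring shows that legitimate mail is not affected.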

End-to-end deliverability checks

End-to-end deliverability monitoring involves:

  • actual delivery roundtrips
  • block list checks
  • DMARC/MTA-STS feedback loops (covered below)

This will be implemented as Prometheus checks (issue 40539). This also includes evaluating how to monitor metrics offered by Google postmaster tools and Microsoft (issue 40168).
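As a rough idea of what a delivery roundtrip check could export, the sketch below renders one probe result in the Prometheus text exposition format. The metric names, the `target` label, and the function name are assumptions made for illustration; the actual checks are to be defined in issue 40539.

```python
from typing import Optional


def render_probe_metrics(target: str, sent_at: float, received_at: Optional[float]) -> str:
    """Render one delivery probe in the Prometheus text exposition format."""
    success = 0 if received_at is None else 1
    lines = [
        "# HELP mail_roundtrip_success Whether the probe message came back (1) or not (0).",
        "# TYPE mail_roundtrip_success gauge",
        f'mail_roundtrip_success{{target="{target}"}} {success}',
    ]
    if received_at is not None:
        # Only report a delay when the probe was actually received.
        lines += [
            "# HELP mail_roundtrip_seconds Time between sending and receiving the probe.",
            "# TYPE mail_roundtrip_seconds gauge",
            f'mail_roundtrip_seconds{{target="{target}"}} {received_at - sent_at}',
        ]
    return "\n".join(lines) + "\n"
```

An alert on `mail_roundtrip_success == 0` for a given target would then tell us that delivery to that provider is failing before users report it.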

DMARC and MTA-STS reports analysis

DMARC report analysis is also covered by issue 40539, but is implemented separately because it is considered to be more complex.

This might also include extra work for MTA-STS feedback loops.
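To give a sense of the work involved, here is a minimal sketch of summarizing a DMARC aggregate report with the Python standard library. The sample report is made up (documentation IP addresses, invented counts) and only includes the elements the sketch reads; real reports carry more metadata, usually arrive compressed, and may use XML namespaces, none of which is handled here.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Dict

# Hypothetical, trimmed-down aggregate report used only for illustration.
SAMPLE_REPORT = """\
<feedback>
  <record>
    <row>
      <source_ip>192.0.2.1</source_ip>
      <count>12</count>
      <policy_evaluated>
        <disposition>none</disposition>
        <dkim>pass</dkim>
        <spf>pass</spf>
      </policy_evaluated>
    </row>
  </record>
  <record>
    <row>
      <source_ip>198.51.100.7</source_ip>
      <count>3</count>
      <policy_evaluated>
        <disposition>none</disposition>
        <dkim>fail</dkim>
        <spf>fail</spf>
      </policy_evaluated>
    </row>
  </record>
</feedback>
"""


def summarize(report_xml: str) -> Counter:
    """Count messages by (source_ip, dkim, spf) in one aggregate report."""
    summary: Counter = Counter()
    for record in ET.fromstring(report_xml).iter("record"):
        row = record.find("row")
        policy = None if row is None else row.find("policy_evaluated")
        if row is None or policy is None:
            continue  # skip records that lack the fields we need
        key = (
            row.findtext("source_ip", default="unknown"),
            policy.findtext("dkim", default="unknown"),
            policy.findtext("spf", default="unknown"),
        )
        summary[key] += int(row.findtext("count", default="0"))
    return summary


def failing_sources(summary: Counter) -> Dict[str, int]:
    """Return message counts per source IP where both DKIM and SPF failed."""
    failures: Dict[str, int] = {}
    for (source_ip, dkim, spf), count in summary.items():
        if dkim != "pass" and spf != "pass":
            failures[source_ip] = failures.get(source_ip, 0) + count
    return failures
```

Sources that show up in `failing_sources` are either hosts of ours that are not yet signing mail correctly, or third parties sending mail in our name.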

Hardened DNS records

We should consider hardening our DNS records. This is a minor, quick change, but one that we can deploy only after monitoring is in place, which is not currently the case.

This should improve our reputation a bit as some providers treat a negative or neutral policy as "spammy".
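The sketch below shows one way to check whether a pair of records counts as "hardened". It assumes that hardening means moving the SPF `all` mechanism to a hard fail (`-all`) and the DMARC policy to `quarantine` or `reject`; the proposal does not spell this out, and the example records and function names are hypothetical.

```python
from typing import Dict, Optional

SPF_QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def spf_all_result(record: str) -> Optional[str]:
    """Return the result of the SPF "all" mechanism, or None if there is none."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("not an SPF record")
    for term in terms[1:]:
        qualifier, mechanism = "+", term  # a missing qualifier means "+"
        if term[0] in SPF_QUALIFIERS:
            qualifier, mechanism = term[0], term[1:]
        if mechanism.lower() == "all":
            return SPF_QUALIFIERS[qualifier]
    return None


def dmarc_tags(record: str) -> Dict[str, str]:
    """Split a DMARC record such as "v=DMARC1; p=none" into its tags."""
    tags: Dict[str, str] = {}
    for part in record.split(";"):
        if "=" in part:
            name, _, value = part.partition("=")
            tags[name.strip().lower()] = value.strip()
    return tags


def is_hardened(spf_record: str, dmarc_record: str) -> bool:
    """True when SPF ends in a hard fail and DMARC asks for quarantine or reject."""
    policy = dmarc_tags(dmarc_record).get("p", "none").lower()
    return spf_all_result(spf_record) == "fail" and policy in ("quarantine", "reject")
```

For example, `is_hardened("v=spf1 mx ~all", "v=DMARC1; p=none")` is false, which is the kind of soft policy that some providers apparently treat as suspicious.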

CiviCRM bounce rate monitoring

We should hook CiviCRM into Prometheus to make sure we have visibility on the bounce rate that is currently manually collated by mattlav.

New mail transfer agent

Configure new "mail transfer agent" server(s) to relay mails from servers that do not send their own email, replacing a part of eugeni.

All servers would submit email through this server using mutual TLS authentication, the same way eugeni currently provides this service. It would then relay those emails to the external service provider.

This is similar to the current submission server, except with TLS authentication instead of password authentication.

This server will be called mta-01.torproject.org and could be horizontally scaled up for availability. See also the Naming things challenge below.
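From the point of view of a client, submitting mail with a TLS client certificate could look like the sketch below, written with the Python standard library. In practice servers would relay through their local Postfix rather than a script; the certificate paths, the port, and the function names are assumptions made for illustration only.

```python
import smtplib
import ssl
from email.message import EmailMessage
from typing import Optional


def build_client_context(certfile: Optional[str] = None, keyfile: Optional[str] = None) -> ssl.SSLContext:
    """Build a TLS context that verifies the server and presents a client certificate."""
    context = ssl.create_default_context()  # checks the server certificate and hostname
    if certfile:
        # The client certificate is what authenticates this host to the MTA.
        context.load_cert_chain(certfile=certfile, keyfile=keyfile)
    return context


def relay(message: EmailMessage, host: str, context: ssl.SSLContext, port: int = 25) -> None:
    """Send one message through the relay over STARTTLS, without a password."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls(context=context)
        smtp.send_message(message)
```

A host would then call something like `relay(msg, "mta-01.torproject.org", build_client_context("/path/to/host.crt", "/path/to/host.key"))`, where both paths are placeholders.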

eugeni retirement

Once the mail transfer agents, mail exchangers, mailman and schleuder servers have been created and work correctly, eugeni is out of work. It can be archived and retired, with an extra long grace period.

Puppet refactoring

Refactor the mail-related code in Puppet, and reconfigure all servers according to the mail relay server change above, see issue 40626 for details. This should probably happen before or at least during all the other long-term improvements.

Cost estimates

Most of the costs of this project are in staff hours, with estimates ranging from 3 to 6 months of work.

Staff

This is an estimate of the time it will take to complete this project, based on the tasks established in the proposal.

Following the Kaplan-Moss estimation technique, as a reminder, we first estimate each task's complexity:

| Complexity  | Time              |
|-------------|-------------------|
| small       | 1 day             |
| medium      | 3 days            |
| large       | 1 week (5 days)   |
| extra-large | 2 weeks (10 days) |

... and then multiply that by the uncertainty:

| Uncertainty Level | Multiplier |
|-------------------|------------|
| low               | 1.1        |
| moderate          | 1.5        |
| high              | 2.0        |
| extreme           | 5.0        |
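The arithmetic behind the estimates in this section is a simple multiplication, sketched here in Python. The two dictionaries copy the complexity and uncertainty values given above; the function names are ours, and the emergency task list assumes each of those tasks is rated "large" with "high" uncertainty, as in the emergency changes estimate.

```python
from typing import List, Tuple

# Complexity and uncertainty values as given in this proposal.
COMPLEXITY_DAYS = {"small": 1, "medium": 3, "large": 5, "extra-large": 10}
UNCERTAINTY_MULTIPLIER = {"low": 1.1, "moderate": 1.5, "high": 2.0, "extreme": 5.0}


def estimate_days(complexity: str, uncertainty: str) -> float:
    """Return the padded estimate, in working days, for one task."""
    return COMPLEXITY_DAYS[complexity] * UNCERTAINTY_MULTIPLIER[uncertainty]


def total_days(tasks: List[Tuple[str, str, str]]) -> float:
    """Sum the padded estimates of (name, complexity, uncertainty) tuples."""
    return sum(estimate_days(complexity, uncertainty) for _, complexity, uncertainty in tasks)


EMERGENCY_TASKS = [
    ("Mailman 3 upgrade", "large", "high"),
    ("eugeni upgrade", "large", "high"),
    ("Sender-rewriting mail exchanger", "large", "high"),
]

if __name__ == "__main__":
    for name, complexity, uncertainty in EMERGENCY_TASKS:
        print(f"{name}: {estimate_days(complexity, uncertainty):g} days")
    print(f"Total: {total_days(EMERGENCY_TASKS):g} days")
```

Each emergency task comes out at 10 working days, for a total of 30 days, or 6 weeks.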

Emergency changes: 3-6 weeks

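| Task                            | Estimate | Uncertainty | Total   |
|---------------------------------|----------|-------------|---------|
| Mailman 3 upgrade               | 1 week   | high        | 2 weeks |
| eugeni upgrade                  | 1 week   | high        | 2 weeks |
| Sender-rewriting mail exchanger | 1 week   | high        | 2 weeks |
| Total                           | 3 weeks  | ~high       | 6 weeks |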

Mailboxes for alpha testers: 5-8 weeks

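| Task                      | Estimate | Days | Uncertainty | Total    | days | Note                                 |
|---------------------------|----------|------|-------------|----------|------|--------------------------------------|
| Mailboxes                 | 1 week   | 5    | low         | 1 week   | 5.5  |                                      |
| Webmail                   | 3 days   | 3    | low         | 3.3 days | 3.3  |                                      |
| incoming mail filtering   | 1 week   | 5    | high        | 2 weeks  | 10   | needs research                       |
| e2e delivery checks       | 3 days   | 3    | medium      | 4.5 days | 4.5  | access to other providers uncertain  |
| DMARC/MTA-STS reports     | 1 week   | 5    | high        | 2 weeks  | 10   | needs research                       |
| CiviCRM bounce monitoring | 1 day    | 1    | medium      | 1.5 days | 1.5  |                                      |
| New mail transfer agent   | 3 days   | 3    | low         | 3.3 days | 3.3  | similar to current submission server |
| eugeni retirement         | 1 day    | 1    | low         | 1.1 days | 1.1  |                                      |
| Total                     | 5 weeks  | 26   | medium      | 8 weeks  | 39.2 |                                      |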

High availability and general availability: 5-9 weeks

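| Task               | Estimate | Days | Uncertainty | Total    | Days |
|--------------------|----------|------|-------------|----------|------|
| Mailbox encryption | 1 week   | 5    | medium      | 7.5 days | 7.5  |
| Mailboxes HA       | 2 weeks  | 10   | high        | 4 weeks  | 20   |
| Webmail HA         | 3 days   | 3    | high        | 1 week   | 6    |
| Puppet refactoring | 1 week   | 5    | high        | 2 weeks  | 10   |
| Total              | 5 weeks  | 23   | high        | 9 weeks  | 43.5 |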

Hardware: included

In TPA-RFC-15, we estimated costs to host the mailbox services on dedicated hardware at Hetzner, which added up (rather quickly) to ~22000EUR per year.

Fortunately, in TPA-RFC-43, we adopted a bold migration plan that provided us with a state of the art, powerful computing cluster in a new location. It is more than enough to host mailboxes, so hardware costs for this project are already covered by that expense, assuming we still fit inside 1TB of storage (10GB mailbox size on average, with 100 mailboxes).

Timeline

The following section details timelines of how this work could be performed over time. A "utopian" timeline is established just to be knocked down, and then a more realistic (but still somewhat optimistic) scenario is proposed.

Utopian

This timeline reflects an ideal (and non-realistic) scenario where one full time person is assigned continuously to this work, starting in August 2024, and where the optimistic cost estimates are realized.

  • W31: emergency: Mailman 3 upgrade
  • W32: emergency: eugeni upgrade
  • W33-34: sender-rewriting mail exchanger
  • end of August 2024: critical mid-term changes implemented
  • W35: mailboxes
  • W36 (September 2024): webmail, end-to-end deliverability checks
  • W37: incoming mail filtering
  • W38: DMARC/MTA-STS reports
  • W39: new MTA, CiviCRM bounce rate monitoring
  • W40: eugeni retirement
  • W41 (October 2024): Puppet refactoring
  • W42: Mailbox encryption
  • W43-W44: Webmail HA
  • W45-W46 (November 2024): Mailboxes HA

Having the Puppet refactoring squeezed in at the end there is particularly unrealistic.
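The "W31" style markers in these timelines can be mapped to calendar dates with a few lines of Python. This assumes the week numbers follow ISO 8601 numbering, which matches the month annotations in the list above; the function name is ours.

```python
from datetime import date, timedelta
from typing import Tuple


def week_range(year: int, week: int) -> Tuple[date, date]:
    """Return the Monday and Sunday of an ISO 8601 week."""
    monday = date.fromisocalendar(year, week, 1)
    return monday, monday + timedelta(days=6)
```

Under that assumption, `week_range(2024, 31)` runs from 29 July to 4 August 2024, and W46 starts on 11 November 2024.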

More realistic

In practice, the long term mailbox project will most likely be delayed to somewhere in 2025.

This more realistic timeline still rushes in emergency and mid-term changes to improve quality of life for our users.

In this timeline, the most demanding users will be able to migrate to TPA-hosted email infrastructure by June 2025, while others will be able to progressively adopt the service earlier, in September 2024 (alpha testers) and April 2025 (beta testers).

Emergency changes: Q3 2024

  • W31: emergency: Mailman 3 upgrade
  • W32: emergency: eugeni upgrade
  • W33-34: sender-rewriting mail exchanger
  • end of August 2024: critical mid-term changes implemented

Mailboxes for alpha testers: Q4 2024

  • September-October 2024:
    • W35: mailboxes
    • W36: webmail
    • W37: end-to-end deliverability checks
    • W38-W39: incoming mail filtering
    • W40-W44: monitoring, break for other projects
  • November-December 2024:
    • W45-W46: DMARC/MTA-STS reports
    • W47: new MTA, CiviCRM bounce rate monitoring
    • W48: eugeni retirement
    • W49-W1: monitoring, break for holidays
  • Throughout: Puppet refactoring

HA and general availability: 2025

  • January-March 2025: break
  • April 2025: Mailbox encryption
  • May 2025: Webmail HA in testing
  • June 2025: Mailboxes HA in testing
  • September/October 2025: Mailboxes/Webmail HA general availability

Challenges

This proposal brings a number of challenges and concerns that we have considered before bringing it forward.

Staff resources and work overlap

We are already a rather busy team, and the work planned in this proposal overlaps with the work planned in TPA-RFC-33. We've tried to stage the work over the course of a year (or more, in fact) but the emergency work is already too late and will compete with the other proposal.

We do, however, have to deal with this emergency, and we would much rather have a clear plan on how to move forward with email, even if that means we can't execute this for months, if not years, until things calm down and we get capacity. We have designed the tasks to be independent from each other as much as possible and much of the work can be done incrementally.

TPA-RFC-15 challenges

The infrastructure planned shares many of the challenges described in the TPA-RFC-15 proposal, namely:

  • Aging Puppet code base: this is mitigated by focusing on monitoring and emergency (non-Puppet) fixes at first, but issue 40626 ("cleanup the postfix code in puppet") remains, of course; note that this is an issue that needs to be dealt with regardless of the outcome of this proposal

  • Incoming filtering implementation: still somewhat of an unknown; although TPA operators have experience setting up spam filtering systems, we're hoping to set up a new tool (rspamd) with which we have less experience; this is mitigated by delaying the deployment of the inbox system to later, and using sender rewriting (or possibly ARC)

  • Security concerns: those remain an issue. They are two-fold: lack of 2FA and extra confidentiality requirements due to hosting people's emails, which could be mitigated with mailbox encryption

  • Naming things: somewhat mitigated in TPA-RFC-31 by using "MTA" or "transfer agent" instead of "relay"

TPA-RFC-31 challenges

Some of the challenges in TPA-RFC-31 apply here as well, of course. In particular:

  • sunk costs: we spent, again, a long time making TPA-RFC-31, and that would go to waste... but on the up side: time spent on TPA-RFC-15 and previous work on the mail infrastructure would be useful again!

  • Partial migrations: we are in the "worst case scenario" that was described in that section, more or less, as we have tried to migrate to an external provider, but none of the ones we had planned for can fix the urgent issue at hand; we will also need to maintain Schleuder and Mailman services regardless of the outcome of this proposal

Still more delays

As foretold by TPA-RFC-31: Challenges, Delays and TPA-RFC-44: More delays, we're now officially late.

We don't seem to have much of a choice, at least for the emergency work. We must perform this upgrade to keep our machines secure.

For the long term work, it will take time to rebuild our mail infrastructure, but we prefer to have a clear, long-term plan to the current situation where we are hesitant in deploying any change whatsoever because we don't have a design. This hurts our users and our capacity to help them.

It's possible we fail at providing good email services to our users. If we do, then we fall back to outsourcing mailboxes, but at least we gave it one last shot and we don't feel the costs are so prohibitive that we should just not try.

User interface changes

Self-hosting, when compared to commercial hosting services like Gmail, suffers from significant usability challenges. Gmail, in particular, has acquired a significant mind-share of how email should even work in the first place. Users will be somewhat jarred by the change and frustrated by the unfamiliar interface.

One mitigation for this is that we still allow users to keep using Gmail. It's not ideal, because we keep a hybrid design and we still leak data to the outside, but we prefer this to forcing people into using tools they don't want.

Architecture diagram

TODO: rebuild architecture diagrams, particularly add a second HA stage and show the current failures more clearly, e.g. forwards

The architecture of the final system proposed here is similar to the one proposed in the TPA-RFC-15 diagram, although it takes it a step further and retires eugeni.

Legend:

  • gray: legacy host, mostly eugeni services, split up over time and retired
  • orange: delivery problems with the current infrastructure
  • green: new hosts, MTA and mx can be trivially replicated
  • rectangles: machines
  • triangle: the user
  • ellipse: the rest of the internet, other mail hosts not managed by tpo

Before

current mail architecture diagram

After long-term improvements

final mail architecture diagram

Changes in this diagram:

  • added:
    • MTA server
    • mailman, schleuder servers
    • IMAP / webmail server
  • changed:
    • users forced to use the submission and/or IMAP server
  • removed: eugeni, retired

TODO: ^^ redo summary

TODO: dotted lines are warm failovers, not automatic, might have some downtime, solid lines are fully highly available, which means mails like X will always go through and mails like Y might take a delay during maintenance operations or catastrophic downtimes

TODO: redacted hosts include...

TODO: HA failover workflow

TODO: spam and non-spam flows cases

Personas

Here we collect a few "personas" and try to see how the changes will affect them, largely derived from TPA-RFC-44.

We sort users in three categories:

  • alpha tester
  • beta tester
  • production user

We assigned personas to each of those categories, but individual users could opt in or out of any category as they wish. By default, everyone is a production user unless otherwise mentioned.

In italic is the current situation for those users, and what follows are the changes they will go through.

Note that we assume all users have an LDAP account, which might be inaccurate, but this is an evolving situation we've been so far dealing with successfully, by creating accounts for people that lack them and doing basic OpenPGP training. So that is considered out of scope of this proposal for now.

Alpha testers

Those are technical users who are ready to test development systems and even help fix issues. They can tolerate email loss and delays.

Nancy, the fancy sysadmin

Nancy has all the elite skills in the world. She can configure a Postfix server with her left hand while her right hand writes the Puppet manifest for the Dovecot authentication backend. She browses her mail through a UUCP over SSH tunnel using mutt. She runs her own mail server in her basement since 1996.

Email is a pain in the back and she kind of hates it, but she still believes everyone is entitled to run their own mail server.

Her email is, of course, hosted on her own mail server, and she has an LDAP account. She has already reconfigured her Postfix server to relay mail through the submission servers.

She might try hooking up her server into the TLS certificate based relay servers.

To read email, she will need to download email from the IMAP server, although it will still be technically possible to forward her @torproject.org email to her personal server directly.

Orpheus, the developer

Orpheus doesn't particularly like or dislike email, but sometimes has to use it to talk to people instead of compilers. They sometimes have to talk to funders (#grantlyfe), external researchers, teammates or other teams, and that often happens over email. Sometimes email is used to get important things like ticket updates from GitLab or security disclosures from third parties.

They have an LDAP account and it forwards to their self-hosted mail server on an OVH virtual machine. They have already reconfigured their mail server to relay mail over SSH through the jump host, to the surprise of the TPA team.

Email is not mission critical, and it's kind of nice when it goes down because they can get in the zone, but it should really be working eventually.

They will likely start using the IMAP server, but in the meantime the forwards should keep working, although with some header and possibly sender mangling.

Note that some developers may instead be beta testers or even production users. We're not forcibly including all developers in testing this system: this is opt-in.

Beta testers

Those are power users who are ready to test systems before launch, but can't necessarily fix issues themselves. They can file good bug reports. They can tolerate email delays and limited data loss, but hopefully all will go well.

Gary, the support guy

Gary is the ticket overlord. He eats tickets for breakfast, then files 10 more before coffee. A hundred tickets is just a normal day at the office. Tickets come in through email, RT, Discourse, Telegram, Snapchat and soon, TikTok dances.

Email is absolutely mission critical, but some days he wishes there could be slightly less of it. He deals with a lot of spam, and surely something could be done about that.

His mail forwards to Riseup and he reads his mail over Thunderbird and sometimes webmail. Some time after TPA-RFC-44, Gary managed to finally get an OpenPGP key set up and TPA made him an LDAP account so he can use the submission server. He has already abandoned the Riseup webmail for TPO-related email, since it cannot relay mail through the submission server.

He will need to reconfigure his Thunderbird to use the new IMAP server. The incoming mail checks should improve the spam situation across the board, but especially for services like RT.

John, the external contractor

John is a freelance contractor that's really into privacy. He runs his own relays with some cool hacks on Amazon, automatically deployed with Terraform. He typically runs his own infra in the cloud, but for email he just got tired of fighting and moved his stuff to Microsoft's Office 365 and Outlook.

Email is important, but not absolutely mission critical. The submission server doesn't currently work because Outlook doesn't allow you to add just an SMTP server. John does have an LDAP account, however.

John will have to reconfigure his Outlook client to use the new IMAP service which should allow him to send mail through the submission server as well.

He might need to get used to the new Roundcube webmail service or an app when he's not on his desktop.

Blipblop, the bot

Blipblop is not a real human being, it's a program that receives mails and acts on them. It can send you a list of bridges (bridgedb), or a copy of the Tor program (gettor), when requested. It has a brother bot called Nagios/Icinga who also sends unsolicited mail when things fail.

There are also bots that send email when commits get pushed to some secret git repositories.

Bots should generally continue working properly, as long as they use the system MTA to deliver email.

Some bots currently performing their own DKIM validation will delegate this task to the new spam filter, which will optionally reject mail unless they come from an allow list of domains with a valid DKIM signature.

Some bots will fetch mail over IMAP instead of getting email piped in through standard input.

Production users

Production users can tolerate little down time and certainly no data loss. Email is mission critical and has high availability requirements. They're not here to test systems, but to work on other things.

Ariel, the fundraiser

Ariel does a lot of mailing. From talking to fundraisers through their normal inbox to doing mass newsletters to thousands of people on CiviCRM, they get a lot done and make sure we have bread on the table at the end of the month. They're awesome and we want to make them happy.

Email is absolutely mission critical for them. Sometimes email gets lost and that's a major problem. They frequently tell partners their personal Gmail account address to work around those problems. Sometimes they send individual emails through CiviCRM because it doesn't work through Gmail!

Their email forwards to Google Mail and they now have an LDAP account to do that mysterious email delivery thing now that Google requires ... something.

They should still be able to send email through the submission server from Gmail, as they currently do, but this might be getting harder and harder.

They will have the option of migrating to the new IMAP / Webmail service as well, once TPA deploys high availability. If they do not, they will use the new forwarding system, possibly with header and sender mangling which might be a little confusing.

They might receive a larger amount of spam than what they were used to at Google. They will need to install another app on their phone to browse the IMAP server to replace the Gmail app. They will also need to learn how to use the new Roundcube Webmail service.

Mallory, the director

Mallory also does a lot of mailing. She's on about a dozen aliases and mailing lists from accounting to HR and other unfathomable things. She also deals with funders, job applicants, contractors, volunteers, and staff.

Email is absolutely mission critical for her. She often fails to contact funders and critical partners because state.gov blocks our email -- or we block theirs! Sometimes, she gets told through LinkedIn that a job application failed, because mail bounced at Gmail.

She has an LDAP account and it forwards to Gmail. She uses Apple Mail to read her mail.

For her Mac, she'll need to configure the IMAP server in Apple Mail. Like Ariel, it is technically possible for her to keep using Gmail, but with the same caveats about forwarded mail.

The new mail relay servers should be able to receive mail from state.gov properly. Because of the better reputation that comes with the new SPF/DKIM/DMARC records, mail should bounce less at Gmail (but may still sometimes end up in spam).

Like Ariel and John, she will need to get used to the new Roundcube webmail service and mobile app.

Alternatives considered

External email providers

When rejecting TPA-RFC-31, anarcat wrote:

I currently don't see any service provider that can serve all of our email needs at once, which is what I was hoping for in this proposal. the emergency part of TPA-RFC-44 (#40981) was adopted, but the longer part is postponed until we take into account the other requirements that popped up during the evaluation. those requirements might or might not require us to outsource email mailboxes, but given that:

  • we have more mail services to self-host than I was expecting (schleuder, mailman, possibly CiviCRM), and...

  • we're in the middle of the year end campaign and want to close project rather than start them

... I am rejecting this proposal in favor of a new RFC that will discuss, yes, again, a redesign of our mail infrastructure, taking into account the schleuder and mailman hosting, 24/7 mailboxes, mobile support, and the massive requirement of CiviCRM mass mailings.

The big problem we have right now is that we have such a large number of mail servers that hosting mailboxes seems like a minor challenge in comparison. The biggest challenge is getting the large number of emails CiviCRM requires delivered reliably, and for that no provider has stepped up to help.

Hosting mailboxes reliably will be a challenge, of course, and we might eventually start using an external provider for this, but for now we're operating under the assumption that most of our work is spent dealing with all those small services anyways, and adding one more on top will not significantly change this pattern.

The TPA-RFC-44: alternatives considered section actually went into details for each external hosting provider (community and commercial), and those comments are still considered valid.

In-place Mailman upgrade

We have considered upgrading Mailman directly on eugeni, by upgrading the entire box to bullseye at once. This feels too risky: if there's a problem with the upgrade, all lists go down and recovery is difficult.

It feels safer to start with a new host and import the lists there, which is how the upgrade works anyways, even when done on the same machine. It also allows us to separate that service, cleaning up the configuration a little bit and moving more things into Puppet.

Postfix / Dovecot replacements

We are also aware of a handful of mail software stacks emerging as replacements for the ad-hoc Postfix / Dovecot standard.

We know of the following:

  • maddy - IMAP/SMTP server, mail storage is "beta", recommends Dovecot
  • magma - SMTP/IMAP, lavabit.com backend, C
  • mailcow - wrapper around Dovecot/Postfix, not relevant
  • mailinabox - wrapper around Dovecot/Postfix, not relevant
  • mailu - wrapper
  • postal - SMTP-only sender
  • sympl.io - wrapper around Dovecot/Exim, similar
  • sovereign - yet another wrapper
  • Stalwart - JMAP, IMAP, Rust, built-in spam filtering, OpenPGP and S/MIME encryption, DMARC, SPF, DKIM, ARC, Sieve, web-based control panel; promising, maybe too much so? No TPA staff has experience with it. Could be used for a high availability setup as it can use PostgreSQL and S3 for storage. Not 1.x yet but considered production ready
  • xmox - no relay support or 1.x release, seems like a one-man project

Harden mail submission server

The mail submission server currently accepts incoming mail from any authenticated user, with any From header, which is probably a mistake. Fixing this is currently considered out of scope for this proposal, but could be implemented if it fits conveniently with other tasks (the spam filter, for example).
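To illustrate what such hardening could check, here is a minimal sketch in Python, not a description of how the submission server works or would be configured. It assumes a hypothetical mapping (`ALLOWED_SENDERS`) from authenticated login to the addresses that login may use, and a hypothetical helper (`from_header_allowed`) that compares the From header against it. The logins and addresses are made up for the example.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical mapping: authenticated login -> addresses that login may send as.
# In a real deployment this would presumably come from the user directory.
ALLOWED_SENDERS = {
    "ariel": {"ariel@example.org"},
    "mallory": {"mallory@example.org", "director@example.org"},
}


def from_header_allowed(login: str, raw_message: str) -> bool:
    """Return True if the From header address is one the login may use."""
    msg = message_from_string(raw_message)
    # parseaddr() splits 'Name <addr>' into (name, addr); addr is '' if missing.
    _, address = parseaddr(msg.get("From", ""))
    # Unknown logins and missing From headers are both rejected.
    return address.lower() in ALLOWED_SENDERS.get(login, set())
```

A submission-time policy hook could call `from_header_allowed(login, message)` and reject the message when it returns `False`. Note that this sketch only looks at the first From header and ignores the envelope sender, which a real policy would likely need to check as well.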

References

Appendix

Current issues and their solutions

TODO: go through the "improve mail services" milestone and extract classes of issues, document their solutions here