
title: "TPA-RFC-44: Email emergency recovery, phase A"
costs: 1 week to 4 months staff
approval: Executive director, TPA
affected users: torproject.org email users
deadline: "monday", then 2022-12-23
status: standard
discussion: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40981

Summary: scrap the idea of outsourcing our email services and just implement as many fixes to the infrastructure as we can in the shortest time possible, to recover the year-end campaign and CiviCRM deliverability. Also consider a long-term plan, compatible with the emergency measures, to provide quality email services to the community.

Background

In late 2021, TPA adopted OKRs to improve mail services. At first, we took the approach of fixing the mail infrastructure with an ambitious, long term plan of incrementally deploying new email standards like SPF, DKIM, and DMARC across the board. This approach was investigated fully in TPA-RFC-15 but was ultimately rejected as requiring too much time and labour.

So, in TPA-RFC-31, we investigated the other option: outsourcing email services. The idea was to outsource as many mail services as possible, which seemed realistic, especially since we were considering Schleuder's retirement (TPA-RFC-41) and a possible migration from Mailman to Discourse to avoid the possibly painful Mailman upgrade. A lot of effort was poured into TPA-RFC-31 to design what would be the boundaries of our email services and what would be outsourced.

A few things came up that threw a wrench in this plan.

Current issues

This proposal reconsiders the idea of outsourcing email for multiple reasons.

  1. We have an urgent need to fix the mail delivery system backing CiviCRM. As detailed in the Bouncing Emails Crisis ticket, we have gone from a 5-15% bounce rate to nearly 60% in October and November.

  2. The hosting providers that were evaluated in TPA-RFC-15 and TPA-RFC-31 seem incapable of dealing either with the massive mailings we require or the mailbox hosting.

  3. Rumors of Schleuder's and Mailman's demise were grossly overstated. It seems like we will have to both self-host Discourse and Mailman 3 and also keep hosting Schleuder for the foreseeable future, which makes full outsourcing impossible.

Therefore, we wish to re-evaluate the possibility of implementing some emergency fixes to stabilize the email infrastructure, addressing the immediate issues facing us.

Current status

Technically speaking, the current status is unchanged from the one described in the current status section of TPA-RFC-31. A status page update was posted on November 30th 2022.

Proposal

The proposal is to roll back the decision to reject TPA-RFC-15, but instead of re-implementing it as is, focus on emergency measures to restore CiviCRM mass mailing services.

Therefore, the proposal is split into two sections:

  • Emergency changes
  • Long-term improvements

We may adopt only one of those options, obviously.

TPA strongly recommends adopting at least the emergency changes section.

We also believe it is realistic to implement a modest, home-made email service in the long term. Email is a core service in any organisation, and it seems reasonable that TPI might be able to self-host this service for a humble number of users (~100 on tor-internal).

See also the alternatives considered section for other options.

Scope

This proposal affects all inbound and outbound email services hosted under torproject.org. Services hosted under torproject.net are not affected.

It also does not directly address phishing and scamming attacks (issue 40596), but it is hoped that stricter enforcement of email standards will reduce those to a certain extent.

Affected users

This affects all users who interact with torproject.org and its subdomains over email. It particularly affects all "tor-internal" users, users with LDAP accounts, or forwards under @torproject.org.

It especially affects users who send email from their own provider or from a provider other than the submission service. Those users will eventually be unable to send mail with a torproject.org email address.

Emergency changes

In this stage, we focus on a set of short-term fixes which will hopefully improve deliverability significantly in CiviCRM.

By the end of this stage, we'll have adopted standards like SPF, DKIM, and DMARC across the entire infrastructure. Sender rewriting will be used to mitigate the lack of a mailbox server.

SPF (hard), DKIM and DMARC (soft) records on CiviCRM

  1. Deploy DKIM signatures on outgoing mail on CiviCRM

  2. Deploy a "soft" DMARC policy with postmaster@ as a reporting endpoint

  3. Harden the SPF policy to restrict it to the CRM servers and eugeni

This would be done immediately.
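As a rough illustration of the records involved in the steps above, the following sketch composes the TXT record values for a "hard" SPF policy and a "soft" DMARC policy. The IP addresses and the `example.org` reporting address are placeholders, and mapping "hard" to `-all` and "soft" to `p=none` is an assumption about what the proposal intends, not a statement of the deployed configuration.

```python
def spf_record(ip4_hosts, policy="-all"):
    """Build an SPF TXT value that allows only the listed IPv4 senders."""
    mechanisms = [f"ip4:{ip}" for ip in ip4_hosts]
    return " ".join(["v=spf1", *mechanisms, policy])


def dmarc_record(report_address, policy="none"):
    """Build a DMARC TXT value with an aggregate report (rua) endpoint."""
    return f"v=DMARC1; p={policy}; rua=mailto:{report_address}"


# Hypothetical addresses from the RFC 5737 documentation ranges, standing
# in for the CRM servers and eugeni.
ALLOWED_SENDERS = ["192.0.2.10", "192.0.2.20"]

print(spf_record(ALLOWED_SENDERS))
print(dmarc_record("postmaster@example.org"))
```

The DMARC value would be published under the `_dmarc` label of the domain it covers, while the SPF value goes on the domain itself.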

Deploy a new, sender-rewriting, mail exchanger

Configure new "mail exchanger" (MX) server(s) for incoming mail, with TLS certificates signed by a public CA (most likely Let's Encrypt), replacing that part of eugeni.

This would take care of forwarding mail to other services (e.g. mailing lists) but also end-users.

To work around reputation problems caused by SPF records (below), deploy a Sender Rewriting Scheme (SRS) with postsrsd (packaged in Debian) and postforward (not packaged in Debian, but a zero-dependency Go program).
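To make the idea of sender rewriting concrete, here is a minimal sketch of an SRS0-style rewrite: the forwarder replaces the envelope sender with an address in its own domain, embedding the original address, a coarse timestamp and a keyed hash so that bounces can be mapped back. This is a simplification for illustration only; the secret, the `mx.example.org` forwarder name and the hex hash encoding are assumptions, and the output is not claimed to be compatible with what postsrsd generates.

```python
import hashlib
import hmac
import time

SECRET = b"hypothetical-secret"   # assumption: shared only among the forwarders
FORWARDER = "mx.example.org"      # assumption: placeholder forwarder domain
_B32 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"


def _timestamp(now=None):
    """Encode the current day (modulo 1024) as two base32 characters."""
    days = int((time.time() if now is None else now) // 86400) % 1024
    return _B32[days >> 5] + _B32[days & 31]


def _tag(timestamp, domain, local):
    """Short keyed hash proving that this forwarder produced the address."""
    message = (timestamp + domain + local).lower().encode()
    return hmac.new(SECRET, message, hashlib.sha1).hexdigest()[:4]


def srs_forward(sender, now=None):
    """Rewrite an envelope sender so that it belongs to the forwarder."""
    local, domain = sender.rsplit("@", 1)
    ts = _timestamp(now)
    return f"SRS0={_tag(ts, domain, local)}={ts}={domain}={local}@{FORWARDER}"


def srs_reverse(address):
    """Recover the original sender from a rewritten address, checking the hash."""
    local_part = address.rsplit("@", 1)[0]
    prefix, tag, ts, domain, local = local_part.split("=", 4)
    if prefix != "SRS0" or not hmac.compare_digest(tag, _tag(ts, domain, local)):
        raise ValueError("not a valid SRS0 address for this forwarder")
    return f"{local}@{domain}"


rewritten = srs_forward("alice@example.com")
print(rewritten)
print(srs_reverse(rewritten))
```

Because the rewritten envelope sender is in the forwarder's own domain, the receiving provider checks SPF against the forwarder rather than against the original sender's domain. A real deployment would also reject timestamps that are too old, which this sketch leaves out.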

Having it on a separate mail exchanger will make it easier to swap in and out of the infrastructure if problems occur.

The mail exchangers should also sign outgoing mail with DKIM.

DKIM signatures on eugeni

As a stopgap measure, deploy DKIM signatures for egress mail on eugeni. This will ensure that the DKIM records and DMARC policy added for the CRM will not impact mailing lists too badly.

This is done separately from the other mail hosts because of the complexity of the eugeni setup.

DKIM signature on other mail hosts

Same, but for other mail hosts:

  • BridgeDB
  • CiviCRM
  • GitLab
  • LDAP
  • MTA
  • rdsys
  • RT
  • Submission

Deploy SPF (hard), DKIM, and DMARC records for all of torproject.org

Once the above work is completed, deploy SPF records for all of torproject.org pointing to known mail hosts.
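To show why an incomplete list of "known mail hosts" matters once the policy is hard, the sketch below evaluates a sender IP against a simplified SPF record. It only understands `ip4:`/`ip6:` mechanisms and the final `all` term (real SPF also has `include`, `a`, `mx` and others), and the addresses are documentation placeholders rather than actual torproject.org hosts.

```python
import ipaddress


def spf_allows(record, sender_ip):
    """Evaluate only the ip4/ip6 mechanisms and the 'all' term of an SPF record."""
    ip = ipaddress.ip_address(sender_ip)
    for term in record.split()[1:]:  # skip the leading "v=spf1"
        if term.startswith(("ip4:", "ip6:")):
            if ip in ipaddress.ip_network(term[4:], strict=False):
                return "pass"
        elif term.endswith("all"):
            # The qualifier in front of "all" decides the fate of everyone else.
            return {"-": "fail", "~": "softfail", "?": "neutral"}.get(term[0], "pass")
    return "neutral"


RECORD = "v=spf1 ip4:192.0.2.0/28 -all"  # hypothetical hardened record
print(spf_allows(RECORD, "192.0.2.10"))    # a listed mail host
print(spf_allows(RECORD, "198.51.100.7"))  # a host missing from the record
```

With `-all`, any legitimate host that was forgotten in the record fails the check, which is why the hardening is scheduled after the other hosts have been inventoried and configured.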

Long-term improvements

In the long term, we want to clean up the infrastructure and set up proper monitoring.

Many of the changes described here will be required regardless of whether or not this proposal is adopted.

WARNING: this part of the proposal was not adopted as part of TPA-RFC-44 and is deferred to a later proposal.

CiviCRM bounce rate monitoring

We should hook CiviCRM into Prometheus to make sure we have visibility on the bounce rate that is currently manually collated by mattlav.
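As one possible shape for this, the sketch below computes a bounce ratio per mailing and renders it in the Prometheus text exposition format, for example for a textfile collector. The metric name, the label and the input numbers are hypothetical, and how the sent and bounced counts would be extracted from CiviCRM is left open.

```python
def bounce_rate(sent, bounced):
    """Fraction of sent messages that bounced (0.0 when nothing was sent)."""
    return bounced / sent if sent else 0.0


def render_metrics(mailings):
    """Render {mailing name: (sent, bounced)} as Prometheus text exposition."""
    lines = [
        "# HELP civicrm_mailing_bounce_ratio Bounced messages divided by sent messages.",
        "# TYPE civicrm_mailing_bounce_ratio gauge",
    ]
    for name, (sent, bounced) in sorted(mailings.items()):
        ratio = bounce_rate(sent, bounced)
        lines.append(f'civicrm_mailing_bounce_ratio{{mailing="{name}"}} {ratio:.4f}')
    return "\n".join(lines) + "\n"


# Made-up counts, only to exercise the formatting.
print(render_metrics({"newsletter": (1000, 600), "receipts": (200, 10)}))
```

An alert rule could then fire when the ratio stays above a chosen threshold, replacing the manual collation.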

New mail transfer agent

Configure new "mail transfer agent" server(s) to relay mails from servers that do not send their own email, replacing a part of eugeni.

All servers would submit email through this server using mutual TLS authentication, the same way eugeni currently provides this service. It would then relay those emails to the external service provider.

This is similar to the current submission server, except with TLS authentication instead of a password.

This server will be called mta-01.torproject.org and could be horizontally scaled up for availability. See also the Naming things challenge below.

IMAP and webmail server deployment

We are currently already using Dovecot in a limited way on some servers, so we will reuse some of that Puppet code for the IMAP server.

The webmail will likely be deployed with Roundcube, alongside the IMAP server. Both programs are packaged and well supported in Debian. Alternatives like Rainloop or Snappymail could be considered.

Mail filtering is detailed in another section below.

Incoming mail filtering

Deploy a tool for inspection of incoming mail for SPF, DKIM, DMARC records, affecting either "reputation" (e.g. add a marker in mail headers) or just downright rejection (e.g. rejecting mail before queue).

We currently use Spamassassin for this purpose, and we could consider collaborating with the Debian listmasters for the Spamassassin rules. rspamd should also be evaluated as part of this work to see if it is a viable alternative. It has been used to deploy the new mail filtering service at koumbit.org recently.
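The difference between affecting "reputation" and rejecting outright can be sketched as a small decision function over the `Authentication-Results` header that an upstream verifier would add. The policy below (reject on DMARC failure, accept when SPF or DKIM passes, mark otherwise) is only an assumed example and is not how Spamassassin or rspamd score messages.

```python
import re
from email import message_from_string


def auth_verdict(raw_message):
    """Return 'reject', 'accept' or 'mark' based on Authentication-Results."""
    msg = message_from_string(raw_message)
    header = msg.get("Authentication-Results", "")
    # Collect the method=result pairs for the three standards of interest.
    results = dict(re.findall(r"\b(spf|dkim|dmarc)=(\w+)", header))
    if results.get("dmarc") == "fail":
        return "reject"  # e.g. refuse before queue
    if results.get("spf") == "pass" or results.get("dkim") == "pass":
        return "accept"
    return "mark"        # e.g. add a marker header for later scoring


SAMPLE = (
    "Authentication-Results: mx.example.org; spf=pass smtp.mailfrom=example.com;"
    " dkim=none; dmarc=pass\n"
    "From: someone@example.com\n"
    "\n"
    "Hello\n"
)
print(auth_verdict(SAMPLE))
```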

Mailman 3 upgrade

On a new server, build a new Mailman 3 server and migrate mailing lists over. The new server should be added to SPF and have its own DKIM signatures recorded in DNS.

Schleuder bullseye upgrade

Same, but for Schleuder.

End-to-end deliverability checks

End-to-end deliverability monitoring involves:

  • actual delivery roundtrips
  • block list checks
  • DMARC/MTA-STS feedback loops (covered below)

This may be implemented as Nagios or Prometheus checks (issue 40539). This also includes evaluating how to monitor metrics offered by Google postmaster tools and Microsoft (issue 40168).
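For the block list checks, a DNS-based list is usually queried by reversing the octets of the sending IP and prepending them to the list's zone; an answer means the address is listed. The sketch below shows that convention for IPv4, with a placeholder zone name and an injectable resolver so that it can be exercised without network access.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to query: reversed IPv4 octets followed by the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """Return True when the block list answers for this address."""
    try:
        resolve(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False  # no such name: the address is not listed
    return True


# "dnsbl.example.org" is a placeholder zone, not a real block list.
print(dnsbl_query_name("192.0.2.10", "dnsbl.example.org"))
```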

DMARC and MTA-STS reports analysis

DMARC report analysis is also covered by issue 40539, but is implemented separately because it is considered to be more complex (e.g. RBL and e2e delivery checks are already present in Nagios).

This might also include extra work for MTA-STS feedback loops.
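As an illustration of what the analysis could start from, the sketch below tallies one DMARC aggregate report by whether each record passed DKIM or SPF in the `policy_evaluated` section. The embedded XML is a trimmed, made-up sample following the element names of the aggregate report schema; real reports carry more fields, may use XML namespaces, and typically arrive compressed as email attachments, none of which is handled here.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_dmarc_report(xml_text):
    """Count messages per outcome ('pass' or 'fail') in one aggregate report."""
    root = ET.fromstring(xml_text)
    totals = Counter()
    for record in root.iter("record"):
        row = record.find("row")
        count = int(row.findtext("count", default="0"))
        evaluated = row.find("policy_evaluated")
        dkim = evaluated.findtext("dkim", default="fail")
        spf = evaluated.findtext("spf", default="fail")
        # DMARC passes when at least one of the two mechanisms passes.
        totals["pass" if "pass" in (dkim, spf) else "fail"] += count
    return dict(totals)


SAMPLE = """<feedback>
  <record><row><source_ip>192.0.2.10</source_ip><count>5</count>
    <policy_evaluated><disposition>none</disposition><dkim>pass</dkim><spf>fail</spf></policy_evaluated>
  </row></record>
  <record><row><source_ip>198.51.100.7</source_ip><count>2</count>
    <policy_evaluated><disposition>none</disposition><dkim>fail</dkim><spf>fail</spf></policy_evaluated>
  </row></record>
</feedback>"""

print(summarize_dmarc_report(SAMPLE))
```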

eugeni retirement

Once the mail transfer agents, mail exchangers, Mailman and Schleuder servers have been created and work correctly, eugeni is out of work. It can be archived and retired, with an extra long grace period.

Puppet refactoring

Refactor the mail-related code in Puppet, and reconfigure all servers according to the mail relay server change above, see issue 40626 for details. This should probably happen before or at least during all the other long-term improvements.

Cost estimates

Staff

This is an estimate of the time it will take to complete this project, based on the tasks established in the actual changes section. The process follows the Kaplan-Moss estimation technique.

Emergency changes: 10-25 days, 1 day for CiviCRM

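| Task | Estimate | Uncertainty | Total (days) | Note |
|------|----------|-------------|--------------|------|
| CiviCRM records | 1 day | high | 2 | |
| New MX | 1 week | high | 10 | key part of eugeni, might be hard |
| eugeni records | 1 day | extreme | 5 | |
| other records | 2 days | medium | 3 | |
| SPF hard | 1 day | extreme | 5 | |
| Total | 10 days | ~high | 25 | |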
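The totals in the table above can be reproduced with a short calculation: each estimate is multiplied by a factor that depends on its uncertainty level. The multipliers below (1.1, 1.5, 2 and 5) are inferred from the figures in this document's tables rather than quoted from the estimation technique itself, and one week is counted as 5 working days.

```python
# Multipliers inferred from the tables in this proposal (an assumption).
MULTIPLIER = {"low": 1.1, "medium": 1.5, "high": 2.0, "extreme": 5.0}

# (task, estimate in working days, uncertainty level)
EMERGENCY_TASKS = [
    ("CiviCRM records", 1, "high"),
    ("New MX", 5, "high"),
    ("eugeni records", 1, "extreme"),
    ("other records", 2, "medium"),
    ("SPF hard", 1, "extreme"),
]


def padded_days(estimate_days, uncertainty):
    """Scale an estimate by the multiplier for its uncertainty level."""
    return estimate_days * MULTIPLIER[uncertainty]


def totals(tasks):
    """Return (sum of raw estimates, sum of uncertainty-padded estimates)."""
    optimistic = sum(days for _, days, _ in tasks)
    padded = sum(padded_days(days, level) for _, days, level in tasks)
    return optimistic, padded


print(totals(EMERGENCY_TASKS))  # (10, 25.0)
```

This is where the "10-25 days" range comes from: 10 days if every estimate holds, 25 days once uncertainty is factored in.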

Long term improvements: 2-4 months, half mandatory

| Task | Estimate | Uncertainty | Total (days) | Note |
|------|----------|-------------|--------------|------|
| CiviCRM bounce monitoring | 2 days | medium | 3 | |
| New mail transfer agent | 3 days | low | 3.3 | similar to current submission server |
| IMAP/webmail deployment | 2 weeks | high | 20 | may require training to onboard users |
| incoming mail filtering | 1 week | high | 10 | needs research |
| Mailman upgrade | 1 week | high | 10 | |
| Schleuder upgrade | 1 week | high | 10 | |
| e2e deliver. checks | 3 days | medium | 4.5 | access to other providers uncertain |
| DMARC/MTA-STS reports | 1 week | high | 10 | needs research |
| eugeni retirement | 1 day | low | 1.1 | |
| Puppet refactoring | 1 week | high | 10 | |
| Total | 44 days | ~high | ~82 | |

Note that many of the costs listed above will be necessary regardless of whether this proposal is adopted or not. For example, the following tasks are hard requirements:

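| Task | Estimate | Uncertainty | Total (days) |
|------|----------|-------------|--------------|
| CiviCRM bounce monitoring | 2 days | medium | 3 |
| Mailman upgrade | 1 week | high | 10 |
| Schleuder upgrade | 1 week | high | 10 |
| eugeni retirement or upgrade | 1 day | extreme | 5 |
| Puppet refactoring | 1 week | high | 10 |
| Total | 18 days | ~high | 38 |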

Hardware: included

In TPA-RFC-15, we estimated costs to host the mailbox services on dedicated hardware at Hetzner, which added up (rather quickly) to ~22000EUR per year.

Fortunately, in TPA-RFC-43, we adopted a bold migration plan that will provide us with a state of the art, powerful computing cluster in a new location. It should be more than enough to host mailboxes, so hardware costs for this project are already covered by that expense.

Timeline

Ideal

This timeline reflects an ideal (and unrealistic) scenario where one full-time person is continuously assigned to this work and the optimistic cost estimates are realized.

  • W50: emergency fixes, phase 1: DKIM records
  • W51: emergency fixes, phase 2: mail exchanger rebuild
  • W52-W53: monitoring, holidays
  • 2023 W1: monitoring, holidays
  • W2: CiviCRM bounce rate monitoring
  • W3: new MTA
  • W4: e2e deliverability checks
  • W5 (February): DMARC/MTA-STS reports
  • W6-W7: IMAP/webmail deployment
  • W8: incoming mail filtering
  • W9 (March): Mailman upgrade
  • W10: Schleuder upgrade
  • W11: eugeni retirement
  • W12 (April): Puppet refactoring

Realistic

In practice, the long term improvements would probably be delayed until June, possibly even July or August, especially since part of this work overlaps with the new cluster deployment.

However, this more realistic timeline still rushes the emergency fixes in two weeks and prioritizes monitoring work after the holidays.

  • W50: emergency fixes, phase 1: DKIM records
  • W51: emergency fixes, phase 2: mail exchanger rebuild
  • W52-W53: monitoring, holidays
  • 2023 W1: monitoring, holidays
  • W2: CiviCRM bounce rate monitoring
  • W3: new MTA
  • W4, W5-W8 (February): DMARC/MTA-STS reports, e2e deliverability checks
  • W9 (March):
    • incoming mail filtering
    • IMAP/webmail deployment
  • April:
    • Schleuder upgrade
  • May:
    • Mailman upgrade
  • June:
    • eugeni retirement
  • Throughout: Puppet refactoring

Challenges

Staff resources and work overlap

We are already a rather busy team, and the work planned in this proposal overlaps with the work planned in TPA-RFC-43.

It is our belief, however, that we could split the difference in a way that we could allocate some resources (e.g. lavamind) to building the new cluster and other resources (e.g. anarcat, kez) to deploying emergency measures and the new mail services.

TPA-RFC-15 challenges

The infrastructure planned here inherits many of the challenges described in the TPA-RFC-15 proposal, namely:

  • Aging Puppet code base: this is mitigated by focusing on monitoring and emergency (non-Puppet) fixes at first, but issue 40626 remains, of course; note that this is an issue that needs to be dealt with regardless of the outcome of this proposal

  • Incoming filtering implementation: still somewhat of an unknown; although TPA operators have experience setting up spam filtering systems, we're hoping to set up a new tool (rspamd) with which we have less experience; this is mitigated by delaying the deployment of the inbox system to later, and using sender rewriting (or possibly ARC)

  • Security concerns: those remain an issue

  • Naming things: somewhat mitigated in TPA-RFC-31 by using "MTA" or "transfer agent" instead of "relay"

TPA-RFC-31 challenges

Some of the challenges in TPA-RFC-31 apply here as well, of course. In particular:

  • sunk costs: we spent, again, a long time making TPA-RFC-31, and that would go to waste... but on the up side: time spent on TPA-RFC-15 and previous work on the mail infrastructure would be useful again!

  • Partial migrations: we are in the "worst case scenario" that was described in that section, more or less, as we have tried to migrate to an external provider, but none of the ones we had planned for can fix the urgent issue at hand; we will also need to maintain Schleuder and Mailman services regardless of the outcome of this proposal

More delays

As foretold by TPA-RFC-31: Challenges, Delays, we are running out of time. Making this proposal takes time, and deploying yet another strategy will take more time.

It doesn't seem like there is much of an alternative here, however; no clear outsourcing solution seems to be available to us at this stage, and even if one were, it would also take time to deploy.

The key aspect here is that we have a very quick fix we can deploy on CiviCRM to see if our reputation will improve. Then a fast-track strategy allows us, in theory, to deploy those fixes everywhere without rebuilding everything immediately, giving us a 2 week window during which we should be able to get results.

If we fail, then we fall back to outsourcing again, but at least we gave it one last shot.

Architecture diagram

The architecture of the final system proposed here is similar to the one proposed in the TPA-RFC-15 diagram, although it takes it a step further and retires eugeni.

Legend:

  • red: legacy hosts, mostly eugeni services, no change
  • orange: hosts that manage and/or send their own email, no change except the mail exchanger might be the one relaying the @torproject.org mail to it instead of eugeni
  • green: new hosts, might be multiple replicas
  • rectangles: machines
  • triangle: the user
  • ellipse: the rest of the internet, other mail hosts not managed by tpo

Before

current mail architecture diagram

After emergency changes

current mail architecture diagram

Changes in this diagram:

  • added: new mail exchanger
  • changed:
    • "impersonators" now unable to deliver mail as @torproject.org unless they use the submission server

After long-term improvements

final mail architecture diagram

Changes in this diagram:

  • added:
    • MTA server
    • mailman, schleuder servers
    • IMAP / webmail server
  • changed:
    • users forced to use the submission and/or IMAP server
  • removed: eugeni, retired

Personas

Here we collect a few "personas" and try to see how the changes will affect them.

Ariel, the fundraiser

Ariel does a lot of mailing. From talking to fundraisers through their normal inbox to doing mass newsletters to thousands of people on CiviCRM, they get a lot of shit done and make sure we have bread on the table at the end of the month. They're awesome and we want to make them happy.

Email is absolutely mission critical for them. Sometimes email gets lost and that's a huge problem. They frequently tell partners their personal Gmail account address to work around those problems. Sometimes they send individual emails through CiviCRM because it doesn't work through Gmail!

Their email is forwarded to Google Mail and they do not have an LDAP account.

TPA will make them an account that forwards to their current Gmail account, with sender rewriting rules. They will be able to send email through the submission server from Gmail.

They will have the option of migrating to the new IMAP / Webmail service as well.

Gary, the support guy

Gary is the ticket master. He eats tickets for breakfast, then files 10 more before coffee. A hundred tickets is just a normal day at the office. Tickets come in through email, RT, Discourse, Telegram, Snapchat and soon, TikTok dances.

Email is absolutely mission critical, but some days he wishes there could be slightly less of it. He deals with a lot of spam, and surely something could be done about that.

His mail forwards to Riseup and he reads his mail over Thunderbird and sometimes webmail.

TPA will make an account for Gary and send the credentials in an encrypted email to his Riseup account.

He will need to reconfigure his Thunderbird to use the submission and IMAP server after setting up an email password. The incoming mail checks should improve the spam situation across the board, but especially for services like RT.

He will need, however, to abandon Riseup for TPO-related email, since Riseup cannot be configured to relay mail through the submission server.

John, the external contractor

John is a freelance contractor who's really into privacy. He runs his own relays with some cool hacks on Amazon, automatically deployed with Terraform. He typically runs his own infra in the cloud, but for email he just got tired of fighting and moved his stuff to Microsoft's Office 365 and Outlook.

Email is important, but not absolutely mission critical. The submission server doesn't currently work because Outlook doesn't allow you to add just an SMTP server.

John will have to reconfigure his Outlook to send mail through the submission server and use the IMAP service as a backend.

The first emergency measures will be problematic for John as he won't be able to use the submission service until the IMAP server is set up, due to limitations in Outlook.

Nancy, the fancy sysadmin

Nancy has all the elite skills in the world. She can configure a Postfix server with her left hand while her right hand writes the Puppet manifest for the Dovecot authentication backend. She knows her shit. She browses her mail through a UUCP over SSH tunnel using mutt. She has run her own mail server in her basement since 1996.

Email is a pain in the back and she kind of hates it, but she still believes everyone should be entitled to run their own mail server.

Her email is, of course, hosted on her own mail server, and she has an LDAP account.

She will have to reconfigure her Postfix server to relay mail through the submission or relay servers, if she wants to go fancy. To read email, she will need to download email from the IMAP server, although it will still be technically possible to forward her @torproject.org email to her personal server directly, as long as the server is configured to send email through the TPO servers.

Mallory, the director

Mallory also does a lot of mailing. She's on about a dozen aliases and mailing lists from accounting to HR and other obscure ones everyone forgot what they're for. She also deals with funders, job applicants, contractors and staff.

Email is absolutely mission critical for her. She often fails to contact funders and critical partners because state.gov blocks our email (or we block theirs!). Sometimes, she gets told through LinkedIn that a job application failed, because mail bounced at Gmail.

She has an LDAP account and it forwards to Gmail. She uses Apple Mail to read her mail.

For her Mac, she'll need to configure the submission server and the IMAP server in Apple Mail. Like Ariel, it is technically possible for her to keep using Gmail, but with the same caveats about forwarded mail from SPF-hardened hosts.

Like John, this configuration will be problematic after the emergency measures are deployed and before the IMAP server is online, during which time it will be preferable to keep using Gmail.

The new mail relay servers should be able to receive mail from state.gov properly. Because of the better reputation related to the new SPF/DKIM/DMARC records, mail should bounce less (but still may sometimes end up in spam) at Gmail.

Orpheus, the developer

Orpheus doesn't particularly like or dislike email, but sometimes has to use it to talk to people instead of compilers. They sometimes have to talk to funders (#grantlife) and researchers and mailing lists, and that often happens over email. Sometimes email is used to get important things like ticket updates from GitLab or security disclosures from third parties.

They have an LDAP account and it forwards to their self-hosted mail server on an OVH virtual machine.

Email is not mission critical, but it's pretty annoying when it doesn't work.

They will have to reconfigure their mail server to relay mail through the submission server. They will also likely start using the IMAP server, but in the meantime the forwards will keep working, with the sender rewriting caveats mentioned above.

Blipblop, the bot

Blipblop is not a real human being, it's a program that receives mails from humans and acts on them. It can send you a list of bridges (bridgedb), or a copy of the Tor program (gettor), when requested. It has a brother bot called Nagios/Icinga who also sends unsolicited mail when things fail. Both of those should continue working properly, but will have to be added to SPF records and an adequate OpenDKIM configuration should be deployed on those hosts as well.

There's also a bot which sends email when commits get pushed to gitolite. That bot is deprecated and is likely to go away.

Most bots will be modified to send and receive email through the mail transfer agent, although that will be transparent to the bot and handled by TPA at the system level. Those systems will be modified to implement DKIM signing.

Some bots will need to be modified to fetch mail over IMAP instead of being pushed mail over SMTP.

Alternatives considered

Let's see what we could do instead of this proposal.

Multiple (community) providers

In TPA-RFC-31, we evaluated a few proposals to outsource email services to external service providers. We tend to favor existing partners and groups from our existing community, where we have an existing trust relationship. It seems that, unfortunately, none of those providers will do the job on their own.

It may be possible to combine a few providers together, for example by doing mass mailings with Riseup, and hosting mailboxes at Greenhost. It is felt, however, that this solution would be difficult to deploy reliably, and would split the support costs between two organisations.

It would also remove a big advantage of outsourcing email, which is that we have one place to lay the blame if problems occur. If we have two providers, then it's harder to diagnose issues with the service.

Commercial transactional mail providers

We have evaluated a handful of commercial transactional mail providers in TPA-RFC-31 as well. Those are somewhat costly: 200-250$/mth and up, with Mailchimp at the top with 1300$/mth, although to be fair with Mailchimp, they could probably give us a better price if we "contact sales".

Most of those providers try to adhere to the GDPR in one sense or the other. However, when reviewing other privacy policies (e.g. for tpo/tpa/team#40957), I've had trouble figuring out the properties of "processors" and "controllers" of data. In this case, a provider will more likely be a "processor" which puts us in charge of clients' data, but also means they can have "sub-processors" that also have access to the data, and that list can change.

In other words, it's never quite clear who has access to what once we start hosting our data elsewhere. Each of those potential providers has a detailed privacy policy, and their sub-processors have their own policies.

If we, all of a sudden, start using a commercial transactional mail provider to send all CiviCRM mailings, we would have forcibly opted all those 300+ thousand people into all of those privacy policies.

This feels like a serious breach of trust for our organisation, and a possible legal liability. It would at least be a public relations risk, as our reputation could be negatively affected if we make such a move, especially in an emergency, without properly reviewing the legal implications of it.

TPA recommends at least trying to fix the problem in house, then trying a community provider, before ultimately deferring to a commercial provider. Ideally, some legal advice from the board should at least be sought before going ahead with this.

Deadline

Emergency work based on this proposal will be started on Monday unless opposition is expressed before then.

Long term work will start in January unless opposition is expressed before the holidays (December 23rd).

Status

This proposal is currently in the standard state. Only the emergency part of this proposal is considered adopted; the rest is postponed to a further RFC.

References