Summary: adopt an incident response procedure and templates, use them more systematically.

Background
Proposal
Examples
- GitLab downtime incident
- DNSSEC outage
Alternatives considered
References

Background

Since essentially forever, our incident response procedures have been quite informal, based mostly on hunches and judgement of staff during high stress situations.

This makes those situations more difficult and stressful than they already are. It's also hard to follow up on issues in a consistent manner.

Last week, we had three more incidents that spurred anarcat into formalizing this process a little bit. The first objective was to make a post-mortem template that could be used to write some notes after an incident, but it grew into a more complete incident response procedure.

Proposal

The proposal consists of:

A template

This is a GitLab issue template (.gitlab/issue_templates/Incident.md) that gets used when you create an incident in GitLab or when you pick the Incident template when reporting an issue.

It reuses useful ideas from previous incidents like having a list of dashboards to check and a checklist of next steps, but also novel ideas like clearer roles of who does what.

It also includes a full post-mortem template while still trying to keep the whole thing lightweight.

This proposal does not set the template in stone; we merely state here that we need such a template. Further updates can be made to the template without going through an RFC process, naturally. The first draft of this template is in merge request tpo/tpa/team!1.

A process

The process is the companion document to the template. It expands on what each role does, mostly, and spells out general principles. It lives in the howto/incident-response page which is the generic "in case of fire" entry point in our documentation.

The first draft of this process is in merge request !86 in the wiki. It includes:
- the principle of filing and documenting issues as we go
- getting help
- Operations, Communications, Planning and Commander roles imported from the Google SRE book.
- writing a post-mortem for larger incidents

This is made into a formal proposal to bring attention to these new mechanisms, offer a space for discussion, and make sure we at least try to use these procedures, in particular the issue template, during the next incidents.

Feedback is welcome either in the above merge requests, in the discussion issue, or by email.

Examples

These are examples of incidents that happened before this proposal was adopted, but that more or less followed the proposed procedure.

GitLab downtime incident

In tpo/tpa/team#42218, anarcat was working on an outage with GitLab when he realized the situation was so severe that it warranted a status site update. He turned to lelutin and asked him to jump on communications.

Ultimately, the documentation around that wasn't sufficient, and because GitLab was down, updates to the site were harder, but lelutin learned how to post updates to the status site without GitLab and the incident resolved nicely.

DNSSEC outage

In tpo/tpa/team#42308, a DNSSEC rotation went wrong and caused widespread outages in internal DNS resolvers, which affected many services and caused a lot of confusion.

groente was, at first, in lead with anarcat doing planning, but eventually anarcat stepped in to delegate communications to lelutin and take over lead while groente kept hacking at the problem in the background.

lelutin handled communications with others on IRC and in issues, while anarcat kept the list of "next steps" up to date and wrote most of the post-mortem, which was amended by groente. Many issues were opened or linked as follow-ups to improve the situation next time.

Alternatives considered

Other policies

There are of course many other incident response policies out there. We were inspired at least partly by some of those:

- Google SRE book: roles come from here, general principles quoted directly
- Got game? Secrets of great incident management
- PagerDuty incident response documentation

No logs, no master, no commander?

A lot of consideration has been given to the title "Commander". The term was first proposed as is from the Google SRE book. But according to Wikipedia:

Commander [...] is a common naval officer rank as well as a job title in many armies. Commander is also used as a [...] title in other formal organizations, including several police forces. In several countries, this naval rank is termed as a frigate captain.

Commander is also a generic term for an officer commanding any armed forces unit, such as "platoon commander", "brigade commander" and "squadron commander". In the police, terms such as "borough commander" and "incident commander" are used.

We therefore need to acknowledge the fact that the term originally comes from the military, which is not typically how we like to organize our work. This raised a lot of eyebrows in the review of this proposal, as we prefer to work by consensus, leading by example and helping each other.

But we recognized that, in an emergency, deliberation and consensus building might be impossible. We must delegate power to someone who will make the tough decisions, and it's necessary to have a single person at the helm, a bit like you have a single person on "operations" changing the systems at any one time, or a single person driving a car or a bus in real life.

The commander, however, is also useful because they are typically a person already in a position of authority in relation to other political units, either inside or outside the organisation. This puts the commander in a better position than others to remove blockers. Note that this often means the person for the role is the Team Lead, especially if politics are involved, but we do not want the Team Lead handling all incidents.

In fact, the best person in Operations (and therefore, by default, Lead) is likely to be the available person who is the most familiar with the system at hand. It also must be clear that the roles can and should be rotated, especially if the people holding them become tired or seem to be causing more trouble than they are worth, just like an aggressive or dangerous driver should be taken off the wheel.

Furthermore, it must be understood that the Incident Lead is not supposed to continuously interfere with Operations, once that role has been delegated: this is not a micro-management facility, it's a helper, un-blocker, tie-breaker role.
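
The delegation model described above can be sketched in a few lines of Python. This is only an illustration, not part of the proposed process or template: the `Incident` class, its method names and the people's names are all hypothetical, and the sketch assumes that any role not explicitly delegated stays with the Incident Lead.

```python
from dataclasses import dataclass, field

# Roles that the Incident Lead can hand off (names taken from the process).
ROLES = ("operations", "communications", "planning")


@dataclass
class Incident:
    """Hypothetical model of who holds which role during one incident."""

    lead: str
    delegated: dict[str, str] = field(default_factory=dict)

    def delegate(self, role: str, person: str) -> None:
        # Hand a single role to someone else; the Lead keeps the rest.
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.delegated[role] = person

    def holder(self, role: str) -> str:
        # Assumption: a role that was never delegated stays with the Lead.
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        return self.delegated.get(role, self.lead)

    def rotate_lead(self, person: str) -> None:
        # Roles can be rotated, for example when the current Lead is tired.
        # Earlier delegations are left untouched.
        self.lead = person


# Hypothetical usage: alice starts out holding every role.
incident = Incident(lead="alice")
incident.delegate("communications", "bob")
incident.rotate_lead("carol")
incident.delegate("operations", "alice")
```

In this sketch, changing the Lead does not undo earlier delegations, which mirrors the idea that the Lead is a helper and tie-breaker rather than someone who takes Operations back by default.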

We briefly considered using a more modest term like the captain of a ship. Having had some experience sailing on ships, anarcat has in particular developed a deeper appreciation of that role in life-threatening situations, where the Captain (or Skipper) not only has authority but also the skills and thorough knowledge of the ship.

Other terms we considered were:

- "coordinator": can too easily be confused with the Planning role, and hides the fact that the person needs to actually make executive decisions at times
- "facilitator": similar problems to coordinator, but worse: an even "softer" name that removes essentially all power from the role, while we must delegate some power to the role

We liked the term Incident Commander because it is well-known terminology used inside our industry (for example at Google) and outside it (at FEMA, among firefighters, in medical emergencies and so on). The term was therefore not used in its military sense, but in a civilian context.

We also had concerns that, if someone would onboard in TPA and find the "Incident Command" terminology during an emergency, they would be less likely to understand what is going on than if they find another site-specific term.

The term also maps more cleanly to a noun and a verb (a "Commander" is in "Command" and "Commands") than "Captain" does (which would map, presumably, to the verb "Captain" and not really to any noun but "Command").

Ultimately, the discomfort with the introduction of a military term was too great to be worth it, and we picked the "Incident Lead" role, with the understanding it's not necessarily the Team Lead that inherits the residual Lead role after all delegations, and even less the Team Lead that handles all incidents from the start, naturally.
