How to Perform a Successful Incident Postmortem

When a business grows from a startup into a large enterprise, its software systems also grow in complexity, which makes incidents inevitable. These incidents have indirect costs that can include loss of trust in the product or brand. However, incidents are not necessarily a bad thing: they give the business new learning opportunities and the chance to improve its operational practices. But how do we learn from such a failure?

What is a postmortem?

Who should get invites?

  • Individuals/Teams that First Logged the Incident
  • Individuals/Teams that Responded to the Incident
  • Individuals/Teams that Diagnosed the Issue(s)
  • Individuals/Teams that Rectified the Issue(s)

It may also be appropriate to invite a user who was directly affected by the incident. Inviting a range of people like this maintains transparency and allows us to gather as much information as possible to document.

What should get documented?

  • Date and Time that the Incident Started
  • Date and Time that the Incident was First Logged
  • Which Teams/Individuals Responded to the Incident
  • Number of Users/Accounts Affected
  • Number of Support Requests Raised
  • Date and Time that the Incident was Fixed
  • Any Solutions or Mitigations

In addition to recording the above details in a postmortem document, minutes should be taken during the meeting and a timeline of the incident constructed.
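As a rough illustration, the details listed above could be captured in a small structured record, with the timeline derived from it. This is only a sketch: the names `PostmortemRecord` and `TimelineEvent`, the field names and the sample values are all hypothetical, and a real team would adapt them to its own tooling.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEvent:
    """A single timestamped entry in the incident timeline (hypothetical type)."""
    at: datetime
    description: str


@dataclass
class PostmortemRecord:
    """Holds the details listed above (hypothetical field names)."""
    started_at: datetime
    first_logged_at: datetime
    fixed_at: datetime
    responders: list[str]
    users_affected: int
    support_requests: int
    mitigations: list[str] = field(default_factory=list)
    events: list[TimelineEvent] = field(default_factory=list)

    def timeline(self) -> list[TimelineEvent]:
        """Return the key timestamps plus any extra events, in time order."""
        key_events = [
            TimelineEvent(self.started_at, "Incident started"),
            TimelineEvent(self.first_logged_at, "Incident first logged"),
            TimelineEvent(self.fixed_at, "Incident fixed"),
        ]
        return sorted(key_events + self.events, key=lambda e: e.at)


# Made-up sample data, purely for illustration.
record = PostmortemRecord(
    started_at=datetime(2024, 1, 10, 9, 0),
    first_logged_at=datetime(2024, 1, 10, 9, 25),
    fixed_at=datetime(2024, 1, 10, 11, 0),
    responders=["Platform team"],
    users_affected=120,
    support_requests=14,
    mitigations=["Rolled back the release"],
    events=[TimelineEvent(datetime(2024, 1, 10, 9, 40), "On-call engineer paged")],
)

for event in record.timeline():
    print(event.at.isoformat(), event.description)
```

Keeping the key timestamps as fields, rather than only as free text in the minutes, makes it easier to compare incidents later.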

What should you do after the postmortem?

Values recorded should be compared against any service level agreements (SLAs) that may be in place, to determine whether the incident resulted in any breaches. Any issues identified as a result of the incident should be discussed in depth, with potential solutions or mitigations planned into the roadmap, alongside firm delivery dates. These solutions and mitigations should have tickets written to capture the work, each of which should be SMART (specific, measurable, achievable, relevant and time-bound). Depending on the incident, particularly who first logged it and how long it had been ongoing before being logged, improvements to observability may be required. Observability improvements should be a priority alongside immediate solutions to the faults.
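The SLA comparison can be as simple as deriving two durations from the recorded timestamps and checking them against agreed targets. The sketch below assumes two hypothetical targets (30 minutes to detect, 4 hours to resolve) and a hypothetical helper named `check_sla`; real SLAs vary and may measure different things entirely.

```python
from datetime import datetime, timedelta

# Hypothetical SLA targets, chosen only for illustration.
SLA_MAX_TIME_TO_DETECT = timedelta(minutes=30)
SLA_MAX_TIME_TO_RESOLVE = timedelta(hours=4)


def check_sla(started_at: datetime, first_logged_at: datetime, fixed_at: datetime) -> dict:
    """Compare detection and resolution times against the assumed targets."""
    time_to_detect = first_logged_at - started_at
    time_to_resolve = fixed_at - started_at
    return {
        "time_to_detect": time_to_detect,
        "time_to_resolve": time_to_resolve,
        "detect_breached": time_to_detect > SLA_MAX_TIME_TO_DETECT,
        "resolve_breached": time_to_resolve > SLA_MAX_TIME_TO_RESOLVE,
    }


# Made-up timestamps: logged 45 minutes after the start, fixed after 2 hours.
result = check_sla(
    started_at=datetime(2024, 1, 10, 9, 0),
    first_logged_at=datetime(2024, 1, 10, 9, 45),
    fixed_at=datetime(2024, 1, 10, 11, 0),
)

for name, value in result.items():
    print(f"{name}: {value}")
```

In this made-up case the incident was resolved within the assumed target but took too long to detect, which is the kind of result that would point towards observability work rather than a fix to the fault alone.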

If an external user reported the issue, it might be pertinent to publish the postmortem's findings openly, allowing anyone access. Monzo is the most notable example of a business that publishes its postmortem outcomes publicly. Doing so enables them to maintain transparency, ensures accountability as a business and has given their users greater trust in the brand.
