The anatomy of a great playbook entry

What if you could easily reduce the length of outages by 3X?

According to the SRE book, “recording the best practices ahead of time in a playbook produces roughly a 3x improvement in MTTR”. This improvement mirrors my experience with well-written playbooks.

So what makes a playbook entry “great”?


Remember how you felt in your first on-call rotation, when you were paged at 3am for a system you barely understood? Write your playbook entries for that person.

Playbooks should provide just enough context to confidently work through an incident, without providing extraneous content that will be a burden to keep up-to-date.

Be wary of playbooks that offer exact remediation steps: these are often a sign of sacrificing human blood to a system that should be automated.


Alerts should always include the relevant playbook URL. Otherwise, you will introduce human error by introducing the possibility of the responder following the incorrect playbook.

Consider including the alert name in the playbook URL to make it easier to find. This also the alert template to be templatized in some systems. For example: https://playbooks/%%ALERT_NAME%%


Playbooks are the easiest to scan through in an emergency when they have a consistent structure. The exact best structure may differ depending on the organization, but this is what has worked for me:

The structure that works best is highly dependent on your team's culture, but this is what has worked for me:


The Kubernetes Documentation Style Guide has great recommendations for technical documentation, but the most important for playbooks is: make your commands trivial to copy and paste.


Keep playbooks up to date by:

Big-bang efforts such as auditing all of the playbooks for relevance are best made once initially, to get the playbooks into the same structure. I have never seen quarterly playbook reviews work.

Special thanks to Joseph Bironas for editorial feedback and ideas for this article.