Writing a Better SOP

Posted on September 25, 2012

Within an ops team you should have 3 primary types of governance enablers: controls, policies and processes. A control is a guiding principle, which is implemented as a one or more policies (which are just rules), which are in turn standardized in a set of procedures. Its important to have all 3, because controls are very vague, policies are often general and broad in nature, which means to provide consistent quality results we require prescriptive procedures. At Joyent we call these “Standard Operating Procedures” (SOP).

The whole point of an SOP is to produce consistent results regardless of who’s using it. That means that all SOP’s need to be in a similar, familiar, and easy to follow format that is suitable to anyone who may need to use it. That, therefore, means that to get those consistent results there can be no room for ambiguity, it must be explicit and convey any necessary context along with it. Ambiguity is the mortal enemy of consistency. Case in point, if you’ve ever been asked to recompile software with a large number of configure flags, if your unable to determine which flags were used in the past you’ll go cold with anxiety over whether or not your building it properly. When you go back and ask who it was built in the past someone might say “Don’t you know how to compile software?” and the answer is likely going to be “Yes I do, but I don’t know how YOU compile software.” Whats important is that the person implementing a procedure be given all the information and context necessary to understand, and if necessary, interpret the information as appropriate for the given situation.

The first key to better SOP’s is to provide a template for others to follow. Without a standard template each author will write the procedure in their own unique style. Some people will write you a book, others will just paste some lines from their terminal into a code block. The template therefore must enforce a certain flow that ensures we include all the needed information but in a concise and complete way. Plus, we want SOP creators to focus entirely on writing the content, not debating the format.

Here is the template I use for Joyent Operations SOPs (in Confluence markup):

* Author:  {page-info:created-user}  created at: {page-info:created-date}
* Version: 1
* Revisions: {page-info:current-version}
* Reviewed by: (User @ date)
* Time to implement: 1hr
* Products this applies to: (SKU1)

{toc}

h1. Description & Scope

h1. Prerequisites

* Root access to node
* [SOP-222: Something|SOP-222: Something]
* [SOP-224: Something else|SOP-224: Something else]

h1. Procedure


h3. Step 1: Do this

{noformat}
Example
{noformat}

h3. Step 2: Do that

h3. ...

h1. Procedure Validation

# Login and verify external connectivity (ping google.com)
# curl zone IP address, page returns
# etc.

h1. Notes/Jira Examples

* [http://confluence.atlassian.com/display/DOC/JIRA+Issues+Macro]
* [http://confluence.atlassian.com/display/DOC/JIRA+Portlet+Macro]
* [https://studio.plugins.atlassian.com/browse/CONFJIRA-154]

Lets step through the above template.

All SOPs must be numbered for easy reference. Even the template itself is SOP-000. The SOP title is in the form: “SOP-102 Creating LDAP Users”, for instance.

The top of the SOP is full of metadata. The author, creation date, major version number and number of revisions made and products (or projects or whatever) that this SOP applies to. You’ll notice 2 other fields: “Reviewed By” and “Time to implement”. These are perhaps the most important of all. After an SOP is created it must be reviewed by someone else in the group, preferably with as little knowledge of the subject as possible. They should read and follow the SOP as written, starting a timer when the begin and stopping the timer when they are complete… it is that stopwatch time which becomes the “time to implement”. This is extremely important, the time estimate for implementation by the author will be way too short because they know what they are doing, the time it takes a complete n00b will be more useful and truthful.

Moving on through the template, “Description and Scope” are where we provide context. What are we talking about, what does it entail at a high level, what does this impact, etc. We want to include as much information as possible to set the stage for the procedure that follows. Then we include a bulleted list of “Prerequisites”. The single most common part of any procedure that gets skimped on is the prerequisites and they are also generally the most time consuming.

The meat of the SOP is the procedure itself. I strongly believe these must be in a “Step 1… Step 2… Step 3” format; it must be easy and intuitive to follow and in some cases may be used as a checklist during sensitive procedures. Its important that these truly start at the beginning and go to the end. “Step 1: Login to server X” may be overly simplistic but necessary for clarity if multiple machines are involved. I also like to have the final step be “Done” to make it clear that you have reached the end.

Just as important as the procedure is the “Validation Steps”… to ensure a quality job we must not only preform the proceedure but validate it in one or more ways to ensure it was really done right. This has the added side effect of giving the person doing the work the satisfaction that it was done properly and they didn’t screw something up along the way.

Lastly is a place to include external links as appropriate. If possible I like to link in tickets (we use Jira) which have relied on the SOP before, so that if by chance there is some confusion they can find examples of the work being done in the past.

An optional section that I’ve used before is a “Rationale”. In this section you would include notes on why you chose to implement the procedure in the way that you did. This allows for continuous improvement of the SOP. In most cases there are many ways to solve a problem, conveying why you chose the method you did will help you hone the procedure in the future while learning from the past. Without it your likely to have regression or duplication crop up.

This is the model that we’ve used at Joyent for several years and it has stood the test of time. I believe it to be a very solid standard for writing SOP’s and sharing knowledge within the organization and avoiding any one single person becoming a constraint. If you have refinements or a better method, I’d love to learn about it.