Writing a Better SOP

Within an ops team you should have three primary types of governance enablers: controls, policies, and procedures. A control is a guiding principle, which is implemented as one or more policies (which are just rules), which are in turn standardized in a set of procedures. It’s important to have all three: controls are very vague and policies are often general and broad in nature, so to provide consistent, quality results we require prescriptive procedures. At Joyent we call these “Standard Operating Procedures” (SOPs).

The whole point of an SOP is to produce consistent results regardless of who’s using it. That means all SOPs need to be in a similar, familiar, and easy-to-follow format that is suitable for anyone who may need to use them. It also means there can be no room for ambiguity: an SOP must be explicit and carry any necessary context along with it. Ambiguity is the mortal enemy of consistency. Case in point: if you’ve ever been asked to recompile software with a large number of configure flags and you’re unable to determine which flags were used in the past, you’ll go cold with anxiety over whether or not you’re building it properly. When you go back and ask how it was built in the past, someone might say “Don’t you know how to compile software?” and the answer is likely going to be “Yes I do, but I don’t know how YOU compile software.” What’s important is that the person implementing a procedure be given all the information and context necessary to understand and, if necessary, interpret the information as appropriate for the given situation.

The first key to better SOPs is to provide a template for others to follow. Without a standard template each author will write the procedure in their own unique style. Some people will write you a book; others will just paste some lines from their terminal into a code block. The template must therefore enforce a flow that ensures we include all the needed information, completely but concisely. Plus, we want SOP creators to focus entirely on writing the content, not debating the format.

Here is the template I use for Joyent Operations SOPs (in Confluence markup):

* Author:  {page-info:created-user}  created at: {page-info:created-date}
* Version: 1
* Revisions: {page-info:current-version}
* Reviewed by: (User @ date)
* Time to implement: 1hr
* Products this applies to: (SKU1)

{toc}

h1. Description & Scope

h1. Prerequisites

* Root access to node
* [SOP-222: Something|SOP-222: Something]
* [SOP-224: Something else|SOP-224: Something else]

h1. Procedure

h3. Step 1: Do this

{noformat}
Example
{noformat}

h3. Step 2: Do that

h3. ...

h1. Procedure Validation

# Login and verify external connectivity (ping google.com)
# curl zone IP address, page returns
# etc.

h1. Notes/Jira Examples

* [http://confluence.atlassian.com/display/DOC/JIRA+Issues+Macro]
* [http://confluence.atlassian.com/display/DOC/JIRA+Portlet+Macro]
* [https://studio.plugins.atlassian.com/browse/CONFJIRA-154]

Let’s step through the above template.

All SOPs must be numbered for easy reference. Even the template itself is SOP-000. The SOP title is in the form: “SOP-102 Creating LDAP Users”, for instance.
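
If you ever want to check titles mechanically, the numbering convention is easy to parse. The following is only a minimal sketch in Python: the function name `parse_sop_title` and the exact pattern are assumptions for illustration, not part of any Joyent tooling. It accepts both the title form above and the “SOP-222: Something” form used in the template’s links.

```python
import re
from typing import Tuple

# Assumed convention: "SOP-" followed by digits, an optional colon,
# one space, and then a non-empty title.
SOP_TITLE = re.compile(r"^SOP-(\d+):? (\S.*)$")


def parse_sop_title(title: str) -> Tuple[int, str]:
    """Split 'SOP-102 Creating LDAP Users' into (102, 'Creating LDAP Users')."""
    match = SOP_TITLE.match(title)
    if match is None:
        raise ValueError(f"not a valid SOP title: {title!r}")
    return int(match.group(1)), match.group(2)
```

A title with no number, such as “Creating LDAP Users”, raises `ValueError`, which is the point: an unnumbered SOP cannot be referenced easily.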

The top of the SOP is full of metadata: the author, creation date, major version number, number of revisions made, and the products (or projects or whatever) that this SOP applies to. You’ll notice two other fields: “Reviewed by” and “Time to implement”. These are perhaps the most important of all. After an SOP is created it must be reviewed by someone else in the group, preferably with as little knowledge of the subject as possible. They should read and follow the SOP as written, starting a timer when they begin and stopping it when they are complete. That stopwatch time becomes the “time to implement”. This is extremely important: the author’s own estimate will be way too short because they know what they are doing, while the time it takes a complete n00b will be more useful and truthful.

Moving on through the template, “Description & Scope” is where we provide context. What are we talking about, what does it entail at a high level, what does this impact, etc. We want to include as much information as possible to set the stage for the procedure that follows. Then we include a bulleted list of “Prerequisites”. Prerequisites are the part of a procedure most commonly skimped on, and they are also generally the most time-consuming.

The meat of the SOP is the procedure itself. I strongly believe these must be in a “Step 1… Step 2… Step 3” format; it must be easy and intuitive to follow, and in some cases it may be used as a checklist during sensitive procedures. It’s important that the steps truly start at the beginning and go to the end. “Step 1: Login to server X” may seem overly simplistic but is necessary for clarity if multiple machines are involved. I also like to have the final step be “Done” to make it clear that you have reached the end.
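
The step format is regular enough to lint. As a rough sketch (Python, with a hypothetical `check_steps` helper that is not part of Confluence or any existing tool), the following scans Confluence markup for `h3. Step N: ...` headings as they appear in the template, and reports gaps in the numbering or a missing final “Done” step. The “Done” check reflects my preference above rather than a hard rule.

```python
import re
from typing import List

# Matches the template's step headings, e.g. "h3. Step 1: Do this".
STEP_HEADING = re.compile(r"^h3\. Step (\d+): (.+)$")


def check_steps(sop_text: str) -> List[str]:
    """Return a list of problems found in the step headings (empty if none)."""
    steps = []
    for line in sop_text.splitlines():
        match = STEP_HEADING.match(line.strip())
        if match:
            steps.append((int(match.group(1)), match.group(2).strip()))

    if not steps:
        return ["no 'h3. Step N: ...' headings found"]

    problems = []
    # Steps must be numbered 1, 2, 3, ... in the order they appear.
    for expected, (number, _title) in enumerate(steps, start=1):
        if number != expected:
            problems.append(f"expected Step {expected}, found Step {number}")

    # Optional convention: the last step is literally "Done".
    if steps[-1][1].lower() != "done":
        problems.append("final step is not 'Done'")
    return problems
```

An empty list means the headings run from Step 1 upward without gaps and end on “Done”.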

Just as important as the procedure is the “Procedure Validation” section. To ensure a quality job we must not only perform the procedure but also validate it in one or more ways to ensure it was really done right. This has the added side effect of giving the person doing the work the satisfaction that it was done properly and that they didn’t screw something up along the way.
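
Validation steps can stay manual, but if you want to script them, one possible shape is a small runner that executes every check and records a pass or fail for each. This is a sketch under assumptions: `run_validation` is a hypothetical name, and each check is any zero-argument callable you supply (for example, a wrapper around the ping or curl test from the template) that returns something truthy on success.

```python
from typing import Callable, Dict


def run_validation(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every named check and map its name to True (pass) or False (fail)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that blows up counts as a failure, not a crash,
            # so the remaining checks still run.
            results[name] = False
    return results


def all_passed(results: Dict[str, bool]) -> bool:
    """True only if every check passed."""
    return all(results.values())
```

Running all checks instead of stopping at the first failure gives the person doing the work the full picture of what did and did not succeed.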

Lastly, there is a place to include external links as appropriate. If possible, I like to link in tickets (we use Jira) that have relied on the SOP before, so that if there is some confusion readers can find examples of the work being done in the past.

An optional section that I’ve used before is a “Rationale”. In this section you would include notes on why you chose to implement the procedure in the way that you did. This allows for continuous improvement of the SOP. In most cases there are many ways to solve a problem; conveying why you chose the method you did will help you hone the procedure in the future while learning from the past. Without it you’re likely to have regression or duplication crop up.

This is the model that we’ve used at Joyent for several years and it has stood the test of time. I believe it to be a very solid standard for writing SOPs, sharing knowledge within the organization, and avoiding any single person becoming a constraint. If you have refinements or a better method, I’d love to learn about it.

2 Responses to “Writing a Better SOP”

  1. UX-admin says:

    I do the same thing. We call them “operating manuals” (“OP”).

    When a new (sub)component is packaged into an OS package, it goes through the testing phase on clean-room test systems. The operating manual is part of the testing process. At that stage, not only is the component being tested, but the documentation is being tested as well.

    If the component passes the test phase, and it passes within the test window, it will be approved by the approval instance in the system, which automatically promotes it to the next stage, the product acceptance.

    The component must now pass the product acceptance phase within the allotted window. At this stage, it is the actual customer / consumer of the component testing whether the component is producing the desired result / work. Once again, the documentation is tested at this stage as well. If the documentation is lacking, incomplete, or incorrect, the component will be marked as failed. The component owner must fix the documentation within the time allotted inside of the PTA window, or the entire process is rolled back, and a whole new process must be started.

    Assuming the component tested has made it this far, the approval will be marked in the system, and the component will be marked production ready. However, the system will allow it to be deployed only within the “PROD” phase window, i.e. assuming the testing and acceptance pass within 15 minutes, but the PROD window does not start until a week later, the system will not allow for deployment. If the system administrator chooses to override and deploy the component anyway, the system will automatically report her or him to the change control board.

    The phone will ring within 35 seconds, and an explanation will be demanded.

    If I may, I recommend you add “Approved by:” and “Approval date:” so that someday you can plug this information into a fully automated change / asset management / deployment system.

  2. UX-admin says:

    Of course in our case, most of our operating manuals detail running pkgadd(1M) and pkgrm(1M) of the component. The interaction portion of the component is usually written in collaboration with the consumer of the product.

    The actual configuration work is encapsulated into the individual packages by dedicated teams of engineers specializing in their particular component (“component owners”). The consumer of the component is the “system owner”, since they are paying for the systems to run the component out of their budget.

    The system administrator’s role is 2nd level support and deployment. They provide limited root cause analysis. In case of operating system issues, our Solaris engineering is 3rd level support. They will be given limited time access to production, but only to collect data. The issue is fixed in the next revision cycle of the component, which must undergo all the cycles (development, testing, acceptance, production) I described above.

    1st level support is performed by the hardware vendor, which has a small team on site and a warehouse of parts. Vendor engineers do a walkthrough twice a day and work with 2nd level support to coordinate hardware servicing windows.

    If the component itself is faulty or needs revision, the dedicated engineering team (“component owner”) will be given time-restricted access to the system to gather telemetry (they are not allowed to change anything), and the findings will then be used to devise a fix in development and testing. The issue is fixed on the next deployment cycle.

    Because of this, the organization does not recognize “time of crisis”. We go through this process no matter what happens, and no matter how “bad” things might appear. Doing this for years, our production has become super-stable so that we almost never experience what could otherwise be termed as “time of crisis”.

    The parts I have not mentioned yet are the specification and architectural documents. Those are developed by our component owner teams, and must be written in such a way that an operating manual can be derived from them. Perhaps I will write more about it some other time.