Archive for September, 2012

Writing a Better SOP

Tuesday, September 25th, 2012

Within an ops team you should have 3 primary types of governance enablers: controls, policies and procedures. A control is a guiding principle, which is implemented as one or more policies (which are just rules), which are in turn standardized in a set of procedures. It’s important to have all 3: controls are very vague and policies are often general and broad in nature, so to produce consistent, quality results we need prescriptive procedures. At Joyent we call these “Standard Operating Procedures” (SOP).

The whole point of an SOP is to produce consistent results regardless of who’s using it. That means all SOPs need to be in a similar, familiar, and easy-to-follow format that is suitable for anyone who may need to use them. It also means there can be no room for ambiguity: an SOP must be explicit and carry any necessary context along with it. Ambiguity is the mortal enemy of consistency. Case in point: if you’ve ever been asked to recompile software with a large number of configure flags and you’re unable to determine which flags were used in the past, you’ll go cold with anxiety over whether or not you’re building it properly. When you go back and ask how it was built in the past, someone might say “Don’t you know how to compile software?” and the answer is likely going to be “Yes I do, but I don’t know how YOU compile software.” What’s important is that the person implementing a procedure be given all the information and context necessary to understand it and, if necessary, interpret it as appropriate for the given situation.

The first key to better SOPs is to provide a template for others to follow. Without a standard template each author will write the procedure in their own unique style. Some people will write you a book; others will just paste some lines from their terminal into a code block. The template therefore must enforce a flow that ensures we include all the needed information in a concise and complete way. Plus, we want SOP creators to focus entirely on writing the content, not debating the format.

Here is the template I use for Joyent Operations SOPs (in Confluence markup):

* Author:  {page-info:created-user}  created at: {page-info:created-date}
* Version: 1
* Revisions: {page-info:current-version}
* Reviewed by: (User @ date)
* Time to implement: 1hr
* Products this applies to: (SKU1)

{toc}

h1. Description & Scope

h1. Prerequisites

* Root access to node
* [SOP-222: Something|SOP-222: Something]
* [SOP-224: Something else|SOP-224: Something else]

h1. Procedure

h3. Step 1: Do this

{noformat}
Example
{noformat}

h3. Step 2: Do that

h3. ...

h1. Procedure Validation

# Login and verify external connectivity (ping google.com)
# curl zone IP address, page returns
# etc.

h1. Notes/Jira Examples

* [http://confluence.atlassian.com/display/DOC/JIRA+Issues+Macro]
* [http://confluence.atlassian.com/display/DOC/JIRA+Portlet+Macro]
* [https://studio.plugins.atlassian.com/browse/CONFJIRA-154]

Let’s step through the above template.

All SOPs must be numbered for easy reference. Even the template itself is SOP-000. The SOP title is in the form: “SOP-102 Creating LDAP Users”, for instance.
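A numbering convention like this is easy to check mechanically. Here is a minimal Python sketch, assuming titles always look like “SOP-” plus at least three digits, an optional colon (as in the template’s prerequisite links), and a free-form name. The function name `parse_sop_title` and the exact pattern are my own illustration, not a tool we actually run:

```python
import re

# Assumed title shape: "SOP-<digits>[:] <name>", e.g. "SOP-102 Creating LDAP Users".
SOP_TITLE = re.compile(r"^SOP-(\d{3,}):?[ \t]+(\S.*)$")


def parse_sop_title(title):
    """Return (number, name) for a well-formed SOP title, or None otherwise."""
    match = SOP_TITLE.match(title.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2)
```

A title that comes back as `None` is one that nobody will be able to reference by number later.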

The top of the SOP is full of metadata: the author, creation date, major version number, number of revisions made, and the products (or projects or whatever) that this SOP applies to. You’ll notice 2 other fields: “Reviewed by” and “Time to implement”. These are perhaps the most important of all. After an SOP is created it must be reviewed by someone else in the group, preferably with as little knowledge of the subject as possible. They should read and follow the SOP as written, starting a timer when they begin and stopping it when they are complete… it is that stopwatch time which becomes the “time to implement”. This is extremely important: the author’s own estimate will be way too short because they know what they are doing, while the time it takes a complete n00b will be more useful and truthful.

Moving on through the template, “Description & Scope” is where we provide context: what are we talking about, what does it entail at a high level, what does this impact, etc. We want to include as much information as possible to set the stage for the procedure that follows. Then we include a bulleted list of “Prerequisites”. Prerequisites are the part of any procedure that most commonly gets skimped on, and they are also generally the most time consuming.

The meat of the SOP is the procedure itself. I strongly believe these must be in a “Step 1… Step 2… Step 3” format; it must be easy and intuitive to follow, and in some cases it may be used as a checklist during sensitive procedures. It’s important that the steps truly start at the beginning and go to the end. “Step 1: Login to server X” may seem overly simplistic but is necessary for clarity if multiple machines are involved. I also like to have the final step be “Done” to make it clear that you have reached the end.

Just as important as the procedure is the “Procedure Validation” section… to ensure a quality job we must not only perform the procedure but validate it in one or more ways to ensure it was really done right. This has the added side effect of giving the person doing the work the satisfaction that it was done properly and they didn’t screw something up along the way.

Last is the “Notes/Jira Examples” section, a place to include external links as appropriate. If possible I like to link in tickets (we use Jira) which have relied on the SOP before, so that if by chance there is some confusion the reader can find examples of the work being done in the past.
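Because the template is this regular, a draft can be sanity-checked before it ever goes to a reviewer. What follows is only a sketch under my own assumptions: the draft is available as raw Confluence wiki markup, the required headings are exactly the four `h1.` sections from the template, and steps are written as `h3. Step N: ...`. The name `lint_sop` is hypothetical.

```python
import re

# Section headings taken from the template (Confluence "h1." markup).
REQUIRED_SECTIONS = [
    "Description & Scope",
    "Prerequisites",
    "Procedure",
    "Procedure Validation",
]


def lint_sop(markup):
    """Return a list of human-readable problems found in an SOP draft."""
    problems = []

    # Collect every top-level heading in the draft.
    headings = re.findall(r"^h1\.[ \t]*(.+?)[ \t]*$", markup, flags=re.MULTILINE)
    for section in REQUIRED_SECTIONS:
        if section not in headings:
            problems.append("missing section: " + section)

    # Steps should be numbered 1, 2, 3, ... with no gaps or repeats.
    found = re.findall(r"^h3\.[ \t]*Step (\d+):", markup, flags=re.MULTILINE)
    steps = [int(n) for n in found]
    if not steps:
        problems.append("procedure has no numbered steps")
    elif steps != list(range(1, len(steps) + 1)):
        problems.append("steps are not numbered sequentially: %s" % steps)

    return problems
```

An empty list means the draft at least has the right skeleton. It says nothing about whether the content is any good, which is still the reviewer’s job.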

An optional section that I’ve used before is a “Rationale”. In this section you would include notes on why you chose to implement the procedure in the way that you did. This allows for continuous improvement of the SOP. In most cases there are many ways to solve a problem; conveying why you chose the method you did will help you hone the procedure in the future while learning from the past. Without it you’re likely to have regression or duplication crop up.

This is the model that we’ve used at Joyent for several years and it has stood the test of time. I believe it to be a very solid standard for writing SOPs, sharing knowledge within the organization, and avoiding any one single person becoming a constraint. If you have refinements or a better method, I’d love to learn about it.

Configuration Management on SmartOS

Friday, September 21st, 2012

Over the last couple of days I added a bunch of SmartOS documentation on getting the major configuration management solutions working. Pointers for CFEngine 3 and Puppet are there with the basics on getting started. I added extensive documentation for Chef, suitable even for users entirely new to Chef. I’ve also populated a GitHub repo with cookbooks, Knife bootstraps, and a full framework for using Chef Solo with SmartOS.

Go find all the docs in the SmartOS Wiki: Configuration Management on SmartOS

A Return to Linux on the Workstation

Thursday, September 20th, 2012

In my day-to-day work I rely on two systems: a MacBook Pro and a custom-built PC workstation. My Mac is used for all my travel needs and communications (email, Jabber, Skype, etc). All my “real work” is done on the workstation, which I refresh to the latest and greatest every 3-4 years, run dual headed, etc. Up until about 30 days ago my primary workstation had run some variety of Solaris for nearly 10 years, starting with Solaris 9 when Solaris became viable on x86, then OpenSolaris and the various Solaris Express releases and finally Solaris 11 Beta. It was one month ago today that I finally re-installed it with Ubuntu, returning me to Linux officially. Times are a’ changin’… so I thought I’d share the tale of my long experience and the events that brought me back to Linux on the desktop.

As I stated in a recent talk, and was then amused to see quoted on Twitter a couple of times since, I never really intended to run a “Solaris Desktop”. I didn’t want a desktop; rather I wanted a server on my desk. Building a desktop operating system is really hard: it involves supporting all manner of new and strange hardware. It’s hard enough on desktop PCs but it’s absolutely ridiculous when you consider all the variations of laptops. On my workstation I always installed a standard Intel e1000g dual port NIC, a Sound Blaster 16 or 128, and a well supported NVidia graphics card. So long as I could start an X server on dual displays and start Enlightenment, my window manager of choice, I was happy. The only apps I rely on are a browser and several dozen Eterms… little else. What was important to me was that I had a platform on my desk with which to experiment and prototype on Solaris for later implementation in the data center.

With the addition of ZFS, Solaris became an extremely powerful testing platform. Several large disks in my workstation formed a Zpool on which everything but the base OS was installed. The OS root itself was on a small 16GB SSD (it was bad ass once upon a time). This allowed me to frequently do fresh installs of new releases of Solaris and OpenSolaris. After install, I just imported the Zpool which put my home dir, /usr/local, /opt, etc back into place and I was running again.

What has always bugged me about Solaris is that the software packaging solutions have always been awful. For a long time we were limited to whatever shipped with Solaris or was available from Blastwave. But Blastwave was little comfort because so frequently a single package install would have an absurd dependency on some very foundational package, forcing an upgrade of everything, like it or not, which invariably would break something. In my former Linux days I was fond of Linux from Scratch and later became a fan of Gentoo, so my solution for Solaris was to hand build all my fundamental applications myself and then simply drag those binaries from release to release for a very long period of time. While I appreciated having the latest and greatest Solaris on my desk, I certainly missed the ease of simply installing an RPM and being done. The idea of trying the latest KDE was a seemingly insurmountable challenge and waste of time.

About 30 days ago two factors caused me to finally throw in the towel on Solaris as a workstation OS.

The first was that I finally joined the club of folks who have spilled liquid on their MacBooks. After 3 years of faithful service my Mac was dead. This happened on a Saturday and I suddenly realized that on Monday I’d be unable to join our corporate Jabber channels and I wouldn’t have Skype access. Suddenly I became aware of how much I was relying on my MacBook for daily communication and that I was essentially going to be cut off. Getting all these types of services working on my Solaris workstation was possible, but hardly seemed worth the effort and I only had a day to get back to full capability to be ready for Monday morning.

The second was that Solaris is dead. Illumos is the future of the platform and the desktop options there are very weak. All my work these days is on SmartOS, which is a dedicated hypervisor platform, so there was no way I was going to whip it into a workstation platform in short order, not to mention that it’d be a fruitless exercise even if accomplished. It was clear that having a server on my desk that also possessed the basics required for a passable X environment was at an end. Besides that, thanks to KVM support in SmartOS it was becoming increasingly clear that I was completely out of touch with the Linux world, which I was now supporting more frequently as a guest OS. And, last but not least, Linux now has ZFS support, so I could theoretically install Linux, get ZFS support added, and then import all my important filesystems. It was time to return to a Linux workstation.

I’m getting older and lazier, so going back to the Gentoo lifestyle wasn’t interesting to me. Ubuntu continues to be all the rage, so I decided Ubuntu 12.04 was the way to go. And, I turned out to be right… within 4 hours of the MacBook toasting I had installed Ubuntu 12.04, gotten my displays working properly, installed all the software I needed, including Skype and Enlightenment, added ZFS support and mounted my home directory and was looking at my desktop environment as though nothing had happened. It was a wonderful experience.

Getting ZFS support working with Ubuntu is very simple: install the ZFS for Linux PPA packages and reboot. The only mistake I made was that I initially installed 32bit Ubuntu, thanks to my outdated Linux knowledge of compatibility issues running a 64bit kernel. On the 32bit kernel ZFS took almost an hour to locate and import the pool… after I reinstalled Ubuntu 12.04 64bit and added the ZFS packages again, my Zpool imported just fine. One thing that helped me here was that my pools are very old. To keep maximum flexibility in which OpenSolaris release I used, I never allowed my pools to be upgraded, which let me run older OS releases if needed. As a result, ZFS for Linux had no problems importing my old version pools.

After using ZFS on Linux for some time now I can say that it works very well but the performance is less than stellar. The performance is good enough that I NFS export all my old file systems for use, but bad enough that I created a fresh home directory on ext4.

I did play with Unity a bit before switching back to Enlightenment DR16 (the best window manager ever created). Unity is a really excellent desktop and a first rate contender against Windows 7/8 and even OS X… but ultimately I still prefer the speed and minimalism of an old school window manager. The only thing that actually bugged me about Unity was the way they tried to be very clever about window titles… they sort of blur out from left to right. While I realize it’s a nifty visual device, to me it looked like a theming mistake and I disliked windows with the title “Firef…”.

One thing that did surprise me about my return to Linux was how little the desktop applications had changed. Finding that Pidgin was still the IM client of choice threw me for a loop. I experimented with Empathy but had horrific stability issues, which is a shame because it’s a much nicer client than Pidgin. Ultimately I found a theme mix that worked for me and settled on Pidgin, but was sad there weren’t more viable options (yes, there are alternatives, but they sucked more than Pidgin). Getting Skype running easily was a pleasant surprise: no pain, no problems, and people I called told me it was the best I’d ever sounded on Skype. The other various apps were less exciting than I had hoped; I was sad that Eye of Gnome hadn’t died a long time ago. I think the two high points were realizing that I could use the Arduino IDE on my workstation and looking at Shotwell. Shotwell is an amazing application, but it’s not enough to convince me to move all my photos out of iPhoto.

In the end, 3 days later I was issued a replacement MacBook Pro, which I got just as Mountain Lion was released. Thankfully installing Mountain Lion and then recovering from my Time Machine backup went well and I was back to my normal workflows. While I’m sad that, at least for me, the era of Solaris as a viable workstation has come to an end, I am glad of all the new life Illumos distros have as first class server OSes. I may not have the server on my desk any more but the era of the all-things-to-all-people OS is, imho, done.