Archive for the ‘DevOps’ Category

Why SysAdmin’s Can’t Code

Sunday, May 12th, 2013

Most systems administrators are quick, perhaps too quick, to tell you “I’m not a coder.”  Oddly, this admission normally comes after boasting about how many programming languages they know or have used.  Why is this?  Can this be changed?  Here is my 5 step plan on how any SA can become an honest to goodness programmer.

Step 1: Find a problem you care about solving, for yourself

SysAdmins don’t actually use tools, they study them. The point isn’t to solve a problem but rather to know how to solve a problem if the need ever arises. SAs are filling their tool chests full of handy hints and useful solutions for nearly any problem that can be encountered. What’s more, this is a subconscious tendency… they want to use things, but because their true goal is simply learning and knowing the tool, when they wish to build something with this thing they’ve just learned, they blank out. Learning is the goal, goal achieved, next tool.

This causes any SA approaching programming languages to become masters of hello_world in a dozen languages. Again, the goal is to have a basic understanding of the language and then to move on.

The only way to break out of this rut is to find a problem you actually have yourself, and solve it. Don’t write a program for someone else, write it for yourself. Write something that integrates with your LDAP servers or pulls metrics and stores them in a database… whatever it is, make it practical, not an abstract academic exercise.

Step 2: Pick a language and stick with it

Because SysAdmins are masters of adaptation and have so much hello_world foo, they are likely to switch languages before making any real progress. If you want to write in C, then do that, but don’t switch to Ruby or Python because writing regex in C is a PITA… push through and learn how to do regex in C.

SysAdmins also need to know what everyone else knows… or more. This is a point of pride, master of all trades. Just because Go is on the rise, don’t think that PHP or Python is obsolete and useless. If you want to use Go, then fine, but stick with it. If you want to use Clojure, fine, but don’t feel lame just because suddenly people are talking about Node.js.

To facilitate this, write lots and lots of small programs that exercise different parts of the language and let you build a deeper knowledge of the core language. Save all those in a source control repository for later review and to help you build your confidence.

Step 3: Scripting isn’t programming

SysAdmins are tool masters… they know as many of them as possible. This is why scripting is a natural thing for us, it allows us to plug different tools together. But this isn’t really programming, it’s plumbing. This becomes more apparent when you start programming in a non-shell language. A sysadmin’s code tends to have a lot of execs… that is, essentially writing shell scripts in a non-shell language.

Step 4: Modules and libraries aren’t cheating

SysAdmins are purists and have a lot of ego. Using a module or library is akin to cheating. This is why many SysAdmins only know basic C: they don’t have the drive to write their own protocol implementations, but view using a library as cheating… so they are masters of pointers and little more. Many programmers look at modules and libraries like SAs look at programs themselves, as building blocks to be plugged together to achieve a goal.

Don’t exec a tool, use a module or library, and don’t feel weird about doing it.
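As a small illustration of the difference (a sketch, not something from the original post): the first function below shells out to `df` and scrapes its text output, the way a shell script would, while the second asks the Python standard library directly through `shutil.disk_usage`. Both function names are made up for this example, and the `df` column layout is assumed to be the POSIX `-P` format.

```python
import shutil
import subprocess


def free_bytes_via_exec(path="/"):
    """Plumbing style: run `df -Pk` and scrape the text it prints."""
    out = subprocess.check_output(["df", "-Pk", path], text=True)
    # Assumed layout of the last line:
    # filesystem 1024-blocks used available capacity mountpoint
    fields = out.strip().splitlines()[-1].split()
    return int(fields[3]) * 1024


def free_bytes_via_library(path="/"):
    """Programming style: call the library and get a typed result back."""
    return shutil.disk_usage(path).free


print(free_bytes_via_library("/"))
```

The exec version breaks as soon as the output format shifts (a filesystem name containing a space is enough), while the library version returns a number with nothing to parse.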

Step 5: Don’t think about other people

One of the biggest impediments to coding is embarrassment. That your code isn’t good enough or it’s formatted wrong or you’re not doing things in a certain “normal” way. This is the equivalent of going to the first day of school and being made to feel like an outcast for not having the right clothes or shoes… you like it and it works for you, but somehow you’re wrong and just not cool enough to fit in. For anyone this is discouraging, but for a SysAdmin whose pride is in being a master of all things it’s unbearable and causes them to throw in the “I’m not a coder” towel.

The solution is to just not give a shit. If your lines of code exceed 80 characters, so what. If you use curly braces instead of “end”, fine. Just don’t listen to those people. What matters is functional programs, not pretty syntax. If you truly pick a language and stick with it, at some point you’ll fall into common practice fairly naturally, but until then don’t allow yourself to give up just because you constantly feel like some imaginary critic is yelling at you while you code. This is why it’s important to, first and foremost, write for yourself to solve problems that you yourself have.

Hadoop Analysis of Apache Logs Using Flume-NG, Hive and Pig

Thursday, December 27th, 2012

Big Data is the hotness, there is no doubt about it. Every year it’s just gotten bigger and bigger and shows no sign of slowing. There is a lot out there about big data, but despite the hype, there isn’t a lot of good technical content for those who want to get started. The lack of technical how-to info is made worse by the fact that many Hadoop projects have moved their documentation around over time and Google searches commonly point to obsolete docs. My intent here is to provide some solid guidance on how to actually get started with practical uses of Hadoop and to encourage others to do the same.

From an SA perspective, the most interesting Hadoop sub-projects have been those for log transport, namely Scribe, Chukwa, and Flume. Let’s examine each.

Log Transport Choices

Scribe was created at Facebook and got a lot of popularity early on due to adoption at high profile sites like Twitter, but development has apparently ceased  and word is that Facebook stopped using it themselves.  So Scribe is off my list.

Chukwa is a confusing beast. It’s said to be distributed with Hadoop’s core, but that’s just an old version in the same sub-directory of the FTP site; the actual current version is found under the incubator sub-tree. It is a very comprehensive solution, including a web interface for log analysis, but that functionality is based on HBase, which is fine if you want to use HBase but may be a bit more than you wish to chew off for simple Hive/Pig analysis. Most importantly, the major Hadoop distributions from HortonWorks, MapR, and Cloudera use Flume instead. So if you’re looking for a comprehensive toolset for log analysis, Chukwa is worth checking out, but if you simply need to efficiently get data into Hadoop for use by other Hadoop components, Flume is the clear choice.

That brings us to Flume, more specifically Flume-NG. The first thing to know about Flume is that there were major changes to Flume pre and post 1.0, major enough that they took to referring to pre 1.0 as “Flume OG” (“Old Generation” or “Original Gangsta” depending on your mood) and the new post 1.0 releases as “Flume NG”. Whenever looking at documentation or help on the web about Flume be certain as to which you are looking at! In particular, stay away from the Flume CWiki pages and refer only to flume.apache.org. I say that because there is so much old cruft in the CWiki pages that you can be easily misled and become frustrated, so just avoid it.

Now that we’ve thinned out the available options, what can we do with Flume?

Getting Started with Flume

Flume is a very sophisticated tool for transporting data.  We are going to focus on log data, however it can transport just about anything you throw at it.  For our purposes we’re going to use it to transport Apache log data from a web server back to our Hadoop cluster and store it in HDFS where we can then operate on it using other Hadoop tools.

Flume NG is a java application that, like other Hadoop tools, can be downloaded, unpacked, configured and run, without compiling or other forms of tinkering.  Download the latest “bin” tarball and untar it into /opt and rename or symlink to “/opt/flume” (it doesn’t matter where you put it, this is just my preference).  You will need to have Java already installed.

Before we can configure Flume it’s important to understand its architecture. Flume runs as an agent. The agent is sub-divided into 3 categories of components: sources, channels, and sinks. Inside the Flume agent process there is a pub-sub flow between these 3 components. A source accepts or retrieves data and sends it into a channel. Data then queues in the channel. A sink takes data from the channel and does something with it. There can be multiple sources, multiple channels, and multiple sinks per agent. The only important thing to remember is that a source can write to multiple channels, but a sink can draw from only one channel.

Let’s take an example. A “source” might tail a file. New log lines are sent into a channel where they are queued up. A “sink” then extracts the log lines from the channel and writes them into HDFS.

At first glance this might appear overly complicated, but the distinct advantage here is that the channel de-couples input and output, which is important if you have performance slowdowns in the sinks. It also allows the entire system to be plugin-based. Any number of new sinks can be created to do something with data… for instance, Cassandra sinks are available, and there is an IRC sink for writing data into an IRC channel. Flume is extremely flexible thanks to this architecture.
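To make the source, channel, and sink roles concrete, here is a toy model in Python. It is only a sketch of the idea and not how Flume is implemented: the `Source` and `Sink` classes are invented for this illustration, and a bounded `queue.Queue` stands in for a memory channel.

```python
import queue


class Source:
    """Accepts events and copies each one into every attached channel."""

    def __init__(self, channels):
        self.channels = channels

    def emit(self, event):
        for channel in self.channels:
            channel.put(event)


class Sink:
    """Drains exactly one channel and hands events to a writer callback."""

    def __init__(self, channel, write):
        self.channel = channel
        self.write = write

    def drain(self):
        while True:
            try:
                event = self.channel.get_nowait()
            except queue.Empty:
                return
            self.write(event)


# Two bounded channels, loosely mirroring a memory channel's "capacity".
to_hdfs = queue.Queue(maxsize=100)
to_local = queue.Queue(maxsize=100)

hdfs_lines, local_lines = [], []
source = Source([to_hdfs, to_local])
sinks = [Sink(to_hdfs, hdfs_lines.append), Sink(to_local, local_lines.append)]

for line in ["GET /a 200", "GET /b 404"]:
    source.emit(line)  # events queue up even if no sink is running yet

for sink in sinks:
    sink.drain()       # each sink empties its own channel at its own pace

print(hdfs_lines == local_lines)
```

Because the source only ever talks to channels, a slow or stalled sink shows up as a filling queue rather than as a blocked reader of the log file.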

In the real world we want to collect data from a local file, send it across the network and then store it centrally.  In Flume we’d accomplish this by chaining agents together.  The “sink” of one agent sends to the “source” of another.  The standard method of sending data across the network with Flume is using Avro.  For our purposes here you don’t need to know anything about Avro except one of the things it can do is to move data over the network.  Here is what this ultimately looks like:

So on our web server, we create a /opt/flume/conf/flume.conf that looks like this:

## Flume NG Apache Log Collection
## Refer to https://cwiki.apache.org/confluence/display/FLUME/Getting+Started
##
# http://flume.apache.org/FlumeUserGuide.html#exec-source
agent.sources = apache
agent.sources.apache.type = exec
agent.sources.apache.command = gtail -F /var/log/httpd/access_log
agent.sources.apache.batchSize = 1
agent.sources.apache.channels = memoryChannel
agent.sources.apache.interceptors = itime ihost itype
# http://flume.apache.org/FlumeUserGuide.html#timestamp-interceptor
agent.sources.apache.interceptors.itime.type = timestamp
# http://flume.apache.org/FlumeUserGuide.html#host-interceptor
agent.sources.apache.interceptors.ihost.type = host
agent.sources.apache.interceptors.ihost.useIP = false
agent.sources.apache.interceptors.ihost.hostHeader = host
# http://flume.apache.org/FlumeUserGuide.html#static-interceptor
agent.sources.apache.interceptors.itype.type = static
agent.sources.apache.interceptors.itype.key = log_type
agent.sources.apache.interceptors.itype.value = apache_access_combined

# http://flume.apache.org/FlumeUserGuide.html#memory-channel
agent.channels = memoryChannel
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100

## Send to Flume Collector on 1.2.3.4 (Hadoop Slave Node)
# http://flume.apache.org/FlumeUserGuide.html#avro-sink
agent.sinks = AvroSink
agent.sinks.AvroSink.type = avro
agent.sinks.AvroSink.channel = memoryChannel
agent.sinks.AvroSink.hostname = 1.2.3.4
agent.sinks.AvroSink.port = 4545

## Debugging Sink, Comment out AvroSink if you use this one
# http://flume.apache.org/FlumeUserGuide.html#file-roll-sink
#agent.sinks = localout
#agent.sinks.localout.type = file_roll
#agent.sinks.localout.sink.directory = /var/log/flume
#agent.sinks.localout.sink.rollInterval = 0
#agent.sinks.localout.channel = memoryChannel

This configuration looks overwhelming at first, but it breaks down simply into an “exec” source, a “memory” channel, and an “Avro” sink, with additional parameters specified for each. The syntax for each is in the following form:

agent_name.sources = source1 source2 ...
agent_name.sources.source1.type = exec
...

agent_name.channels = channel1 channel2 ...
agent_name.channels.channel1.type = memory
...

agent_name.sinks = sink1 sink2 ...
agent_name.sinks.sink1.type = avro
...

In my example the agent name was “agent”, but you can name it anything you want. You will specify the agent name when you start the agent, like this:

$ cd /opt/flume
$ bin/flume-ng agent -f conf/flume.conf -n agent
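If an agent starts but no data moves, one thing worth checking is whether every source and sink points at a channel that was actually declared. The following Python sketch does that check against the naming pattern shown above. The helper names (`parse_properties`, `check_agent`) are invented for this example, and the parser only handles plain `key = value` lines and `#` comments, which is a simplification of the real Java properties format.

```python
def parse_properties(text):
    """Parse simple `key = value` lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_agent(props, agent):
    """Return a list of wiring problems for one agent name."""
    problems = []
    declared = set(props.get(f"{agent}.channels", "").split())
    for name in props.get(f"{agent}.sources", "").split():
        for ch in props.get(f"{agent}.sources.{name}.channels", "").split():
            if ch not in declared:
                problems.append(f"source {name}: unknown channel {ch}")
    for name in props.get(f"{agent}.sinks", "").split():
        chans = props.get(f"{agent}.sinks.{name}.channel", "").split()
        if len(chans) != 1:
            problems.append(f"sink {name}: needs exactly one channel")
        elif chans[0] not in declared:
            problems.append(f"sink {name}: unknown channel {chans[0]}")
    return problems


# A trimmed-down copy of the web server config from this post.
SAMPLE = """
agent.sources = apache
agent.sources.apache.type = exec
agent.sources.apache.channels = memoryChannel
agent.channels = memoryChannel
agent.channels.memoryChannel.type = memory
agent.sinks = AvroSink
agent.sinks.AvroSink.type = avro
agent.sinks.AvroSink.channel = memoryChannel
"""

print(check_agent(parse_properties(SAMPLE), "agent") or "wiring looks consistent")
```

Note that sources use the plural `channels` key while sinks use the singular `channel`, which matches the rule that a sink draws from only one channel.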

Now that our agent is running on the web server, we need to set up the other agent, which will deposit log lines into HDFS. This type of agent is commonly called a “collector”. Here is the config:

## Sources #########################################################
## Accept Avro data In from the Edge Agents
# http://flume.apache.org/FlumeUserGuide.html#avro-source
collector.sources = AvroIn
collector.sources.AvroIn.type = avro
collector.sources.AvroIn.bind = 0.0.0.0
collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1 mc2

## Channels ########################################################
## Source writes to 2 channels, one for each sink (Fan Out)
collector.channels = mc1 mc2

# http://flume.apache.org/FlumeUserGuide.html#memory-channel
collector.channels.mc1.type = memory
collector.channels.mc1.capacity = 100

collector.channels.mc2.type = memory
collector.channels.mc2.capacity = 100

## Sinks ###########################################################
collector.sinks = LocalOut HadoopOut

## Write copy to Local Filesystem (Debugging)
# http://flume.apache.org/FlumeUserGuide.html#file-roll-sink
collector.sinks.LocalOut.type = file_roll
collector.sinks.LocalOut.sink.directory = /var/log/flume
collector.sinks.LocalOut.sink.rollInterval = 0
collector.sinks.LocalOut.channel = mc1

## Write to HDFS
# http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
collector.sinks.HadoopOut.type = hdfs
collector.sinks.HadoopOut.channel = mc2
collector.sinks.HadoopOut.hdfs.path = /flume/events/%{log_type}/%{host}/%y-%m-%d
collector.sinks.HadoopOut.hdfs.fileType = DataStream
collector.sinks.HadoopOut.hdfs.writeFormat = Text
collector.sinks.HadoopOut.hdfs.rollSize = 0
collector.sinks.HadoopOut.hdfs.rollCount = 10000
collector.sinks.HadoopOut.hdfs.rollInterval = 600

This configuration is a little different because the source accepts Avro network events and then sends them into 2 memory channels (“fan out”) which feed 2 different sinks, one for HDFS and another for a local log file (for debugging). We start this agent like so:

# bin/flume-ng agent -f conf/flume.conf -n collector

Once both sides are up, you should see data moving. Use “hadoop fs -lsr /flume” to examine files there and if you included the file_roll sink, look in /var/log/flume.

# hadoop fs -lsr /flume/events
drwxr-xr-x   - root supergroup          0 2012-12-24 06:17 /flume/events/apache_access_combined
drwxr-xr-x   - root supergroup          0 2012-12-24 06:17 /flume/events/apache_access_combined/cuddletech.com
drwxr-xr-x   - root supergroup          0 2012-12-24 09:50 /flume/events/apache_access_combined/cuddletech.com/12-12-24
-rw-r--r--   3 root supergroup     224861 2012-12-24 06:17 /flume/events/apache_access_combined/cuddletech.com/12-12-24/FlumeData.1356329845948
-rw-r--r--   3 root supergroup      85437 2012-12-24 06:27 /flume/events/apache_access_combined/cuddletech.com/12-12-24/FlumeData.1356329845949
-rw-r--r--   3 root supergroup     195381 2012-12-24 06:37 /flume/events/apache_access_combined/cuddletech.com/12-12-24/FlumeData.1356329845950

Flume Tunables & Gotchas

There are a lot of tunables to play with and carefully consider in the example configs above. I included the documentation links for each component and I highly recommend you review them. Let’s specifically look at some things that might cause you frustration while getting started.

First, interceptors. If you look at our HDFS sink path, you’ll see the path includes “log_type”, “host”, and a date. Those values are metadata headers attached to each event when the source grabs it, and you attach them using an “interceptor”. So look back at the source where we ‘gtail’ our log file and you’ll see that we’re using three interceptors (static, host, and timestamp) to associate the log_type, “host”, and date with each event.
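A rough Python sketch of how those headers end up in the path follows. It only imitates the substitution for the three escapes used in this post and is not Flume’s actual code: `expand_path` is a made-up name, the timestamp value is purely illustrative, and UTC is assumed so the result is deterministic.

```python
import re
from datetime import datetime, timezone


def expand_path(template, headers):
    """Fill %{header} and %y/%m/%d escapes from an event's headers."""
    # %{name} escapes are replaced with the matching header value.
    path = re.sub(r"%\{(\w+)\}", lambda m: headers[m.group(1)], template)
    # Date escapes come from the "timestamp" header (epoch milliseconds).
    # UTC is an assumption made here to keep the example deterministic.
    when = datetime.fromtimestamp(int(headers["timestamp"]) / 1000, tz=timezone.utc)
    for escape in ("%y", "%m", "%d"):
        path = path.replace(escape, when.strftime(escape))
    return path


headers = {
    "timestamp": "1356329845948",          # as set by the timestamp interceptor
    "host": "cuddletech.com",              # as set by the host interceptor
    "log_type": "apache_access_combined",  # as set by the static interceptor
}
print(expand_path("/flume/events/%{log_type}/%{host}/%y-%m-%d", headers))
# -> /flume/events/apache_access_combined/cuddletech.com/12-12-24
```

If an interceptor is missing, the corresponding header is missing too, and the sink has nothing to substitute for that part of the path.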

Secondly, by default Flume’s HDFS sink writes out SequenceFiles. This seems fine until you run Pig or Hive and get inconsistent or unusual results back. Ensure that you specify the “fileType” as “DataStream” and the “writeFormat” as “Text”.

Lastly, there are 3 triggers that will cause Flume to “roll” the HDFS output file: size, count, and interval. When Flume writes data, if any one of the triggers is true it will roll to use a new file. By default the interval is 30 (seconds), size is 1024 (bytes), and count is 10 (events). Think about that: if any of those is true the file is rolled. So you end up with a LOT of HDFS files, which may or may not be what you want. Setting any value to 0 disables that type of rolling.
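The “any one trigger wins” rule is easy to underestimate, so here is a minimal Python sketch of the decision. It is a simplification for illustration only (the function name and arguments are invented, and the real sink does not necessarily evaluate the triggers this way), using the default values quoted above.

```python
def should_roll(bytes_written, events_written, seconds_open,
                roll_size=1024, roll_count=10, roll_interval=30):
    """Return True if any enabled trigger has been reached (0 disables one)."""
    triggers = (
        (roll_size, bytes_written),
        (roll_count, events_written),
        (roll_interval, seconds_open),
    )
    return any(limit > 0 and seen >= limit for limit, seen in triggers)


# With the defaults, ten short log lines are enough to force a new file.
print(should_roll(bytes_written=900, events_written=10, seconds_open=2))

# With the collector settings from this post (size off, 10000 events,
# 600 seconds), the same file stays open.
print(should_roll(900, 10, 2, roll_size=0, roll_count=10000, roll_interval=600))
```

The first call prints True and the second prints False, which is the difference between thousands of tiny HDFS files and a handful of reasonably sized ones.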

Analysis using Pig

Pig is a great tool for the Java challenged. It’s quick, easy, and repeatable. The only real challenge is in accurately describing the data you’re asking it to chew on.

The PiggyBank library can provide you with a set of loaders which can save you from regex hell. The following is an example of using Pig on my Flume ingested Apache combined format logs using the PiggyBank “CombinedLogLoader”:

# cd /opt/pig
# ./bin/pig
2012-12-23 10:32:56,053 [main] INFO  org.apache.pig.Main - Apache Pig version 0.10.0-SNAPSHOT (r: unknown) compiled Dec 23 2012, 10:29:56
2012-12-23 10:32:56,054 [main] INFO  org.apache.pig.Main - Logging error messages to: /opt/pig-0.10.0/pig_1356258776048.log
2012-12-23 10:32:56,543 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.12.29.198/
2012-12-23 10:32:57,030 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.12.29.198:9101

grunt> REGISTER /opt/pig-0.10.0/contrib/piggybank/java/piggybank.jar;
grunt> raw = LOAD '/flume/events/apache_access_combined/cuddletech.com/12-12-24/'
    USING org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader
    AS (remoteAddr, remoteLogname, user, time, method, uri, proto, status, bytes, referer, userAgent);
grunt> agents = FOREACH raw GENERATE userAgent;
grunt> agents_uniq = DISTINCT agents;
grunt> DUMP agents_uniq;
...

(-)
(Motorola)
(Mozilla/4.0)
(RSSaggressor)
(Java/1.6.0_24)
(restkit/4.1.2)
(Blogtrottr/2.0)
(Mozilla/5.0 ())
(Recorded Future)
...

While Pig is easy enough to install (unpack and run), you must build the Piggybank JAR, which means you’ll need a JDK and Ant. On a SmartMachine with Pig installed in /opt/pig, it’d look like this:

# pkgin in sun-jdk6-6.0.26 apache-ant
# cd /opt/pig/
# ant
....
# cd /opt/pig/contrib/piggybank/java
# ant
....
jar:
     [echo]  *** Creating pigudf.jar ***
      [jar] Building jar: /opt/pig-0.10.0/contrib/piggybank/java/piggybank.jar

BUILD SUCCESSFUL
Total time: 5 seconds

Analysis using Hive

Similar to Pig, the challenge with Hive is really just describing the schema around the data. Thankfully there is assistance out there for just this problem.

[root@hadoop02 /opt/hive]# bin/hive
Logging initialized using configuration in jar:file:/opt/hive-0.9.0-bin/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201212241029_318322444.txt
hive>
hive> CREATE EXTERNAL TABLE access(
    >   host STRING,
    >   identity STRING,
    >   user STRING,
    >   time STRING,
    >   request STRING,
    >   status STRING,
    >   size STRING,
    >   referer STRING,
    >   agent STRING)
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    > WITH SERDEPROPERTIES (
    >   "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
    >   "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
    > )
    > STORED AS TEXTFILE
    > LOCATION '/flume/events/apache_access_combined/cuddletech.com/12-12-24/';
OK
Time taken: 7.514 seconds
hive>

Now you can query to your heart’s content. Please note that if you omit the “EXTERNAL” keyword when creating the table in the above example, Hive treats it as a managed table and takes ownership of your data (dropping the table would delete the underlying files, for instance), which may not be what you want.
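If a query comes back full of NULLs, the usual suspect is the “input.regex” not matching your log lines. One way to check a line outside of Hive is to run the same pattern in Python, as in the sketch below. The pattern is the one from the table definition with the extra string escaping removed; the sample log line is made up, and treating a row as valid only when the whole line matches is an assumption about how the SerDe behaves.

```python
import re

# The same pattern as "input.regex", minus the extra escaping Hive needs.
ACCESS_RE = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)
COLUMNS = ["host", "identity", "user", "time", "request",
           "status", "size", "referer", "agent"]


def parse_line(line):
    """Return a column -> value dict, or None if the line does not match."""
    match = ACCESS_RE.fullmatch(line)
    return dict(zip(COLUMNS, match.groups())) if match else None


# A made-up log line in combined format.
sample = ('192.0.2.10 - - [24/Dec/2012:06:17:25 +0000] '
          '"GET /index.html HTTP/1.1" 200 2326 '
          '"http://example.com/" "Mozilla/5.0 (X11)"')
row = parse_line(sample)
print(row["status"], row["request"])
```

Note that the quoted fields (request, referer, agent) keep their surrounding double quotes and the time keeps its square brackets, so queries against those columns need to account for that.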

Next Steps

Hadoop provides an extremely powerful set of tools to solve very big problems. Pig and Hive are easy to use and very powerful. Flume-NG is an excellent tool for reliably moving data and is extremely extensible. There is a lot I’m not getting into here, like using file-backed or database-backed channels in Flume to protect against node failure and thus increase delivery reliability, or using multi-tiered aggregation with intermediate Flume agents (meaning, Avro Source to Avro Sink)… there are a lot of fun things to explore here. My hope is that I’ve provided you with an additional source of data to help you on your way.

If you start getting serious with Hadoop, I highly recommend you buy the following O’Reilly books for Hadoop, which are very good and will save you a lot of time wasted in trial-and-error:

A Friendly Warning

In closing, I feel it necessary to point out the obvious. For most people there is no reason to do any of this. Hadoop is a Peterbilt for data. You don’t use a Peterbilt for a job that can be done with a Ford truck; it’s not worth the time, money and effort.

When I’ve asked myself “How big must data be for it to be big data?” I’ve come up with the following rule: If a “grep” of a file takes more than 5 minutes, it’s big. If the file cannot be reasonably sub-divided into smaller files or any query requires examining multiple files, then it might be Hadoop time.

For most logging applications, I strongly recommend either Splunk (if you can afford it) or using Rsyslog/Logstash and ElasticSearch; they are far more suited to the task with less hassle, less complexity and much more functionality.

Writing a Better SOP

Tuesday, September 25th, 2012

Within an ops team you should have 3 primary types of governance enablers: controls, policies and procedures. A control is a guiding principle, which is implemented as one or more policies (which are just rules), which are in turn standardized in a set of procedures. It’s important to have all 3: because controls are very vague and policies are often general and broad in nature, we need prescriptive procedures to provide consistent, quality results. At Joyent we call these “Standard Operating Procedures” (SOPs).

The whole point of an SOP is to produce consistent results regardless of who’s using it. That means that all SOPs need to be in a similar, familiar, and easy to follow format that is suitable to anyone who may need to use them. That, therefore, means that to get those consistent results there can be no room for ambiguity; an SOP must be explicit and convey any necessary context along with it. Ambiguity is the mortal enemy of consistency. Case in point: if you’ve ever been asked to recompile software with a large number of configure flags and you’re unable to determine which flags were used in the past, you’ll go cold with anxiety over whether or not you’re building it properly. When you go back and ask how it was built in the past someone might say “Don’t you know how to compile software?” and the answer is likely going to be “Yes I do, but I don’t know how YOU compile software.” What’s important is that the person implementing a procedure be given all the information and context necessary to understand, and if necessary, interpret the information as appropriate for the given situation.

The first key to better SOPs is to provide a template for others to follow. Without a standard template each author will write the procedure in their own unique style. Some people will write you a book, others will just paste some lines from their terminal into a code block. The template therefore must enforce a certain flow that ensures we include all the needed information but in a concise and complete way. Plus, we want SOP creators to focus entirely on writing the content, not debating the format.

Here is the template I use for Joyent Operations SOPs (in Confluence markup):

* Author:  {page-info:created-user}  created at: {page-info:created-date}
* Version: 1
* Revisions: {page-info:current-version}
* Reviewed by: (User @ date)
* Time to implement: 1hr
* Products this applies to: (SKU1)

{toc}

h1. Description & Scope

h1. Prerequisites

* Root access to node
* [SOP-222: Something|SOP-222: Something]
* [SOP-224: Something else|SOP-224: Something else]

h1. Procedure

h3. Step 1: Do this

{noformat}
Example
{noformat}

h3. Step 2: Do that

h3. ...

h1. Procedure Validation

# Login and verify external connectivity (ping google.com)
# curl zone IP address, page returns
# etc.

h1. Notes/Jira Examples

* [http://confluence.atlassian.com/display/DOC/JIRA+Issues+Macro]
* [http://confluence.atlassian.com/display/DOC/JIRA+Portlet+Macro]
* [https://studio.plugins.atlassian.com/browse/CONFJIRA-154]

Let’s step through the above template.

All SOPs must be numbered for easy reference. Even the template itself is SOP-000. The SOP title is in the form: “SOP-102 Creating LDAP Users”, for instance.

The top of the SOP is full of metadata: the author, creation date, major version number, number of revisions made, and the products (or projects or whatever) that this SOP applies to. You’ll notice 2 other fields: “Reviewed By” and “Time to implement”. These are perhaps the most important of all. After an SOP is created it must be reviewed by someone else in the group, preferably with as little knowledge of the subject as possible. They should read and follow the SOP as written, starting a timer when they begin and stopping the timer when they are complete… it is that stopwatch time which becomes the “time to implement”. This is extremely important: the time estimate for implementation by the author will be way too short because they know what they are doing, while the time it takes a complete n00b will be more useful and truthful.

Moving on through the template, “Description and Scope” is where we provide context. What are we talking about, what does it entail at a high level, what does this impact, etc. We want to include as much information as possible to set the stage for the procedure that follows. Then we include a bulleted list of “Prerequisites”. The single most common part of any procedure that gets skimped on is the prerequisites and they are also generally the most time consuming.

The meat of the SOP is the procedure itself. I strongly believe these must be in a “Step 1… Step 2… Step 3” format; it must be easy and intuitive to follow and in some cases may be used as a checklist during sensitive procedures. It’s important that these truly start at the beginning and go to the end. “Step 1: Login to server X” may be overly simplistic but necessary for clarity if multiple machines are involved. I also like to have the final step be “Done” to make it clear that you have reached the end.

Just as important as the procedure is the “Validation Steps”… to ensure a quality job we must not only perform the procedure but validate it in one or more ways to ensure it was really done right. This has the added side effect of giving the person doing the work the satisfaction that it was done properly and they didn’t screw something up along the way.

Lastly is a place to include external links as appropriate. If possible I like to link in tickets (we use Jira) which have relied on the SOP before, so that if by chance there is some confusion they can find examples of the work being done in the past.

An optional section that I’ve used before is a “Rationale”. In this section you would include notes on why you chose to implement the procedure in the way that you did. This allows for continuous improvement of the SOP. In most cases there are many ways to solve a problem; conveying why you chose the method you did will help you hone the procedure in the future while learning from the past. Without it you’re likely to have regression or duplication crop up.

This is the model that we’ve used at Joyent for several years and it has stood the test of time. I believe it to be a very solid standard for writing SOPs, sharing knowledge within the organization, and avoiding any one single person becoming a constraint. If you have refinements or a better method, I’d love to learn about it.

DevOps LA on Aug 27th

Saturday, August 18th, 2012

I’ve been invited to speak at  DevOps LA on Monday, Aug 27th.  The title I’ve chosen is “The DevOps Transformation” but it will not be the talk I gave at LISA. Partly because I’ve already given that talk, and partly because I only have a 20-30 minute speaking slot.  I’ll be looking beyond those fundamental principles and considering some new material, including work flow, routing, and conversion with the LEAN world at large.  I will not discuss ITSM at length again. :)

I’ve also decided this is an amazing opportunity for me as an OpsDad… given that it’s a 24 hour trip for me (OAK to LAX is 1.5hrs flight time) I’ll be bringing my two eldest children (Nova 8yrs & Glenn 7yrs) along. As professionals it’s not often we can bring our children with us on a “business trip”, so I’m looking forward to giving them a really kool adventure. I strongly encourage all OpsDads to include their kids whenever they have the ability to do so.

UPDATE: The event went great, thanks to everyone who attended.  For those wanting the slides, here is the deck: DevOps Demystified

Konsidering Kanban

Monday, June 4th, 2012

Kanban has become an increasingly popular “agile” technique which is considered similar to Scrum and Extreme Programming. David Anderson created the agile form of Kanban from his experience in Japan. In his book he tells the story of his visit to the Imperial Palace Gardens where there is no admission cost, but the flow of visitors is constrained by a stack of kanban (cards): people can enter until the kanban are exhausted, then as people leave and return their kanban someone new may enter. From his experience he translated the technique into project management as a way to visualize and thus constrain the flow of work in progress (WIP).

WIP is the hidden killer of productivity among what Peter Drucker calls “knowledge workers”; most of us know this phenomenon as “multi-tasking”. Multi-tasking is an illusion that a person is doing more than one thing at a time. In reality, most of the time you are in fact context-switching back and forth between multiple tasks very quickly and thus not fully engaged in either. In some cases the consequences of multi-tasking are trivial, such as driving and talking on the phone: you are dividing your mental attention between the conversation and your driving, but this is generally acceptable. However, even in this common case, should either the driving situation or the conversation become serious, the facade of multi-tasking will reveal itself for the lie that it is; your attention can only be truly devoted to one thing at a time. Think of a time you got into a serious discussion while driving… how much of the drive do you actually remember?

Anderson’s adaptation of kanban involves creating a card board by drawing rows and columns into which index cards or post-it notes can be placed.  Each kanban (card) represents work (a discrete task).  The rows  (swimlanes) are workers (a person, a team).  The columns are steps in the value chain representing state and have a work limit associated which limits how many cards may be in any given state at a given time.

In its most simplistic form, as shown above, there are 2 advantages gained: visualization and constraint of WIP. Anyone can easily see how much there is to do, how much has been completed, and what’s in progress, but most importantly we enforce our predetermined WIP limit.

Indeed truly constraining WIP is the strength of this system.  Any manager could create a “no multitasking” policy, but without visualizing the flow of work there is no way to enforce it.
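
To make the WIP limit concrete, here is a minimal sketch of such a board in Python.  The class, method, and column names are made up for illustration and swimlanes are left out; the only point is that a move into a full column is refused.

```python
class WipLimitExceeded(Exception):
    """Raised when a card would push a column over its WIP limit."""


class KanbanBoard:
    def __init__(self, limits):
        # limits maps column name -> maximum cards (None means unlimited)
        self.limits = dict(limits)
        self.columns = {name: [] for name in self.limits}

    def add(self, card, column):
        limit = self.limits[column]
        if limit is not None and len(self.columns[column]) >= limit:
            raise WipLimitExceeded(f"{column} is full ({limit} cards)")
        self.columns[column].append(card)

    def move(self, card, src, dst):
        if card not in self.columns[src]:
            raise ValueError(f"{card!r} is not in {src}")
        # Add to the destination first: if it is full, this raises
        # and the card never leaves its current column.
        self.add(card, dst)
        self.columns[src].remove(card)


board = KanbanBoard({"todo": None, "doing": 2, "done": None})
for task in ["patch db01", "rotate logs", "upgrade dns"]:
    board.add(task, "todo")

board.move("patch db01", "todo", "doing")
board.move("rotate logs", "todo", "doing")
# A third move into "doing" would raise WipLimitExceeded until something is done.
```

The limit lives on the column rather than on the person, which is what makes it enforceable: nobody has to remember the policy, the board simply will not accept the card.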

But you probably already know all this if you’re reading this; there are lots of “intro to kanban” blog posts and videos out there.  What few people know is what kanban is in a LEAN context… where it really came from and what it is intended to do.

Kanban was created by Taiichi Ohno as the primary signaling mechanism of the Toyota Production System (TPS), which enables “Just-In-Time (JIT) manufacturing”.  JIT is the evolution of Henry Ford’s “mass production”, in which work was done in large batches and produced a lot of excess goods, aka waste.  The Toyota Production System’s foundational principle is the reduction and elimination of waste (“muda”), so they decided to build only what they needed, when they needed it.  Obviously there must be a buffer of finished goods inventory (cars waiting on a dealer’s lot to be sold), but they would keep that amount as low as possible.  So now, let me explain kanban again, but this time in the manufacturing (TPS or LEAN) context:

Kanban is Japanese for card.  At the assembly line sits a worker who puts the brake assembly on the car.  The worker has a number of “kits”, plastic containers which each include all the parts needed to build one brake assembly.  The kits have kanban attached to them, so when she picks up a kit, she takes out the kanban and puts it into a tray and then builds the assembly.  Every hour a boy comes around and collects all the “used” kanban from the trays and goes to the parts dept.  The parts dept then refills the kits, attaches kanban to them, and sends them back to the assembly line.  So we’re effectively controlling and signalling the flow of goods in this closed loop. Here is an example of a real kanban:

 

But here is the really important part… the kits are assembled from supplies of parts which have their own kanban, which control their flow.  So when a box of Part B is pulled from inventory to create a kit, that kanban signals the ordering of that part from the supplier.  The result is that instead of buying 200 of Part B every week because it matches our production schedule, we only order as many of Part B as we have kanban signaling us to do so.  At a Toyota plant, they don’t receive parts every week, they receive them every couple of hours!  Kanban proliferated throughout Japan because when you’re using it for JIT, it requires that your suppliers do so as well to keep up with your speed… and their suppliers as well.  Here is a pallet of parts, look for the kanban attached:

So Just-In-Time manufacturing, with signaling from kanban, pulls along the entire supply chain from the final assembly line.  If you aren’t using the parts, they never get ordered… no mistakes.
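
The closed loop described above is easy to model.  The following Python sketch is only an illustration (the class name, the card count, and the hourly demand figures are all invented): a fixed number of cards circulates between the line and the parts dept, and the parts dept can only refill as many kits as cards came back.

```python
class KanbanLoop:
    """A closed loop with a fixed number of cards between a line and its supplier."""

    def __init__(self, cards):
        self.full_kits = cards       # kits at the line, each carrying one card
        self.returned_cards = 0      # cards sitting in the tray awaiting pickup

    def consume(self, wanted):
        """The line uses up to `wanted` kits; each used kit releases its card."""
        used = min(wanted, self.full_kits)
        self.full_kits -= used
        self.returned_cards += used
        return used

    def replenish(self):
        """The supplier refills exactly one kit per returned card."""
        ordered = self.returned_cards
        self.full_kits += ordered
        self.returned_cards = 0
        return ordered


def simulate(demand_per_hour, cards=5):
    """Return how many kits were ordered each hour for the given demand."""
    loop = KanbanLoop(cards)
    orders = []
    for demand in demand_per_hour:
        loop.consume(demand)
        orders.append(loop.replenish())
    return orders


print(simulate([3, 4, 0, 0, 2]))  # [3, 4, 0, 0, 2]
```

Orders track consumption exactly, and in the hours where nothing is consumed nothing is ordered.  No forecast is involved anywhere, and the number of cards puts a hard ceiling on how much inventory can ever exist in the loop.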

Let’s look at how this put Japan on top of the car industry in the 1970s.  In 1973 there was an oil crisis and oil prices soared out of control.  People stopped buying cars.  The American mass production systems continued to build cars, which, thanks to “planned obsolescence” (new cars each year), caused a lot of those cars to go straight into scrap heaps.  American car makers suffered terrible losses, all because the push-based mass production system depends entirely on accurate forecasting; if it’s off you make too few or, as in the 1970s, way too many.  Toyota, however, stopped making cars because people stopped buying cars… and because the supply chain only moves based on the speed of the final assembly line, it stopped too.  Toyota took losses on the small amount of inventory it had in the market, but didn’t spit out more cars until those were sold.  JIT saves the day thanks to kanban!

 

Both David Anderson’s Kanban and LEAN’s Kanban use cards, but I hope you can now see that the goals are very different.  In another blog post I’ll go into more detail on how we can apply LEAN principles to get the most out of your agile implementation.

Policy & Process in the Blood

Saturday, April 14th, 2012

I’m highly introspective… far more than I would actually like to be.  I’m one of those strange individuals to whom if you said “Do you realize you’re being a jerk right now?” I’d actually admit “Yes, I’m sorry about that, I’m trying to find a way to rectify it unsuccessfully.”

Despite that obsessive level of awareness, nothing can tell you more about who you are than your children.  In particular, by observing things your children do that you never taught them, things they just started doing of their own accord because “it seemed right”.  Genetics at work.

I fight frequently with people about documenting processes.  But maybe I’m just anal?  Then the other day my son comes to me and shows me this:

This is Glenn, my eldest son (6 years old).  He wanted some lemonade, but mom and I were busy.  He decided it might help if he simplified his request into a process.  You can see here that we start with a bottle of lemonade, then we pour it into a glass, then WHAMO!  we have our amazingly refreshing beverage to enjoy.  It is the perfect process with an input, output, and processing in the middle.  Brilliant, and he hasn’t even been to business school yet.  How much simpler does process get?

What about policy?  Policy is just a business word for “rules”, nothing more.  In my opinion, the world’s most amazing and effective policy is this one:

That yellow line is policy.  It’s not a brick wall, but we treat it like one.  Thanks to that little bit of paint two cars can drive towards each other at 70 MPH, passing with only 6 ft between them, without fear.  It doesn’t get simpler or more powerful than that.

Parents, and authority figures in general, tend to layer into a child the concept of right and wrong as absolutes. Take the cookie and you shall be punished, so don’t take the cookie! All throughout our culture we do this: define rules and corresponding punishments. The result is a general fear of rules, because they are seemingly there for the sole purpose of justifying punishment.

Any rule, any law, any policy, can be viewed as a guide or as a guillotine. When I asked many of my peers what they thought about policy a surprising number quickly answered “It’s there so that you can fire people.” It’s shocking how many people believe that. One would think that policy is there to enforce lessons learned in the past, as a guide for decision making, pre-computed solutions to problems which might be difficult or conflicting. So then why is it considered simply a justification for punishment? Inconsistency, of course… everyone seems to ignore, discount, or outright disregard policy on a day-to-day basis and it only comes to people’s attention when someone is being called out.

Policy and process are wonderful things. At least, they can be. They are the means by which we share knowledge within an organization. Common tasks, problems, and dilemmas can be quickly handled in a tried and true way, consistent throughout the organization, because we have policy and process. But in order for them to work, there are some ground rules; if you don’t follow them, they are doomed to be the millstones of frustration most of us see them as:

  • They need to be simple and straight-forward for the average employee.
  • They need to be indexed, so that they can be easily found.
  • They need to be relevant to the business, not just copied from someone else.
  • They need to be consistent, so that they do not contradict each other.
  • They need to be helpful and solve real problems.
  • They need to be up to date. Old policy and process can be worse than none at all, because people are unsure of their reliability and may waste time debating a course of action, which is exactly what process and policy should speed up.

The last point is the hardest. Knowledge management is still something we’re shitty at. Wikis have helped a lot over the last decade by making everything searchable and empowering everyone to update documents quickly and easily. But the fundamental problem is that of scaling. Not scaling the infrastructure but the human mind. Many a sci-fi story has depicted the person who desires to know everything and, when the wish is granted, their head promptly explodes in one way or another. In many large companies when you hire on you’ll receive a book or binder with all the company policies… did you read it? Of course not: tl;dr.

Thus, what we’re really talking about here is culture. Genetics. Your children get them from you in the blood, but in a company we must teach them to others through words and actions. Preferably when employees are new, through on the job training/mentoring/tasking. Will you ignore policy and process? If you don’t care, they are likely useless crap anyways, and everyone can fend for themselves and hopefully get it right. But what if instead they were useful, and they were a reference available to simplify life? You don’t read the dictionary, but you know that it’s there and handy when you need it… so should be process and policy.

I feel passionate about these things because I hate to see employees stressed out because they aren’t sure what to do or how to do something. Useless anxiety. Wasted energy. Muda. I see managers beat on their people for not knowing… but whose fault is it really? There are hard problems in the world; let’s focus the energy on new problems and codify what we’ve learned in the past for everyone to benefit from. This is the nature of continuous improvement… building a collective body of corporate knowledge and continuously expanding, refining, and even replacing it when appropriate.

LISA Keynote 2011: The DevOps Transformation

Friday, December 16th, 2011

Last week I was given the incredible opportunity to not only speak at LISA but to deliver the opening keynote.  I hadn’t expected to even go, but when I learned the topic was DevOps I made a last minute plea on the eve of the submission deadline for a slot to deliver a talk I was calling “The 60 Minute MBA”, a history of Operations Management.  My hope was that I could get some obscure timeslot so a handful of people could geek out with me on Operations Management and LEAN and how it is helping to fuel and direct a lot of the DevOps thinking out there.  To my great shock I was told I was given the keynote slot… frankly, something I didn’t want for fear of the stress associated with it, but Tom felt I should step up and that I’d do great.

I haven’t blogged much in the last year and when I have it’s been on topics you probably wouldn’t expect from a “Solaris blogger”.  I’ve held back most of what I want to talk about and only let the cream rise to the top.  My already frantic reading backlog only intensified as I was trying to pack as much into my talk as possible and ensure I was accurate.  Everything I read, watched, attended or did was reshaping my talk and I essentially spent 6 months “on stage” in my mind.  The problem I really had was that I had maybe 6 hours of content that I needed to condense into a 1 hour slot, hitting the essentials but not diluting its potency.  And, of course, I’m still learning every day.  Only 2 weeks prior to my talk did I finally hammer out a rough slide deck and I then had to keep pushing it around into something moderately cohesive.  Trying to find ways to address wisdom, systems thinking, agile, lean, TPS, OM and OR, and tie all this back to DevOps was a challenge.

To make things more challenging, Tamarah’s (my wife, seen above) due date for our 5th child is the 14th of Dec and the talk was to happen at 9:30AM Eastern time, which is 6:30AM Pacific and I’m not a morning person.  So… all things considered, I did pretty well, but you will notice in my talk that I was a little slower than I normally would be.  The upshot, however, was that I didn’t ramble much which kept me on my time marks.

What was interesting to me was what different people walked away with. Some people really keyed in on the value chain and asking “Why?”. Others wanted to rediscover ITIL because it was the first time they had heard it didn’t suck. Others got interested in operations management and LEAN, something they’d heard of but didn’t know where to start learning more. Others keyed on the collaboration of devops and bringing teams together. There was, I think, something for everyone and I didn’t hear any negative feedback on that talk beyond some people liking some parts and not caring about others… and it was designed that way.

Two things I want to note for viewers. First, when I said “by men I mean the human race”, I should have better explained that I think of “men” in a JRR Tolkien sense, the “race of man”. Secondly, at the very end I bagged on Sun TechPubs… I didn’t really explain myself and someone took offense to it. The fault was not on Sun’s writers, but rather on the engineering managers who wouldn’t permit writers the access to engineering they needed, so TechPubs was left to figure it out themselves. The fault was squarely on the engineering managers, NOT on the writers. Given the circumstances they have always turned out amazing documentation and I have nothing negative to say about the writers (as I noted in my answer, I wanted to be one at one time).

Anyway.  The following is the keynote, the slides can be found here.


 

I referenced a lot of books, and many have asked for the list of books, so here it is.

Please note! I do not profit from any of this in any way, I’m not getting a book kickback or whatever.  My only source of income is my Joyent salary.

The Essential Books you should read to put DevOps, ITIL/ITSM, LEAN and Operations Management into perspective and educate yourself for the future:

  1. The Visible Ops Handbook: Implementing ITIL in 4 Practical and Auditable Steps
  2. Any Operations Management textbook
  3. Web Operations: Keeping the Data On Time
  4. Lean IT: Enabling and Sustaining Your Lean Transformation

The Advanced Books you can read to dig behind the ideas, this is my “Best Of” list:

  1. My Philosophy of Industry & Moving Forward, Henry Ford
  2. Today and Tomorrow, Henry Ford
  3. The Principles of Scientific Management, Taylor
  4. The Toyota Production System, Ohno
  5. Out of the Crisis, Deming
  6. The New Economics, Deming
  7. Management Challenges for the 21st Century, Drucker
  8. The Goal, Goldratt
  9. Critical Chain, Goldratt
  10. Creating the Corporate Future, Ackoff
  11. Future Shock, Toffler

One book mentioned in my talk that I do not own, nor have I read, is Lean Startup by Eric Ries, which is based largely on The Four Steps to the Epiphany, a book I did buy at the MIT Press bookstore after my keynote. “Lean Startup” is popular, but all he’s really doing is applying LEAN concepts and Agile methodologies to the startup. There are hundreds of “Lean XYZ” books. I am personally interested in the real deal, not books about other books. “LEAN IT” is my one exception because it can be a big time saver and I feel it gives proper credit to the history and sources of the ideas it espouses.

Finally, rather than give you a “fire hose” list of everything, I’ll simply include a picture of what I feel is a very complete library on these various topics.  The handful of books missing from these shelves are PDFs on my iPad such as the official “ITILv3 2011 Update”, several books on Engineering Systems, etc.  Click the image to see it high-res.

 

Using Graphite to Graph DTrace Metrics: Part II

Monday, November 14th, 2011

In a previous entry I described Graphite and gave an overly simplistic example of integrating it with DTrace… let’s get a little more serious and see what fun we can have.

For years a problem nagged at me.  I wanted to get really fine grained latency information from an NFS server to track user experience.  This isn’t an easy thing to do, especially for hundreds of exports.  First off, you have to use DTrace to get that kind of data; there isn’t really any other way to find per operation latency on a per export basis.  Secondly, writing all that data into local RRDs is a massive I/O problem in its own right.  Thirdly, graphing the data once it’s in RRD isn’t hard, but creating summary “rollup” graphs (ie: all instances of a metric in a single graph) requires writing scripts that mush all the individual RRDs together, which of course is a pain in a dynamic environment.  And that’s just for starters.  When you dig deeper into this problem you just find other, smaller, problems.

Many solutions were tried but only Graphite made the final cut.  In particular: it’s network based with no agents; databases are dynamically created, so if new instances come or go the system simply adapts and no administration is required; and, most importantly, Graphite creates graphs based on a simple “URL API”.  This all means that we can dynamically add metrics to Graphite and just as easily dynamically graph them, and that means we can get maximum power out of DTrace’s ultimate weapon: the aggregate!

So our goal is to create DTrace scripts which output aggregates that are then transported into Graphite. There are many ways to do this, but I really wanted something that could be controlled by a single script and managed via SMF. After several iterations I arrived at a solution using Perl that forks dtrace scripts which feed data via STDIN to a helper script to parse and transmit the data to Graphite. Let’s look at each piece.

First, the control script. This simply forks the DTrace scripts and pipes STDOUT to STDIN of the helper script.

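#!/usr/perl5/bin/perl
#
# Control script for Per-Export NFS Latency Graphite Metrics
#
use strict;
use warnings;

my @SCRIPTS = ('read', 'write');

foreach my $i (@SCRIPTS) {
        my $pid = fork();
        die("fork failed: $!\n") unless defined $pid;
        if ($pid == 0) {
                # Child: become the "dtrace | helper" pipeline for this I/O type
                exec("./nfsv3-latency.d/nfsv3-${i}-latency.d | " .
                     "./nfsv3-latency.d/graphite-nfsv3-assist.pl ${i}")
                    or die("exec failed for $i: $!\n");
        }
        print("Forked PID $pid for $i I/O\n");
}

# Parent: stay alive until both pipelines exit
1 while wait() != -1;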

The DTrace scripts are very simple: we trace entry and return of rfs3_read (the server side function for processing NFSv3 reads) and load the export path and latency in ms into an aggregate. Every 10 seconds, we output and clear the aggregate.

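#!/usr/sbin/dtrace -s

#pragma D option quiet

/* Record the start time and export path when an NFSv3 read begins. */
rfs3_read:entry
{
        self->time = timestamp;
        self->start = 1;
        self->export = stringof(args[2]->exi_export.ex_path);
}

/* On return, add the elapsed time (in whole ms) to the per-export average. */
rfs3_read:return
/self->start == 1/
{
        this->elapsed   = timestamp;
        this->ms        = (this->elapsed - self->time)/1000000;

        @read[self->export] = avg(this->ms);

        /* Release the thread-local variables so they do not accumulate. */
        self->start = 0;
        self->time = 0;
        self->export = 0;
}

/* Every 10 seconds, print the aggregate and then clear it. */
tick-10sec
{
        printa(@read);
        trunc(@read);
}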

The write DTrace script is the same, just substituting in “write” instead of “read”.

Now for the helper script that parses the aggregates and transmits the data to Graphite. Here we create a TCP session to the Graphite server, parse the STDIN into its 2 components, which in this case are export path and latency, then do some sanity checking to make sure data looks correct and finally send the key/value pair to the Graphite server:

#!/usr/perl5/bin/perl
#
# GraphiteAssist v0.1
# 
#
# The primary purpose is to provide a way
# for DTrace Aggregates to be injected into Graphite
#

use IO::Socket;

## Default Values:
my $GRAPHITE_SERVER = "graphite.server.com";
my $GRAPHITE_PORT   = 2003;

if ( ! $ARGV[0] ) {
        die("USAGE: $0 <metric>\n");
}
my $METRIC = $ARGV[0];

my $HOSTNAME = `hostname`;
chomp($HOSTNAME);

## Prep the socket
my $sock = IO::Socket::INET->new(
    Proto    => 'tcp',
    PeerPort => $GRAPHITE_PORT,
    PeerAddr => $GRAPHITE_SERVER,
) or die "Could not create socket: $!\n";

while (<STDIN>) {
  chomp($_);
  $_ =~ s/^\s+//; # Trim any leading whitespace
  my ($EXPORT,$VALUE,$OTHER) = split(/\s+/, $_, 3);

  ### Sanity check on the input data
  if ($OTHER) {
       # print("I got some other crap here: $OTHER (Input: $_)\n");
        next;
  }
  if ($EXPORT !~ m/\w+/) {
       # print("Export looks wrong: $EXPORT (Input: $_)\n");
        next;
  }
  if ($VALUE !~ m/\d+/) {
       # print("Value looks wrong: $VALUE (Input: $_)\n");
        next;
  }

  my $KEY = "joyent.${HOSTNAME}.exports.${EXPORT}.latency_${METRIC}";

  my $DATE = time();
  #print("Sending: $KEY $VALUE $DATE\n");
  $sock->send("$KEY $VALUE $DATE\n") or die "Send error: $!\n";

}
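
If you want to test the parsing and key-building logic without a Graphite server, it helps to pull it out into plain functions.  Here is a rough sketch of the same idea in Python rather than Perl; the function names are mine, and the value check is deliberately stricter than the regex above (it accepts only whole numbers).  Each metric is sent as one "path value timestamp" line, exactly as the Perl helper does.

```python
import socket
import time


def parse_line(line):
    """Split one line of printa() output into (export, value); None means skip it."""
    fields = line.split()
    if len(fields) != 2:
        return None
    export, value = fields
    if not value.isdigit():
        return None
    return export, int(value)


def format_metric(host, export, metric, value, timestamp):
    """Build one 'path value timestamp' line for Graphite."""
    key = f"joyent.{host}.exports.{export}.latency_{metric}"
    return f"{key} {value} {int(timestamp)}\n"


def send_lines(lines, host, metric, server="graphite.server.com", port=2003):
    """Parse each input line and send the valid ones over a single TCP connection."""
    with socket.create_connection((server, port)) as sock:
        for line in lines:
            parsed = parse_line(line)
            if parsed is None:
                continue
            export, value = parsed
            payload = format_metric(host, export, metric, value, time.time())
            sock.sendall(payload.encode())
```

In use this would be fed from standard input, e.g. `send_lines(sys.stdin, socket.gethostname(), "read")` after importing `sys`; the two pure functions can be exercised on their own.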

There you have it. We can take it a step further by controlling this via SMF, but I’ll leave that part as an exercise for the reader.

The scripts above are somewhat crude but they demonstrate the pattern here. You can use it to graph anything that DTrace can see, which is… everything. I’ve used this same pattern for monitoring VFS latency on a large scale, as well as MySQL query latency, and various types of throughput.

It’s the Graphite URL API that really makes this powerful, because I can glob for keys. For instance, the following URL would render ALL export latency (read/write for each export) for the last 1 hour. (This is a single URL, but I’m breaking it apart a bit to make the various arguments passed to render clear.)


http://graphite.server.com/render/?

 width=800&height=600&
 target=joyent.nfs-server.exports.*.*.*.latency_*&
 tz=utc&
 from=-1hours
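
Because the graph is just a URL, building one from a script is trivial.  A small Python sketch, assuming the same server name and parameters as above (the function name and its defaults are mine):

```python
from urllib.parse import urlencode


def render_url(base, target, hours=1, width=800, height=600, tz="utc"):
    """Assemble a Graphite render URL for one (possibly globbed) target."""
    params = [
        ("width", width),
        ("height", height),
        ("target", target),
        ("tz", tz),
        ("from", f"-{hours}hours"),
    ]
    # safe="*" leaves the glob wildcards readable instead of encoding them as %2A
    return f"{base}/render/?" + urlencode(params, safe="*")


url = render_url(
    "http://graphite.server.com",
    "joyent.nfs-server.exports.*.*.*.latency_*",
)
print(url)
```

Changing the glob or the time window is then a matter of changing one argument, which is what makes dashboards on top of this so cheap to build.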

DTrace is a fabulous means of obtaining hard to get data, and Graphite is a fabulous means of graphing hard to graph data… combined they can accomplish almost anything.

Nothing New Under the Sun: An Introduction to Operations Management (OM)

Thursday, July 21st, 2011

8 All things are full of weariness;
a man cannot utter it;
the eye is not satisfied with seeing,
nor the ear filled with hearing.
9 What has been is what will be,
and what has been done is what will be done,
and there is nothing new under the sun.
10 Is there a thing of which it is said,
“See, this is new”?
It has been already
in the ages before us.
11 There is no remembrance of former things,
nor will there be any remembrance
of later things  yet to be
among those who come after.

 

Ever been irritated by the subtle but constant reference by Agile and DevOps people to manufacturing?  You may not even realize they are doing it, but you’ll hear reference to a book called “The Goal”, quotes from Deming, analogies to factories, etc.  In many conference talks I could feel that there was some larger body of knowledge that speakers were alluding to, but not fully describing.  What was this secret knowledge?  Last year I finally stumbled upon the answer and I’ve been consumed by it ever since… long time readers of my blog will note a considerable change in tone and subject since Dec of last year.

This secret body of knowledge that is all around you, but not directly named is “Operations Management” (OM).

Classically, it is said that a company is made up of 3 primary organizational divisions: Finance, Marketing (which includes Sales), and Operations.  Finance handles the books and internal resources, Marketing brings the market to the company and sells its products to that market, and Operations is the part of the company that does what your company does.  This is an overly simplistic model, but it makes a complex organization easier to grok.  If you run a hot dog stand, “operations” refers to ordering hot dog stuff, making hot dogs, serving customers, etc.  If you make cars, “operations” refers to managing the supply chain, operating the assembly line, and delivering cars to dealers.  If you run a web site, “operations” refers to the developers and sysadmins who make the product, run it, etc.  So again, the model breaks down to bean counters, sellers, and makers/doers.

Have you ever thought about getting an MBA?  I have.  Except, when I looked at the curriculum my eyes somehow danced right over OM, because I didn’t know what I was looking for.  Now I know.  You can examine the OM departments at Harvard Business School and MIT Sloan.  As with so many things today, the first step to knowledge is knowing what to look for; if you don’t know what it’s called you can search until you’re blue in the face and find nothing of real value.

My journey really took off when I found, at Church of all places, a donated textbook entitled Fundamentals of Operations Management (4e).  “WOW!” I thought, “that’s what I’ve been looking for!”  One look at the table of contents and I knew I’d stumbled onto the elusive body of knowledge I’d sought for so long:

  1. Introduction to Operations Management
  2. Operations Strategy: Defining How Firms Compete
  3. New Product and Service Development, and Process Selection
  4. Project Management
  5. The Role of Technology in Operations
  6. Process Measurement and Analysis
  7. Financial Analysis in Operations Management
  8. Quality Management
  9. Quality Control Tools for Improving Processes
  10. Facility Decisions: Location and Capacity
  11. Facility Decisions: Layouts
  12. Forecasting
  13. Human Resource Issues in Operations Management
  14. Work Performance: Measurement
  15. Waiting Line Management
  16. Waiting Line Theory
  17. Scheduling
  18. Supply Chain Management
  19. Just-in-Time Systems
  20. Aggregate Planning
  21. Inventory Systems for Independent Demand
  22. Inventory Systems for Dependent Demand

Jackpot!  If more than half of those chapters don’t seem pertinent to IT departments, then you’ve never tried to manage one.  The focus may be slightly different, but the core issues, problem domains, and related disciplines are essentially identical.  This explains why so many “experts” are making reference to OM, knowingly or unknowingly: in manufacturing they dealt with, in essence, the same problems we have in IT.  The Web companies (Twitter, Facebook, Flickr/Etsy, etc) are the ones leading the charge because, more than traditional IT organizations, they really do look like the factory floor producing a single line of products.

So now… now I know what questions to ask.  And ask I did.  This opened up a whole new world to me that was right under my nose.  The Toyota Production System (TPS) which became known in the US as “Lean”… W. Edwards Deming and Total Quality Management (TQM)… ISO-9001…. the undertones of ITIL, CobiT, ISO-27001, and Agile…. it all came together and made sense for the first time.

This sent me into an epic journey as I sought out book after book after book by the cornerstone individuals of OM, because they all wrote books that formed the modern body of knowledge.  I now own all of Henry Ford’s books, Shigeo Shingo’s books, Taiichi Ohno’s books, W. Edwards Deming’s books, Walter Shewhart’s book, Frederick Winslow Taylor’s book, Ludwig von Bertalanffy’s books, Peter Drucker’s books, and on and on and on.  I couldn’t stop buying and reading these texts that describe the world we find ourselves in today, shaped by the work they did so long ago.  All these points in my head started to be connected, one by one, and a fabric of knowledge appeared.

Friends, the point is this: there is nothing new under the sun.  Things change, evolve, and morph, sure, but the principles are not new.  If they were, we wouldn’t look back at Plato and Aristotle as wise today; much of what they debated 2400 years ago is still pertinent.  So it is with Agile and DevOps: the core principles have been well explored and addressed in the last century of manufacturing as part of Operations Management.  We only need adapt that knowledge, and the “experts” are doing exactly that.

Consider an example.  As a consequence of the innovations Ohno was introducing at Toyota in building the Toyota Production System (TPS, aka Lean), and in particular Kanban (the basis of Just-in-Time production, which is pull rather than push based production), he needed a way to speed up the “changeover time” (setup time) of large pressing machines.  These machines contain “dies” which press sheet metal into, say, a car door.  The changeover time could be as much as 6 hours… that means, when you decide to stop making part A and want to make part B, you have to shut down for 6 hours to set up the machine for the new part before starting production again.  The way this was typically handled was to simply make a shitload of parts to build up a big inventory so that you reduced the likelihood of needing to do another setup.  They were after local efficiency (what the “Theory of Constraints” calls local optima) at all costs.  This mass production method wasn’t going to work in Ohno’s new just-in-time world; with such long changeovers, the idea of stamping out only 20 parts and then changing over to make another part was completely idiotic.  At least, it was until he put Shigeo Shingo on the job.  It took Shingo years to make it happen, but ultimately he created a method known as “Single Minute Exchange of Dies” (SMED).  With his method you can change dies in less than 10 minutes (single-digit minutes, not 60 seconds).  This was the breakthrough that Ohno needed to make Kanban really work… and work it did.  Without SMED, a technology approach, to complement Ohno’s other methods (Kanban, 5S, 5W, Andon, Muda, etc), Toyota just wouldn’t have been the industrial revolutionary that it became.
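
To put rough numbers on that, here is a back-of-the-envelope calculation in Python using only the figures from the example (a 6 hour changeover, a sub-10-minute changeover, and a run of 20 parts).  The function names are mine, and real press economics obviously involve more than setup time.

```python
import math


def setup_minutes_per_part(changeover_minutes, batch_size):
    """Changeover time spread across the parts made before the next changeover."""
    return changeover_minutes / batch_size


def batch_size_for_overhead(changeover_minutes, target_minutes_per_part):
    """Smallest batch that keeps per-part setup overhead at or below the target."""
    return math.ceil(changeover_minutes / target_minutes_per_part)


BATCH = 20  # the small run from the example

before_smed = setup_minutes_per_part(6 * 60, BATCH)  # 6 hour changeover
after_smed = setup_minutes_per_part(10, BATCH)       # 10 minute changeover

# Run length needed to match the post-SMED overhead with the old changeover time
old_batch = batch_size_for_overhead(6 * 60, after_smed)

print(before_smed, after_smed, old_batch)  # 18.0 0.5 720
```

With a 6 hour changeover, a 20 part run costs 18 minutes of setup per part; at 10 minutes it costs 30 seconds.  To get that same overhead the old way you would need runs of 720 parts, which is exactly the big-inventory habit described above.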

Now, why the hell am I telling you all that?  Look at what cloud did to IT.  Just like Kanban, Cloud came along and showed us that our setup times are way too long, and changeover from one type of setup to another was awful.  Configuration Management tools (CFengine, Chef, Puppet, etc) are the SMED of our industry.  Same problems, same needs, different solutions, but similar approaches.  There is no reason for us to re-invent all the wheels; a lot of these issues are solved problems, if you just know where to look and what questions to ask, and have an open mind.

If you are like me and have been looking for something, but you know not what, go find yourself a book on Operations Management and get your journey started.  You’ll have a massive head start over all your peers who won’t figure this out for another couple years (just as others already got a head start over us).

Three Aspects of DevOps: What’s in a word

Friday, June 24th, 2011

Cloud.  DevOps.  Both are in the fad category, but both are very popular and everyone is grasping at what they really are.  There is a subtle difference, however.  “Cloud” is ambiguous, which leads to the never-ending line of questions, from “What is Cloud?” to, as the concept evolves, “If cloud means in the cloud, then isn’t private cloud an oxymoron?”  DevOps, on the other hand, seems deceptively intuitive.  This has caused confusion in the ongoing conversation because different people mean different things.

I see it as 3 distinct definitions and I’m going to lay them out to help people start to refine their thinking.  To facilitate this I’m going to take dev and ops as keywords and add an operator.  The operator determines which methodology is adopted by the other camp.

Now let me go a step further and suggest that these are not simply aspects of devops, they are in fact the 3 phases of what is collectively the “devops transformation”.

Phase I: Dev > Ops

In this phase development methodology and mentality are adopted by operations.  My estimate is that this represents about 90% of the devops movement.  This is where the DevOps movement started and where most of its focus is today.  Several things happen in this phase:

  • IT groups and systems/network administrators re-realize themselves as “operations”.  Let us not forget that this is a new concept to many people; they don’t think of themselves as “operations”, they think of themselves as IT.  If you’re running a website this is a fairly natural fit, but for traditional IT groups this is an area of contention in and of itself.  In the enterprise space you’re not “operating” a website, you’re “operating” a business.
  • Agile is slowly adopted and adapted into operations.  In many cases this means stripping agile down to its first principles and its Lean roots.  It’s a matter of taking existing practices such as ITIL and marrying them with agile principles.  This is slow and individually tailored to each company, as many folks have found that things like SCRUM don’t work for ops, but visual workflow and control of work in progress (meaning, the inappropriately named “Kanban”) do.  Finding balance between Peter Drucker’s “doing things right and doing the right thing” takes time.
  • Re-tooling for the virtualized world.  I could say “cloud world” but that’s inaccurate, since the problems are the same if you have a large internal VMware deployment or an external AWS deployment.  This is where most of the action has been in DevOps so far and what the DevOps Toolchain Project has been about.  This is the draw, in particular, to configuration management (in the automation sense, not the ITIL one) and is helped along by the 3 companies really driving the publicity of DevOps, those being Puppet Labs, Opscode and DTO Solutions.
  • Monitoring gets kicked up a gear.  Just as virtualization causes you to re-evaluate your tools for configuration management and command-and-control (now being called “distributed orchestration”) your monitoring needs to step up to the new challenges as well.  This is where you will challenge your existing monitoring system, expand its functionality and re-consider your logging and trending strategies.  Maybe everything is up to snuff already, but with all the recent additions to the alerting/logging/trending category you’ll inevitably try some new things and get over your fear of using tools written in Ruby. :)
  • etc.

Phase II: Dev < Ops

In this phase operations methodology and mentality are adopted by developers.   My estimate is that this represents less than 10% of the devops movement.  This phase generally represents the bonding of the two groups and is easily confused with Phase I.  Things that happen in this phase are:

  • Metrics everywhere.  The collection of metrics from everywhere is something championed by John Allspaw.  In Phase I you may have started collecting metrics, but they were by Ops for Ops.  In this phase Dev is actually interested in the metrics and they are more business focused, so metrics aren’t just coming from the OS but are also coming from application code.  This is where dashboards start to be created to facilitate the wide absorption of metrics.
  • Continuous Integration is implemented or evaluated.  Ops alone can’t implement CI, so inevitably cooperation forms around it.
  • Cross-training of tools and practices.  This is when developers take a genuine interest in day-to-day operations activities and challenges, and start aligning the toolsets between both groups.
  • etc.
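
To make “metrics coming from application code” a little more concrete, here is a minimal sketch in Python.  Everything in it is an assumption for illustration: the Metrics class, the place_order function and the metric names are hypothetical, and a real shop would more likely ship these numbers to whatever trending system it already runs rather than keep them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Tiny in-process metrics registry (hypothetical, for illustration only)."""

    def __init__(self):
        self.counters = defaultdict(int)   # name -> running total
        self.timings = defaultdict(list)   # name -> list of durations in seconds

    def incr(self, name, value=1):
        """Add value to the named counter."""
        self.counters[name] += value

    @contextmanager
    def timer(self, name):
        """Record how long the wrapped block took, even if it raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name].append(time.perf_counter() - start)

    def snapshot(self):
        """Return a plain dict that a dashboard or collector could poll."""
        snap = dict(self.counters)
        for name, samples in self.timings.items():
            snap[name + ".count"] = len(samples)
            snap[name + ".total_seconds"] = sum(samples)
        return snap


metrics = Metrics()


def place_order(cart_total):
    """Hypothetical business function, instrumented by the developers who wrote it."""
    with metrics.timer("orders.place"):
        metrics.incr("orders.placed")
        metrics.incr("orders.revenue_cents", int(round(cart_total * 100)))
        return True
```

The point of the sketch is who owns the numbers: orders placed and revenue are business metrics that only the application knows about, so they have to be emitted by Dev, while Ops supplies the plumbing that collects and graphs the snapshot.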

Phase III: Dev <> Ops

In this phase developers and operations unify, sharing responsibilities and practices.  I think this is an underlying principle of the movement, and frankly is what DevOps really is about.  This is the magical destination of your journey, a far country where Adam Jacob rides unicorns and children spend time with their OpsDads and everyone sings drinking songs together at the pub.  The DevOps movement is so concentrated on Phases I and II that this is still an uncrystallized space, but it is what you are driving towards.  The things that emerge in this phase are:

  • Fully shared responsibility in a “no finger-pointing” environment.  Dev may build it, Ops may deploy and maintain it, but both parties are fully committed to success, and personal responsibility doesn’t end after the code is committed or deployed or whatever.  This is where post-mortems involve both teams and where performance problems are solved in dev/ops pairs working together.  When things come up, you don’t just re-assign the ticket, you get together and work it as a single unified team.
  • Developers are on-call.  This is popularized by WebOps shops and is not directly applicable to enterprises, but the principle still applies.  There should be an on-call rotation in development for problems which may be attributed to code they’ve written, and developers should fully accept themselves as capable first responders.
  • Integrated Continuous Improvement.  At this stage there aren’t 2 teams anymore, there is one large interdisciplinary team of professionals.  New tools, new practices, new projects, etc. are presented to the whole team whether coder or sysadmin, so that both groups can bring their capabilities to the table as we continually improve.  Just as in the Theory of Constraints (TOC), we do not want Inertia (i.e. “the new status quo”) to slow us down by making us complacent.
  • etc.

Framing the Conversation

Are you in the midst of a “DevOps Transformation”?  More likely the vast majority of you are just modernizing your existing operations practices and tools.  Perhaps your usual weekly dev and ops meetings have simply been renamed “DevOps Weekly Meeting”.  You might think that Phase III is impossible in your environment, but there are some interesting things in Phases I and II.  Everyone is in a different place, but there is a natural and inevitable progression here.  The first step on that road is changing the culture, and the DevOps movement has caused that to happen.

From this model you can see why using “DevOps” as a title or job description is problematic.  Which of the 3 do you mean?  Do you mean a sysadmin in Phase I who is up on the new wave of operations tools and practices?  Do you mean a developer in Phase II who is using feature flags and continuous integration?  Do you mean any IT worker who is culturally savvy about working together and serving the common business needs?  Who are you talking about?

The parts that make up what “DevOps” is are not new.  What’s new is the culture shift that is being accepted in places where it wasn’t previously.  In years past employees that tried to “cross the lines” would often be beaten down as being overly nosy or over-achievers or whatever.  DevOps may be a fad, just as cloud was, but it’s opened up a world of possibilities that were previously closed to us.  Once using AWS was unthinkable, now it’s almost expected.  Once trashing Tivoli for an Open Source solution was crazy, now it’s welcomed.  Once asking dev for metrics in code was laughed at, now it’s applauded.  So embrace this time in the history of our industry and seize the new opportunities by keeping that conversation alive.