Archive for April, 2011

The Joy of Non-Functional Requirements

Saturday, April 30th, 2011

ITILv3’s “Service Design” book, in section 5.1 on Requirements Engineering, defines three types of requirements:

  • “Functional requirements are those specifically required to support a particular business function.”
  • “Management and operational requirements (sometimes referred to as non-functional requirements) address the need for a responsive, available, and secure service, and deal with such issues as ease of deployment,
    operability, management needs and security.”
  • “Usability requirements are those that address the ‘look and feel’ needs of the user and result
    in features of the service that facilitate its ease of use. This requirement type is often seen as part of
    management and operational requirements, but for the purposes of this section it will be addressed separately.”

Later, in section 5.1.1.2, several categories of these “Management and operational requirements” are presented, including manageability, efficiency, availability and reliability, maintainability, security, controllability, measurability, and reportability.

Non-Functional Requirements (NFR) are generally equivalent to operational “technical debt”.  Every organization has some amount of this debt.  That debt can have practical explanations, such as a lack of resources (i.e., staff, expertise, cash, etc.), or simply be a result of geek perfectionism; after all, there is always something more that can be done.

I like the phrase “Non-Functional Requirements” because it adequately sums up the life of a sysadmin.  Your job is generally to identify and implement all the “things” that need to be there but that no one really cares about until there is an emergency.  Backups, security, monitoring, synchronization, performance, capacity planning… burdens that many managers don’t want to be bothered with until it’s too late.

The news is buzzing about two examples of NFR biting companies in the butt… in particular the PlayStation Network (PSN) outage and the Amazon Web Services (AWS) EBS outage.  Both examples are easy to criticize, but consider what technical debt you have in your infrastructure.  Do you actually have a list of it all?  Do you review your infrastructure for NFRs on an ongoing basis?

In the context of DevOps, you see a natural divide in requirements: dev tends to be concerned with functional requirements, meaning what the solution does or does not do.  Ops is then left to attend to all the various NFRs after the fact, sometimes with very little guidance.  One thing we’ve heard from a great many DevOps gurus is that Ops needs to be involved in development projects from day one… why?  NFR.

What gives me pause is that I’m certain that at both AWS and PSN there was at least one person who had a “told ya so” moment when disaster struck.  Rarely do things like this happen where everyone was completely blindsided by the event.  Which is why NFRs are a key focus of Risk Management.

ITILv3 defines Risk as: “A possible event that could cause harm or loss, or affect the ability to achieve Objectives.  A Risk is measured by the probability of a Threat, the Vulnerability of the Asset to that Threat, and the Impact it would have if it occurred.”

What bothers me is that ITIL’s, and indeed most people’s, definition of “risk” differs from the classical definition of Risk Management, which is to analyze all potential outcomes of a given decision, good or bad.  In IT, at least according to ITIL, we seem to over-focus on the negative.  Webster defines risk as “possibility of loss or injury”, so ITIL isn’t wrong, but it may blind us to potential win-win outcomes.

The life of a sysadmin is the joy of non-functional requirements.  Those things that aren’t sexy, aren’t exciting, but are requirements nonetheless.

I implore you all, if you take any one thing from the DevOps movement, to get operations involved in product development early so that you can solidify NFR from the beginning.  No employee should be burdened with having to make a personal decision about how often something is backed up or what is or isn’t monitored.  Combine functional requirements with non-functional requirements, do risk analysis and craft from that an SLA at the outset… because that’s when people are most likely to care and have their minds in the right place.

Lastly, if you do do this, include as many people from the ops team as possible.  If only an ops manager is involved you are going to cut off a lot of potentially valuable feedback early, when you need it most, and you may have a very hard time motivating your ops team to get all those NFRs implemented.  Sysadmins are almost never without something to do, so giving them a motivating sense of urgency isn’t optional.  That’s how that technical debt accumulates.  Just because something is important doesn’t mean it’s important “to me”, so it gets put off and off and off… and then you’re on the front page of the Wall Street Journal.

ZFS Backup & Recovery using Hadoop HDFS

Monday, April 18th, 2011

Hadoop HDFS has essentially become the de facto standard in cluster file systems.  In theory I’m a big fan of Lustre; I say “in theory” because it never got ported to Solaris, despite the fact that Sun bought Lustre.  But that’s a different story.  HDFS is extremely portable and well supported by a thriving community who are doing anything you can imagine with it.

Consider a large cluster of production nodes.  They almost certainly have unused disk space, and it’s probably pretty fast disk.  Wouldn’t it be nice if we could aggregate all that disk together for backups?  With HDFS we can.  Setting up HDFS is pretty well documented (I can write it up for Solaris users if there is demand, but it’s pretty clear), so let’s see how easy it is to get ZFS backups in and out of HDFS.

Here is a DFS report from my HDFS test setup (single node):


root@newton hadoop$ hadoop dfsadmin -report
Safe mode is ON
Configured Capacity: 2942589468672 (2.68 TB)
Present Capacity: 1019145412608 (949.15 GB)
DFS Remaining: 1017809015808 (947.91 GB)
DFS Used: 1336396800 (1.24 GB)
DFS Used%: 0.13%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)

Name: 127.0.0.1:50010
Decommission Status : Normal
Configured Capacity: 2942589468672 (2.68 TB)
DFS Used: 1336396800 (1.24 GB)
Non DFS Used: 1923444056064 (1.75 TB)
DFS Remaining: 1017809015808(947.91 GB)
DFS Used%: 0.05%
DFS Remaining%: 34.59%
Last contact: Mon Apr 18 14:18:30 PDT 2011
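
A quick sanity check on the report: the two “DFS Used%” figures differ because they appear to be computed against different denominators (Present Capacity in the summary, Configured Capacity in the per-node entry), and Present Capacity itself works out to Configured Capacity minus Non DFS Used.  The snippet below just redoes that arithmetic with the numbers above; the pct helper is a made-up name, not part of Hadoop.

```shell
#!/bin/sh
# Recompute the figures from the dfsadmin report above.
# pct is a hypothetical helper: prints NUMERATOR/DENOMINATOR as a percentage.
pct() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f%%\n", 100 * n / d }'
}

configured=2942589468672
non_dfs=1923444056064
present=1019145412608
used=1336396800
remaining=1017809015808

# Present Capacity = Configured Capacity - Non DFS Used
echo $((configured - non_dfs))     # 1019145412608

pct "$used" "$present"             # 0.13%  (summary "DFS Used%")
pct "$used" "$configured"          # 0.05%  (datanode "DFS Used%")
pct "$remaining" "$configured"     # 34.59% (datanode "DFS Remaining%")
```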

 



So HDFS is ready to rock. Now let’s create a ZFS dataset and populate it with some data:


root@newton ~$ zfs create -o mountpoint=/backup_test quadra/backup_test
root@newton ~$ cp *.pdf /backup_test/

 

Now, let’s create a directory within HDFS to put our backups in:


root@newton ~$ hadoop fs -mkdir /zfs_backups
root@newton ~$ hadoop fs -ls /
Found 11 items
...
drwxrwxrwx   - root supergroup          0 2011-03-05 00:07 /hypertable
drwxr-xr-x   - root supergroup          0 2011-03-02 12:27 /system
drwxr-xr-x   - root supergroup          0 2011-04-18 14:22 /zfs_backups

 

Ready to rock. So let’s actually do the backup. We’re going to create a snapshot and then zfs send it to the stdin of “hadoop fs -put”. Once we’ve done that, we’ll delete our original ZFS dataset:


root@newton ~$ zfs snapshot quadra/backup_test@051811
root@newton ~$ zfs send quadra/backup_test@051811 | hadoop \
> fs -put - /zfs_backups/backup_test.051811.zdump
root@newton ~$
root@newton ~$ zfs destroy -r quadra/backup_test
root@newton ~$ zfs list quadra/backup_test
cannot open 'quadra/backup_test': dataset does not exist
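
If you want to do this regularly, the snapshot-and-send steps above can be wrapped in a small function.  This is only a sketch: the function names, the DRY_RUN switch and the HDFS_DIR variable are made up for illustration, and the file naming simply copies the example.  It also waits for the NameNode to leave safe mode first, since HDFS rejects writes while safe mode is on, and it deliberately leaves out the zfs destroy.

```shell
#!/bin/sh
# Sketch: snapshot a dataset and stream it into HDFS, as in the transcript above.
# hdfs_target, backup_to_hdfs, DRY_RUN and HDFS_DIR are illustrative names.

HDFS_DIR=${HDFS_DIR:-/zfs_backups}

# hdfs_target DATASET TAG: print the HDFS path used for the dump file
hdfs_target() {
    # ${1##*/} keeps the last path component: quadra/backup_test -> backup_test
    printf '%s/%s.%s.zdump\n' "$HDFS_DIR" "${1##*/}" "$2"
}

# backup_to_hdfs DATASET TAG
backup_to_hdfs() {
    dataset=$1 tag=$2
    target=$(hdfs_target "$dataset" "$tag")
    if [ -n "$DRY_RUN" ]; then
        echo "zfs snapshot $dataset@$tag"
        echo "zfs send $dataset@$tag | hadoop fs -put - $target"
        return 0
    fi
    # Block until the NameNode accepts writes.
    hadoop dfsadmin -safemode wait || return 1
    zfs snapshot "$dataset@$tag" || return 1
    # Note: the pipeline's exit status is that of "hadoop fs -put" only.
    zfs send "$dataset@$tag" | hadoop fs -put - "$target"
}

DRY_RUN=1   # print the commands only; unset to run them for real
backup_to_hdfs quadra/backup_test 051811
```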

 

OK, so the ZFS snapshot has been stored as a file within HDFS and we destroyed our dataset. Now, let’s recover it using the reverse procedure:


root@newton ~$ hadoop fs -get /zfs_backups/backup_test.051811.zdump \
>  - | zfs recv -d quadra
root@newton ~$ zfs list -r quadra/backup_test
NAME                        USED  AVAIL  REFER  MOUNTPOINT
quadra/backup_test         32.3M   316G  32.3M  /quadra/backup_test
quadra/backup_test@051811      0      -  32.3M  -
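
The restore side can be wrapped the same way.  A sketch, with the same caveats: restore_from_hdfs and DRY_RUN are hypothetical names, and it uses “hadoop fs -cat” to stream the file to stdout rather than the “-get” form shown above.

```shell
#!/bin/sh
# Sketch: stream a dump file out of HDFS and into "zfs recv".
# restore_from_hdfs and DRY_RUN are illustrative names, not from the post.

# restore_from_hdfs HDFS_FILE POOL
restore_from_hdfs() {
    dump=$1 pool=$2
    if [ -n "$DRY_RUN" ]; then
        echo "hadoop fs -cat $dump | zfs recv -d $pool"
        return 0
    fi
    # "zfs recv -d" drops the sending pool's name and recreates the rest of
    # the dataset path under $pool (here: quadra/backup_test).
    hadoop fs -cat "$dump" | zfs recv -d "$pool"
}

DRY_RUN=1   # print the command only; unset to run it for real
restore_from_hdfs /zfs_backups/backup_test.051811.zdump quadra
```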

 

Notice that we lost our properties during the receive. Let’s fix that and check that our files are back:


root@newton ~$ zfs set mountpoint=/backup_test quadra/backup_test
root@newton ~$ ls -l /backup_test/
total 33076
-rw-r--r-- 1 root root   330028 2011-04-18 14:20 Deployment_Guide_for_HP_ProLiant_Servers.pdf
-rw-r--r-- 1 root root    88378 2011-04-18 14:20 GeekBench-Receipt.pdf
-rw-r--r-- 1 root root   101243 2011-04-18 14:20 HP_ProLiant_Health_Monitor_User_Guide.pdf
-rw-r--r-- 1 root root    90844 2011-04-18 14:20 HP_ProLiant_Support_Pack_User_Guide_861.pdf
-rw-r--r-- 1 root root   123419 2011-04-18 14:20 inthebeginning.pdf
-rw-r--r-- 1 root root 21337122 2011-04-18 14:20 Jurans Quality Handbook.pdf
-rw-r--r-- 1 root root  1338119 2011-04-18 14:20 perc-technical-guidebook.pdf
-rw-r--r-- 1 root root  2401352 2011-04-18 14:20 PowerEdgeR510_Technical_Guidebook[1].pdf
-rw-r--r-- 1 root root  5923274 2011-04-18 14:20 R710-HOM.pdf
-rw-r--r-- 1 root root  1103929 2011-04-18 14:20 server-poweredge-r710-tech-guidebook.pdf
-rw-r--r-- 1 root root   504059 2011-04-18 14:20 TheQualityTrilogy.pdf

 

Please note that the dataset properties were not retained because I’m using an old ZPool (Version 18). If you’re running a newer pool version (check with “zpool get all pool_name”), the properties will go with the backup stream.
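
For pools where the stream does not carry the properties, one possible workaround is to save the locally set properties next to the dump and reapply them after the receive.  This is a rough sketch, not something tested in the run above: save_props, apply_props, DRY_RUN and the “.props” file are all made-up names, and it assumes your zfs supports “zfs get -H -o property,value -s local”.

```shell
#!/bin/sh
# Sketch: carry locally set properties alongside a dump and reapply them later.
# save_props, apply_props and DRY_RUN are illustrative names.

# save_props DATASET HDFS_FILE: store "property<TAB>value" lines in HDFS
save_props() {
    zfs get -H -o property,value -s local all "$1" | hadoop fs -put - "$2"
}

# apply_props DATASET: read "property<TAB>value" lines on stdin, set each one
apply_props() {
    tab=$(printf '\t')
    while IFS=$tab read -r prop value; do
        if [ -z "$prop" ]; then continue; fi
        if [ -n "$DRY_RUN" ]; then
            echo "zfs set $prop=$value $1"
        else
            zfs set "$prop=$value" "$1"
        fi
    done
}

DRY_RUN=1   # print the commands only; unset to run them for real
printf 'mountpoint\t/backup_test\n' | apply_props quadra/backup_test
# prints: zfs set mountpoint=/backup_test quadra/backup_test
```

In real use the input to apply_props would come from the saved file, e.g. “hadoop fs -cat” on the properties file piped into apply_props.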

So there you go. Backup and recovery using ZFS send/recv to and from HDFS. Straightforward and easy to implement.