Archive for February, 2009

ZFS in the Trenches: Presentation from OpenSolaris Storage Summit

Tuesday, February 24th, 2009

And so ends another OpenSolaris Storage Summit. Turnout was great, and even though the summit was strategically attached to the FAST conference, I was shocked at how many people flew in from around the world specifically for this one-day event. A big thanks to all who attended.

I presented a lightning-fast “ZFS in the Trenches” talk, which is guaranteed to be the most in-depth ZFS talk you’ve ever heard in 30 minutes. When I gave the talk it felt like I was flying so fast that it was incomprehensible, but on review it came off pretty well.

You can view the video here; skip ahead to the 2:20 mark. You can also download the slides here (PDF).

I’d like to extend a warm thank you to all the great people that I had the pleasure of speaking to, and I have to brag that I finally met the StorageMojo, Robin Harris and the hero of Sun PAE Roch, thus filling my celebrity introduction quota for the event.

LinkedIn Replaced The Resume; or 21st Century Relevance

Monday, February 23rd, 2009

I’ve increasingly embraced the reality that LinkedIn has replaced the resume. This post on Slashdot convinced me that it’s worth saying publicly.

For what it’s worth, I don’t like LinkedIn. It limits the data that you can present, leaving out a lot of what would be on a resume: an inventory of skills, various extra work accomplishments, publications, and the like. It’s a little too focused on job history and education. But I admit that I also never finished my degree. When the “bubble” started in 1998 and I was in school just 50 miles north of Silicon Valley, I wondered what happened to those who stayed in school during the rise of other past revolutions… so I dropped out and rode the wave of progress. That fact is easily omitted from a resume but less so on sites such as LinkedIn, which consider my profile eternally 90% complete.

LinkedIn did for resumes what Google did for physical reference material… sure, you could sift through your pile of notes, but face it, searching Google is easier and faster. So it is with people: given an applicant, I think many people are more likely to look them up on LinkedIn or Google their name than read their resume.

Is the resume dead? No, it’s still handy in interviews… but it ends there, I think.

That brings us to the big point… if you don’t have an online presence, I don’t think you’re relevant in the 21st century. We live in the Internet age. If you’re in the IT industry and your name doesn’t return valid results from Google, you are simply irrelevant. You’re not participating. You’re not engaging. You’re not leveraging mailing lists and the external community. And that is a bad sign indeed.

I won’t go so far as to say you need to have a blog or website, but I think it shows that you take your profession seriously. It certainly helps.

I’ve always been amazed that several “big names” in the SA world are void online. I tried to dig up information on the heads of SAGE and LOPSA… not much. What that says to me is that these people are bigwigs only in certain select circles. They have earned names and respect for themselves, but not for anyone to see, just those that matter. And who is that? I don’t mean to pick on people involved with LOPSA and SAGE, there are lots in this camp, but they struck me as odd given their position.

Knowledge and experience are things best when shared. I’m a Christian and a Free Software advocate. I believe that you should love your neighbor and there shouldn’t be a price tag on it. Share the love… and share it as widely as you can, in whatever way you’ve been gifted to do so.

When it comes to pragmatic job hunting… if I’ve got two potential candidates, one who has a blog and is actively participating in the community (whatever community or committees those are) and another who has no presence on the net, which do you think I should choose?

There are only two reasons to hide. The first is fear of failure. Well, you can’t succeed if you don’t try. The second then is privacy… well, sorry, in the age of social networking staying under the radar is essentially impossible, and so if you’re actively avoiding it, you’re just starting to look dusty. As a commenter on the Slashdot thread rightly pointed out, the phone book invades your privacy more than Twitter or LinkedIn. Sure, you may know that I’m getting coffee right now or that I worked at MCI Systemhouse, but at least you don’t have my phone number or address.

Solaris Spit & Polish

Tuesday, February 10th, 2009

An interesting discussion has been taking place on the OpenSolaris SysAdmin Community list, and I sense it will lead us toward some important changes in Solaris. Essentially it all comes down to the lack of spit and polish. Something we have perhaps always ignored or downplayed now stands in stark contrast to truly easy-to-use yet complex things such as ZFS or SMF.

The clearest examples are technologies that currently are essentially useless without custom scripting. Such examples include LDAP, Extended Accounting, and BSM Auditing.

LDAP is one that’s really concerned me. Almost any Solaris environment would benefit greatly from an LDAP/Kerberos implementation, for ease of management and increased security… but frankly, just dropping in a directory server and authenticating to it isn’t so straightforward. Populating and maintaining the DIT is complex, commonly requiring custom scripts and possibly a 3rd party LDAP browser. While the aging idsconfig script is supposed to jumpstart your experience, it’s not perfect and is tailored to Sun DSEE. In the community we commonly see people scratching their heads wondering if other directory servers, such as OpenLDAP, even work with Solaris and how to get started.

Microsoft hit a home run with Active Directory, and it pains me in the same way that NetApp kicked Sun’s ass at building NFS servers. Sun is a systems company and the leading provider of directory/identity management products, but if you want to use them in conjunction with Solaris you’ve got a lot of custom work to do. As for Kerberos, most of the use continues to be in academic environments, which means that the best means to secure NFS in a corporate environment just isn’t used.

Sun is very good at engineering the big things, but I’ve noticed that when it comes to connecting all the dots they tend to turn toward the path of acquisition. A need arises for a management app or something, they find a decent software company doing it, acquire them, and then slowly let the thing rot. I mean, how many people still use Sun Management Center or N1 Provisioning Server? (Or ever did for that matter.)

A lot of focus has gone into the GNU-ification of Solaris and improving the desktop experience with Indiana… I mean OpenSolaris… but at some point we’ve got to get back around to focusing on what Solaris does best, being the enterprise class server operating system we know and love.

This is especially important in the face of Cloud Computing. The cloud needs solid server operating systems, and Solaris leads the pack. If we’ve proved one thing with Solaris 10, it’s that making Solaris more like Linux doesn’t have nearly the impact we hoped it would, but making the complex simple and straightforward (ZFS, DTrace, SMF, FMA, …) is dramatic.

Monitoring, Management, and Infrastructure are what we need. Easy, quick, and powerful. We have the technology underneath, we just need to bring it all together.

What say you?

Solaris Extended Accounting

Monday, February 9th, 2009

Extended Accounting is one of the many underappreciated features in Solaris, and quite possibly the worst documented to boot. So, it’s time to fix that.

Accounting, in general, is a means of recording data about resource utilization, CPU in particular, with the intention to be used primarily for reporting and billing purposes. Conceptually it could be confused with Auditing (see I See You!: Solaris Auditing (BSM)), in that they both record activity, except that the scope is vastly different. With BSM you can see every syscall/process/etc. that is executed, by whom, when, with what privileges, etc.; but the purpose of auditing is to determine what happened, when it happened, how it happened, and by whom in the context of security. By contrast, Accounting is interested in what ran, how long it ran, how much resource it consumed, and some basic related information in the context of resource utilization.

When used in conjunction with Solaris’s array of observability tools, from BSM Auditing, to kstats, to DTrace, it provides a useful low-impact way of maintaining a historical view of resource utilization that is more detailed than kstats, more permanent than DTrace, and more targeted than auditing. With kstats you can graph CPU usage over time, but with Extended Accounting you can determine the breakdown of CPU usage within the same period, for example.

The first thing you need to know about accounting on Solaris is that there are TWO separate accounting systems. The documentation did a poor job of making this clear for a long time. The UNIX Accounting system, we’ll call it, has been around for over a decade and is a hodgepodge of ancient scripts. It’s still there and documented in the Solaris Admin Guide. Like mainframes or NIS+, if you’re not already using it you shouldn’t start now. In this blog entry we focus purely on the newer Solaris-specific Extended Accounting facility.

Extended Accounting was added to Solaris in Solaris 9 along with the excellent microstate accounting enhancements. Solaris maintains all sorts of amazing stats about processes that can be explored with proc tools, and the many *stat tools (mpstat, vmstat, etc.) via options uncommonly used. The extended accounting facility really just provides a framework in which these stats can be stored in a datafile for later use, rather than be discarded when a process terminates. When a process calls exit(), a call is made to extended accounting, and if it is enabled for tasks or processes the data is dumped into a record.

The framework is extensible for use in a variety of ways. For instance, when using Solaris IPQoS you can use extended accounting to record network flow data. As part of Crossbow a wonderful surprise was a new non-IPQoS flow extended accounting resource which I previously discussed here: Crossbow for Christmas.

Getting started with Extended Accounting (exacct for short) is easily done using the acctadm command:

$ acctadm
            Task accounting: inactive
       Task accounting file: none
     Tracked task resources: none
   Untracked task resources: extended
         Process accounting: active
    Process accounting file: /var/adm/exacct/proc
  Tracked process resources: extended
Untracked process resources: host
            Flow accounting: inactive
       Flow accounting file: none
     Tracked flow resources: none
   Untracked flow resources: extended
             Net accounting: active
        Net accounting file: /var/adm/exacct/net
      Tracked net resources: extended
    Untracked net resources: none

Here you can see the various accounting types available: Task, Process, Flow (IPQoS), and Net (Crossbow). Each accounting type has a user-defined datafile in which to store records. A variety of different resources can be tracked for each accounting type; these can either be selected individually or specified as “extended” or “basic”, which are groupings of resources. To see what they are use acctadm -r:

$ acctadm -r
process:
extended pid,uid,gid,cpu,time,command,tty,projid,taskid,ancpid,wait-status,zone,flag,memory,mstate
basic    pid,uid,gid,cpu,time,command,tty,flag
task:
extended taskid,projid,cpu,time,host,mstate,anctaskid,zone
basic    taskid,projid,cpu,time
flow:
extended saddr,daddr,sport,dport,proto,dsfield,nbytes,npkts,action,ctime,lseen,projid,uid
basic    saddr,daddr,sport,dport,proto,nbytes,npkts,action
net:
extended name,ehost,edest,vlan_pid,vlan_tci,sap,priority,bwlimit,devname,src_ip,dst_ip,src_port,dst_port,protocol,dsfield,curtime,ibytes,obytes,ipkts,opkts,ierrpkts,oerrpkts
basic    name,devname,ehost,edest,vlan_pid,vlan_tci,sap,priority,bwlimit,curtime,ibytes,obytes,ipkts,opkts,ierrpkts,oerrpkts

To enable a given accounting type specify the file, resources, and then enable it with the -E option, like so:

root@quadra ~$ acctadm -f /var/adm/exacct/task task
root@quadra ~$ acctadm -e extended task
root@quadra ~$ acctadm -E task
root@quadra ~$ acctadm task
            Task accounting: active
       Task accounting file: /var/adm/exacct/task
     Tracked task resources: extended
   Untracked task resources: none
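
Since the same three steps apply to every accounting type, they can be wrapped in a small shell function. This is only a sketch: the function name enable_exacct and the ACCTADM override variable are my own inventions for illustration (the override lets you preview the sequence with a stub instead of touching the system), and it assumes a POSIX-style shell such as ksh93 or bash, run as root.

```shell
# Sketch only: enable_exacct and ACCTADM are hypothetical names, not part of Solaris.
# Point ACCTADM at a stub command to preview the calls without running acctadm.
ACCTADM=${ACCTADM:-acctadm}

enable_exacct() {
    acct_type=$1                # task, process, flow, or net
    acct_file=$2                # data file, e.g. /var/adm/exacct/task
    acct_res=${3:-extended}     # resource group or comma-separated resource list

    # Same order as the manual steps: set the file, pick resources, enable.
    $ACCTADM -f "$acct_file" "$acct_type" &&
    $ACCTADM -e "$acct_res" "$acct_type" &&
    $ACCTADM -E "$acct_type"
}
```

Calling enable_exacct process /var/adm/exacct/proc would then run the same three acctadm commands for process accounting, stopping at the first one that fails.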

All the above is well documented and straightforward. The tricky part is actually using the data that’s collected and dealing with long-running processes.

As I stated before, accounting records are written at process termination. That means that if you start a long-running process, such as a MySQL daemon that runs for 6 months, you won’t see a record for it until it’s stopped. That sort of defeats the usefulness of accounting. Therefore, using the wracct command you can cause an individual process or task to write a “partial record”. In this case, “partial” differentiates it from a full record written at exit. So if you want accounting data for such processes, you’ll want a cron job that executes wracct against the process(es) at some given interval.

# wracct -i `pgrep arkeiad` -t partial process
#
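
The backtick form above hands whatever pgrep prints straight to wracct, which gets awkward when more than one process matches or none does. As far as I recall from the wracct man page, -i takes its ID list as a single argument, so a cron-friendly wrapper might join the PIDs first. A minimal sketch, where write_partial and the PGREP/WRACCT override variables are hypothetical names of my own:

```shell
# Sketch only: write_partial, PGREP and WRACCT are hypothetical names.
# The overrides exist so the logic can be exercised with stubs.
PGREP=${PGREP:-pgrep}
WRACCT=${WRACCT:-wracct}

write_partial() {
    # Collect matching PIDs and join them with spaces into one argument.
    pids=$($PGREP "$1" | tr '\n' ' ')
    pids=${pids% }

    # Nothing matched: nothing to write, and not an error for cron's purposes.
    [ -n "$pids" ] || return 0

    # Assumption: wracct accepts a space-separated ID list as a single -i argument.
    $WRACCT -i "$pids" -t partial process
}

# Example call (commented out so that sourcing this file has no side effects):
# write_partial arkeiad
```

Saved as a script with the example call uncommented, this could be run from root’s crontab at whatever interval suits you, for instance at minutes 0,15,30,45 of every hour; the script path and the interval are yours to pick.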

There are two tools provided with Solaris to examine extended accounting data files. One is the lastcomm command, which is like a shell’s “history” command, showing a history of commands run based on the process accounting data. The second is actually a demo C application found in /usr/demo/libexacct called exdump. This demo app is intended to help developers learn the extended accounting library (libexacct), but is used all too commonly by administrators to output the contents of an accounting file.

If you really want to take advantage of accounting data you’ll need to build your own custom tools using either the Exacct PERL module (written by Sun, found in CPAN, and part of the Solaris PERL distribution) or the libexacct C library.

Getting started with the PERL modules can be confusing initially. You can find documentation for the module on CPAN and some examples in the System Administration Guide: Solaris Containers-Resource Management and Solaris Zones.

The problem with the examples in the manual is that they mask some of the inner workings via the convenience “dump” function. Thus I present to you a modified version that is a little more explicit:

#!/usr/bin/perl

##
## This is a modified version of the 'dumpexacct' example from the
## 'System Administration Guide: Solaris Containers-Resource Management and Solaris Zones'
## In this version, rather than use the "dump()" convenience function I instead navigate
## directly... this helps better illustrate how to actually use the Exacct PERL Module.
#
## benr@cuddletech.com - 2/7/09

use strict;
use warnings;
use Sun::Solaris::Exacct qw(:EXACCT_ALL);

die("Usage is $0 <exacct file>\n") unless (@ARGV == 1);

# Open the exacct file and display the header information.
my $ef = ea_new_file($ARGV[0], &O_RDONLY) || die(ea_error_str());
printf("Creator:  %s\n", $ef->creator());
printf("Hostname: %s\n\n", $ef->hostname());

my $obj_counter = 0;

# Dump the file contents
while (my $obj = $ef->get()) {                                                          ## Get an object from the EA File and reposition to next.
        print("---------------- OBJECT $obj_counter -----------------------\n");
        $obj_counter++;

        my $objectType = $obj->type();                                                  ## Return the object type (group or item)
        my $objectCatalogObj =  $obj->catalog();                                        ## Return a catalog object for this object
        my ($a, $b, $c) = $objectCatalogObj->value();                                   ## Breakout the catalog triplet

        printf("Object is: %s   -   Catalog: %s %s %s\n", $objectType, $a, $b, $c );   ## Output the catalog and object type for the object at hand.

        if ( $objectType == EO_GROUP && $c == EXD_GROUP_PROC ) {                        ## If this is a PROC Group....
                my @objectList = $obj->value();                                         ## Return the sub-group which will contain actual items.

                foreach my $item (@objectList) {
                        my $itemsType = $item->type();                                  ## Get type for new object.. is it a EO_ITEM or EO_GROUP?

                        if ($itemsType == EO_ITEM){
                                my $itemCatalog = $item->catalog();                             ## Return the "Catalog Object" for this object.
                                my ($itemType, $itemGroup, $itemId) = $itemCatalog->value();    ## Breakout the catalog triple
                                my $itemValue = $item->value();                                 ## Get the item value itself.

                                print("\t\tId: $itemId \tValue: $itemValue\n");              ## Now print just the catalog item id and item value.
                        } else {
                                print("\t\tERROR: Expected Item, got a group instead.\n");
                        }
                }
        }

}

# Report any errors
if (ea_error() != EXR_OK && ea_error() != EXR_EOF)  {
        printf("\nERROR: %s\n", ea_error_str());
        exit(1);
}

The example opens an accounting file and returns a file object. Using that file object we can get some metadata, and via the get() method we can walk the file, because after each get() it repositions to the next object. An accounting file is a collection of objects, which come in the form of groups and items. Groups contain yet other groups or items. Items contain data. Each object has an associated catalog, which is a triplet that describes the object in the form: type, group, and id.

When implementing your accounting program you’ll generally be walking the objects, then navigating groups and items to get to the data you want. To get your footing as to the structure of groups and types of data available in each item use the demo exdump tool and then start playing around. You’ll find the /usr/include/sys/exacct_catalog.h header useful in deciphering the catalog values. The heavy use of double-typed variables can be confusing at first, but the header will help it make sense.
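
If you end up staring at raw catalog tags, it helps to know how the triplet is packed. My understanding of exacct_catalog.h is that a tag is a 32-bit value with the type in the top 4 bits, the catalog in the next 4 bits, and the data id in the low 24 bits; treat that layout as an assumption and confirm it against the header on your system. Under that assumption, a throwaway shell function (decode_catalog is a made-up name, and it needs a POSIX shell with arithmetic expansion) splits a tag like so:

```shell
# Assumed layout (verify in /usr/include/sys/exacct_catalog.h):
#   bits 31-28 = type, bits 27-24 = catalog, bits 23-0 = data id
# decode_catalog is a hypothetical helper, not a Solaris tool.
decode_catalog() {
    tag=$1
    cat_type=$(( ($tag >> 28) & 0xf ))
    cat_group=$(( ($tag >> 24) & 0xf ))
    cat_id=$(( $tag & 0xffffff ))
    printf 'type=0x%x catalog=0x%x id=0x%06x\n' "$cat_type" "$cat_group" "$cat_id"
}

# Example with an arbitrary tag value:
# decode_catalog 0x30001000
```

The three numbers can then be matched by eye against the EXT_*, EXC_* and EXD_* definitions in the header.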

As you can see, utilizing extended accounting isn’t for the casual user… but if you can find value in the data it collects you’ll reap the benefits of your labor learning the PERL module. If you are serious, I’d recommend taking the example above to break out selected items and formatting or sorting them in a more palatable way. Here is a simple modified version that outputs one line per record for a small handful of items:

#!/usr/bin/perl

use strict;
use warnings;
use Sun::Solaris::Exacct qw(:EXACCT_ALL);

die("Usage is $0 <exacct file>\n") unless (@ARGV == 1);

# Open the exacct file and display the header information.
my $ef = ea_new_file($ARGV[0], &O_RDONLY) || die(ea_error_str());
printf("Creator:  %s\n", $ef->creator());
printf("Hostname: %s\n\n", $ef->hostname());

## Print Header:
printf("%10s %4s %4s %5s %30s\n", "Counter", "UID", "GID", "PID", "CMD");
my $idxCounter = 1;

# Dump the file contents
while (my $obj = $ef->get()) {                                                          ## Get an object from the EA File and reposition to next.
        $idxCounter++;  ## Count records.

        my $objectType = $obj->type();                                                  ## Return the object type (group or item)
        my $objectCatalogObj =  $obj->catalog();                                        ## Return a catalog object for this object
        my ($a, $b, $c) = $objectCatalogObj->value();                                   ## Breakout the catalog triplet

        #printf("Object is: %s   -   Catalog: %s %s %s\n", $objectType, $a, $b, $c );  ## Output the catalog and object type for the object at hand.

        if ( $objectType == EO_GROUP && $c == EXD_GROUP_PROC ) {                        ## If this is a PROC Group....
                my @objectList = $obj->value();                                         ## Return the sub-group which will contain actual items.

                my $xxPID = "";
                my $xxCMD = "";
                my $xxUID = "";
                my $xxGID = "";

                foreach my $item (@objectList) {
                        my $itemsType = $item->type();                                  ## Get type for new object.. is it a EO_ITEM or EO_GROUP?

                        if ($itemsType == EO_ITEM){
                                my $itemCatalog = $item->catalog();                             ## Return the "Catalog Object" for this object.
                                my ($itemType, $itemGroup, $itemId) = $itemCatalog->value();    ## Breakout the catalog triple
                                my $itemValue = $item->value();                                 ## Get the item value itself.

                                #print("\t\tId: $itemId \tValue: $itemValue\n");             ## Now print just the catalog item id and item value.

                                $xxPID = $itemValue if ( $itemId == EXD_PROC_PID );
                                $xxUID = $itemValue if ( $itemId == EXD_PROC_UID );
                                $xxGID = $itemValue if ( $itemId == EXD_PROC_GID );
                                $xxCMD = $itemValue if ( $itemId == EXD_PROC_COMMAND );

                        } else {
                                print("\t\tERROR: Expected Item, got a group instead.\n");
                        }
                } 

                ## Now output pretty formatted data:
                printf("%10d %4d %4d %5d %30s\n", $idxCounter, $xxUID, $xxGID, $xxPID, $xxCMD);
        }

}

# Report any errors
if (ea_error() != EXR_OK && ea_error() != EXR_EOF)  {
        printf("\nERROR: %s\n", ea_error_str());
        exit(1);
}

Here is what the above looks like when run:

$ ./prettyproc.pl  /var/adm/exacct/proc | head
Creator:  SunOS
Hostname: quadra

   Counter  UID  GID   PID                            CMD
         2    0    0  6878                        acctadm
         3    0    0  6879                        acctadm
         4 1004  999  6883                         db2set
         5 1004  999  6882                             sh
         6 1004  999  6881                          db2fm
         7    0    0  6880                             sh

Please note that there should be a lot of additional checking in the above, and you should differentiate between items that are full versus partial, but this should help you get started down the extended accounting road. :)
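
Once you have one line per record, ordinary text tools take over. As a small illustration, this sketch counts records per UID from prettyproc.pl output on standard input. The function name count_by_uid and the AWK override are made up, and it assumes the column layout shown above (Counter, UID, GID, PID, CMD); on Solaris you may prefer nawk over the older awk.

```shell
# Sketch only: count_by_uid and AWK are hypothetical names.
# Set AWK=nawk on Solaris if the default awk misbehaves.
AWK=${AWK:-awk}

count_by_uid() {
    # Data rows start with a numeric counter and have at least 5 columns;
    # the Creator/Hostname/header lines do not, so they are skipped.
    $AWK 'NF >= 5 && $1 ~ /^[0-9]+$/ { n[$2]++ }
          END { for (uid in n) print uid, n[uid] }' | sort -n
}

# Example call:
# ./prettyproc.pl /var/adm/exacct/proc | count_by_uid
```

Each output line is a UID followed by the number of process records written for it, sorted by UID.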

Photo Archiving

Thursday, February 5th, 2009

I’m a storage guy and a father. Coming of age in the digital era means that I’ve never taken a picture on celluloid. I used my first bonus at MCI Systemhouse to buy my first real camera, a floppy-disk-based Sony Mavica (which solved the early Linux/UNIX camera driver issue).

Now, here I sit with dozens or perhaps hundreds of gigabytes of memories. One of the great joys of being into storage is that on one hand I’m aware of the wide variety of data solutions available… but I’m also aware of how fragile all these solutions are.

So I put it to my readers… what is the best method for photo archiving? We’re talking about pictures we want to see in 30 years.

One popular method is to use an online backup or photo archiving site, such as Flickr or SmugMug or StrongSpace or Box.net. But will these businesses be around in 30 years? It is possible that these services could lose the photos, and there isn’t much you can do about it.

A hedge would be to use multiple services, to have 2 archived copies. But that means active management of the data. You need to check in on things from time to time and ensure that it’s all there intact.

Tapes are too expensive, so they are just out altogether.

When I think about it, I can’t help but feel that the best solution is, frankly, to burn your images to optical disc (DVD) and store them in a bank vault (safe deposit box). You could go so far as to burn two copies and store both just in case, given that people stand behind optical discs for about 10 years and then it’s anyone’s guess… although we all, I think, agree that in a safe environment such as a bank vault, degradation of optical media is unlikely to be a problem. DVD is also a format most likely to be around, in some form, in 30 years.

The biggest problem with DVD is the small capacity. Blu-ray is better, but its life is still questionable, especially for data storage. USB sticks or even hard drives present mechanical issues… Will USB be around in 30 years? Will the filesystems still work then?

So I put it to you again… what do you think is the best means of storing long term personal data in large quantities?