Archive for July, 2008

Solaris Kerberos Revisited

Monday, July 28th, 2008

Some time ago I wrote a big blog entry entitled Simplifying Zone Management with Kerberos. Since that time (3 years ago!?!) several improvements and simplifications have come along to make life far more pleasant. If you’re new to Kerberos, please refer to my previous entry before proceeding with this one.

Kerberos is still something I struggle to really wrap my head around. That’s probably because I have never used it in a full production deployment… the problem with using something only in the lab is that you are never “done” and you never hit the real-world problems that push you beyond what you dream up yourself.

There are classically two reasons people get interested in Kerberos: Single Sign On (SSO) and secure NFS. Once upon a time general purpose encryption support for tools like telnet, rlogin, and FTP was a great draw, but today, between SSH and SSL/TLS support in almost everything, that alone just isn’t a reason to go down the Kerberos path. SSO is a reason, but given that the de facto UNIX remote management protocol is SSH and SSH supports passwordless (authorized keys) login, it’s easy to mimic the benefits of SSO… particularly if you use OpenSSH LPK, which allows you to store public keys in LDAP. As for NFS, well, there you probably still wanna stick with Kerberos unless you are willing to go the NFSv4/IPsec route.

The point here being, prior to SSH’s rise to glory Kerberos was a vitally important enterprise solution, whereas today it’s a hard sell. When it comes down to it, there is really only one standout reason to seriously consider Kerberos: passwords are never on the wire, period. That is indeed a cool thing, but I doubt it will be enough to convince most people to go for it. In the end, Kerberos is only really exciting when it’s implemented in a way such as Active Directory, where you don’t even know it’s there, you just turn it on and reap the benefits. So you can almost conclude that Kerberos is as important as ever, it’s just no longer worth the expenditure of effort.

To this end, I advise anyone interested in LDAP/Kerberos to get familiar with ApacheDS; the work they’re doing is truly amazing. …but more about that some other time.

As I was saying, Kerberos is now easier than ever to get up and running on Solaris, thanks to the kdcmgr and kclient tools.

Thanks to the kdcmgr command we no longer need to hand edit files and run kdb5_util to initialize a Kerberos KDC. Here is an example of kdcmgr in action:

$ kdcmgr -a benr/admin -r CUDDLETECH.COM create master

Starting server setup
---------------------------------------------------

Setting up /etc/krb5/kdc.conf.

Setting up /etc/krb5/krb5.conf.

Initializing database '/var/krb5/principal' for realm 'CUDDLETECH.COM',
master key name 'K/M@CUDDLETECH.COM'
You will be prompted for the database Master Password.
It is important that you NOT FORGET this password.
Enter KDC database master key:
Re-enter KDC database master key to verify: 

Authenticating as principal root/admin@CUDDLETECH.COM with password.
WARNING: no policy specified for benr/admin@CUDDLETECH.COM; defaulting to no policy
Enter password for principal "benr/admin@CUDDLETECH.COM":
Re-enter password for principal "benr/admin@CUDDLETECH.COM":
Principal "benr/admin@CUDDLETECH.COM" created.

Setting up /etc/krb5/kadm5.acl.

---------------------------------------------------
Setup COMPLETE.

Done! If you look around you’ll see this has created the necessary config and keytabs, as well as started the appropriate SMF services:

$ svcs -a | grep security
disabled       May_30   svc:/network/security/krb5_prop:default
online         May_30   svc:/network/security/ktkt_warn:default
online          3:11:34 svc:/network/security/krb5kdc:default
online          3:11:36 svc:/network/security/kadmin:default
$ ls -alh /etc/krb5/
total 40
drwxr-xr-x   2 root     sys          512 Jul 27 03:11 .
drwxr-xr-x  96 root     sys         4.5K Jul 25 14:58 ..
-rw-r--r--   1 root     sys           52 Jul 27 03:11 kadm5.acl
-rw-r--r--   1 root     root         998 Jul 27 03:11 kadm5.acl.sav
-rw-------   1 root     root        1.5K Jul 27 03:11 kadm5.keytab
-rw-r--r--   1 root     sys          430 Jul 27 03:11 kdc.conf
-rw-r--r--   1 root     root        1.3K Jul 27 03:11 kdc.conf.sav
-rw-r--r--   1 root     sys          968 Mar 22 10:43 kpropd.acl
-rw-r--r--   1 root     sys          367 Jul 27 03:11 krb5.conf
-rw-r--r--   1 root     root        2.0K Jul 27 03:11 krb5.conf.sav
-rw-------   1 root     root         383 Jul 27 03:11 krb5.keytab
-rw-r--r--   1 root     sys         1.1K Mar 22 10:43 warn.conf

As you can see, it’s all there and in order. You can immediately use kinit to get a ticket and use kadmin to create additional principals, no kadmin.local required. Very refreshing indeed.

As for the clients, manual configuration is so simple as to require very little improvement; nevertheless, we can use the kclient command to standardize it:

# kclient -a benr/admin -k kdc1.cuddletech.com -R CUDDLETECH.COM

Starting client setup

---------------------------------------------------
Do you want to use DNS for kerberos lookups ? [y/n]: n
        No action performed.
kdc1.cuddletech.com

Note, this system and the KDC's time must be within 5 minutes of each other for Kerberos to function.  Both systems should run some form of time synchronization system like Network Time Protocol (NTP).

Setting up /etc/krb5/krb5.conf.
Obtaining TGT for benr/admin ...
Password for benr/admin@CUDDLETECH.COM:
localhost: RPC: Rpcbind failure - RPC: Success
kinit:  no ktkt_warnd warning possible

host/zone1.cuddletech.com entry ADDED to KDC database.
host/zone1.cuddletech.com entry ADDED to keytab.

---------------------------------------------------
Setup COMPLETE.

Easy as that. Please note that despite the RPC failure message the setup did succeed, and the failure didn’t affect the configuration.

I want to note explicitly that in the above examples the hosts involved, namely the KDC, were present in DNS. Trying to do the same configuration above without DNS is not so easy and if you need to do so I recommend avoiding kdcmgr and using the manual procedure in my previous Kerberos post.

With regard to SSH, in Solaris 10 there is no need to make any changes to accommodate SSH. You do not need to edit /etc/pam.conf or even change the /etc/ssh/sshd_config file. Set up a KDC as above, set up a client as above (a zone for instance), get a ticket (“kinit”) and SSH… you’re rockin’ and rollin’.

If you need help with your installation or want to get connected with other Solaris Kerberos fans, visit the OpenSolaris Kerberos Project.

Happy SysAdmin Day

Friday, July 25th, 2008

Today is everyone’s favorite day of the year: SysAdmin Day! To all my fellow admins, from myself and Joyent, a very warm pat on the back and “thank you” for all that hard work that no one else acknowledges. So kick back, enjoy a cold pint, and bask in the glory that is UNIX Systems Administration!

WARGAMES IN THEATER TONIGHT ONLY!!!!!!!!

Thursday, July 24th, 2008

This is very important, WarGames, the most important geek film ever made, is in theaters TONIGHT at 7:30pm. ONE SHOWING ONLY.

This film has been hugely influential for me, and many others. That awesome unbuttoned shirt over tshirt look that we emulate to this day, decades of trying to get text-to-voice to sound like the film (which was an actor reading the lines word by word in reverse), and who didn’t fall in love with Ally Sheedy!!!

Don’t be dumb!!! Go! Tonight! 7:30PM!!!

DTrace IP Provider… Oh no you didn’t….

Wednesday, July 23rd, 2008

In my previous post about the IP Provider I got the following comment: “There is nothing unpleasant about the wonderfulness that is tcpdump! You’ll need to put a lot of work in to match tcpdump’s usefulness with Dtrace…”

That just sounds like a challenge. Bring it on! Can snoop or tcpdump do this?

root@ultra ~$ ./ip_whosent.d
Packet sent to 192.168.100.4: 88 byte packet on behalf of ssh (PID: 1075)
Packet sent to 192.168.100.4: 88 byte packet on behalf of ssh (PID: 1075)
Packet sent to 208.67.222.222: 56 byte packet on behalf of nscd (PID: 152)
Packet sent to 208.67.222.222: 71 byte packet on behalf of nscd (PID: 152)
Packet sent to 208.67.222.222: 56 byte packet on behalf of nscd (PID: 152)
Packet sent to 72.14.207.99: 52 byte packet on behalf of firefox-bin (PID: 1944)
Packet sent to 8.12.32.9: 52 byte packet on behalf of thunderbird-bin (PID: 1133)
Packet sent to 8.12.32.9: 54 byte packet on behalf of thunderbird-bin (PID: 1133)
Packet sent to 8.12.32.9: 87 byte packet on behalf of thunderbird-bin (PID: 1133)
Packet sent to 8.12.32.9: 58 byte packet on behalf of thunderbird-bin (PID: 1133)
Packet sent to 8.12.32.9: 64 byte packet on behalf of thunderbird-bin (PID: 1133)
Packet sent to 8.12.32.9: 65 byte packet on behalf of thunderbird-bin (PID: 1133)
Packet sent to 208.67.219.230: 644 byte packet on behalf of firefox-bin (PID: 1944)
Packet sent to 208.67.219.230: 637 byte packet on behalf of firefox-bin (PID: 1944)
Packet sent to 72.14.207.99: 660 byte packet on behalf of firefox-bin (PID: 1944)
Packet sent to 208.67.219.230: 52 byte packet on behalf of firefox-bin (PID: 1944)
Packet sent to 208.67.219.230: 664 byte packet on behalf of firefox-bin (PID: 1944)
Packet sent to 8.12.32.9: 48 byte packet on behalf of thunderbird-bin (PID: 1133)
Packet sent to 72.14.207.99: 40 byte packet on behalf of firefox-bin (PID: 1944)
^C

Here is the script:

#!/usr/sbin/dtrace -qs 

ip:ip:*:send
/execname != "sched"/
{
        printf("Packet sent to %s: %d byte packet on behalf of %s (PID: %d)\n",
                        args[2]->ip_daddr, args[4]->ipv4_length, execname, pid );
}

Oh but wait……. how about a full call stack on each sent packet? Just add a new line to the above script: stack();

root@ultra ~$ ./ip_sentstack.d
Packet sent to 72.14.207.99: 84 byte packet on behalf of ping (PID: 2020)

              ip`ip_wput_ire+0x21f5
              ip`ire_send+0x1c9
              ip`ire_add_then_send+0x2b9
              ip`ip_newroute+0xa0a
              ip`ip_output_options+0x18c7
              ip`icmp_wput+0x44a
              unix`putnext+0x22b
              genunix`strput+0x1ad
              genunix`kstrputmsg+0x261
              sockfs`sosend_dgram+0x26e
              sockfs`sotpi_sendmsg+0x4a8
              sockfs`sendit+0x160
              sockfs`sendto+0x8e
              sockfs`sendto32+0x2d
              unix`sys_syscall32+0x101

Or check out one of the examples on the IP Provider wiki page (this is almost certainly by Brendan Gregg):

# ./ipio.d
 CPU  DELTA(us)          SOURCE               DEST      INT  BYTES
   1     598913    10.1.100.123 ->   192.168.10.75  ip.tun0     68
   1         73   192.168.1.108 ->     192.168.5.1     nge0    140
   1      18325   192.168.1.108 <-     192.168.5.1     nge0    140
   1         69    10.1.100.123 <-   192.168.10.75  ip.tun0     68
   0     102921    10.1.100.123 ->   192.168.10.75  ip.tun0     20
   0         79   192.168.1.108 ->     192.168.5.1     nge0     92

Here is the script:

#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option switchrate=10hz

dtrace:::BEGIN
{
        printf(" %3s %10s %15s    %15s %8s %6s\n", "CPU", "DELTA(us)",
            "SOURCE", "DEST", "INT", "BYTES");
        last = timestamp;
}

ip:::send
{
        this->elapsed = (timestamp - last) / 1000;
        printf(" %3d %10d %15s -> %15s %8s %6d\n", cpu, this->elapsed,
            args[2]->ip_saddr, args[2]->ip_daddr, args[3]->ill_name,
            args[2]->ip_plength);
        last = timestamp;
}

ip:::receive
{
        this->elapsed = (timestamp - last) / 1000;
        printf(" %3d %10d %15s <- %15s %8s %6d\n", cpu, this->elapsed,
            args[2]->ip_daddr, args[2]->ip_saddr, args[3]->ill_name,
            args[2]->ip_plength);
        last = timestamp;
}

Can DTrace decrypt IPsec ESP payloads? No. Ok, so tcpdump isn’t dead yet, but the capabilities offered by DTrace go far deeper. I’ve got a ton more ideas that I could put here, but don’t have time atm. DTrace for the win!

DTrace IP Provider

Tuesday, July 22nd, 2008

Recently introduced (snv_92) is the first piece of the DTrace Network Providers, the DTrace IP Provider. Here is a taste:

root@ultra include$ dtrace -qn 'ip:ip:*:receive{ printf("Packet recieved from %s: %d byte packet\n", args[2]->ip_saddr, args[4]->ipv4_length ); }'
Packet recieved from 74.125.15.85: 40 byte packet
Packet recieved from 74.125.15.85: 40 byte packet
Packet recieved from 8.11.47.20: 88 byte packet
Packet recieved from 8.11.47.20: 216 byte packet
Packet recieved from 8.11.47.20: 200 byte packet
Packet recieved from 8.11.47.20: 136 byte packet
Packet recieved from 8.11.47.20: 104 byte packet
^C

Pretty soon snoop and tcpdump will be nothing more than unpleasant memories. :)

A big thank you to the DTrace Team!!!

Solaris IPsec: Shared Key Transport Mode

Saturday, July 19th, 2008

In this entry we’ll build on the IPsec Basics discussed last time and actually create an IPsec connection.

IPsec can be used for direct system-to-system access known as “transport mode” or to create a virtual pipeline into which everything is encrypted, known as “tunnel mode”. We’re going to look at transport mode, which is an excellent solution for encrypting otherwise unencrypted protocols, such as SNMPv1/2 or telnet.

When encrypting and decrypting data we need keys. This can be done using PKI certificates or IKE generated one-time keys, but in this example, for simplicity’s sake, we’ll create our own “static” keys which will be used on both ends of the connection, and are thus said to be “pre-shared”.

Creating Keys

Using the ipsecalgs command we can see the available algorithms, including DES, 3DES, AES, Blowfish, SHA and MD5. Different algorithms require different key lengths; for instance, 3DES requires a 192 bit key, whereas Blowfish can use a key anywhere from 32 bits up to 448 bits.

For interoperability reasons (such as with OS X or Linux), you may wish to create keys that are both ASCII and hex. This is done by choosing a string and converting it to hex. To know how long a string should be, divide the number of bits required by 8; this is the number of ASCII chars you need. The hex value of that ASCII string will be double the number of ASCII chars. Using the od utility we can convert ASCII to hex. Here I’ll create 2 keys, one for AH which is a SHA1 160-bit key (20 ASCII chars) and another for ESP which is a Blowfish 256-bit key (32 ASCII chars):

benr@ultra ~$ echo "my short ah password" | od -t x1
0000000 6d 79 20 73 68 6f 72 74 20 61 68 20 70 61 73 73
0000020 77 6f 72 64 0a
0000025
benr@ultra ~$ echo "this is my long blowfish esp pas" | od -t x1
0000000 74 68 69 73 20 69 73 20 6d 79 20 6c 6f 6e 67 20
0000020 62 6c 6f 77 66 69 73 68 20 65 73 70 20 70 61 73
0000040 0a
0000041

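To ensure proper length, I like using a little text-rule like you see below in vi. Note that the trailing 0a in the od output is just the newline that echo appends; it is not part of the key and is left out below: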

         1         2         3         4         5         6         7
1234567890123456789012345678901234567890123456789012345678901234567890
----------------------------------------------------------------------
my short ah password
6d792073686f72742061682070617373776f7264

this is my long blowfish esp pas
74686973206973206d79206c6f6e6720626c6f77666973682065737020706173

If you don’t require interoperability by knowing the ASCII equivalent, just grab a random set of hex chars (head /dev/random | od -t x1).
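The length arithmetic above is easy to get wrong by one character, mostly because echo appends a newline. Here is a minimal sketch in POSIX shell that sidesteps that by using printf. The function names ascii_to_hex and key_bits are my own for illustration; they are not part of any Solaris tool.

```shell
# Convert an ASCII passphrase to one contiguous hex string.
# printf '%s' emits no trailing newline, so no stray 0a ends up in the key.
ascii_to_hex() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# Report the key length in bits: each ASCII char is 8 bits (2 hex digits).
key_bits() {
    echo $(( ${#1} * 8 ))
}

ascii_to_hex "my short ah password"; echo
key_bits "my short ah password"
ascii_to_hex "this is my long blowfish esp pas"; echo
key_bits "this is my long blowfish esp pas"
```

The hex strings printed should match the ones under the ruler above, and the bit counts should come out as 160 and 256.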

Now that we have a key, let’s use it.

Configuring IPsec Policies

IPsec policies are rules that the IP stack uses to determine what action should be taken. Actions include:

  • bypass: Do nothing, skip the remaining rules if datagram matches.
  • drop: Drop if datagram matches.
  • permit: Allow if datagram matches, otherwise discard. (Only for inbound datagrams.)
  • ipsec: Use IPsec if the datagram matches.

As you can see, this sounds similar to a firewall rule, and to some extent can be used that way, but you’ll ultimately find IPFilter much better suited to that task. When you plan your IPsec environment consider which rules are appropriate in which place.

IPsec policies are defined in the /etc/inet/ipsecinit.conf file, which can be loaded/reloaded using the ipsecconf command. Let’s look at a sample configuration:

benr@ultra inet$ cat /etc/inet/ipsecinit.conf
##
##  IPsec Policy File:
##

# Ignore SSH
{ lport 22 dir both } bypass { }

# IPsec Encrypt telnet Connections to 8.11.80.5
{ raddr 8.11.80.5 rport 23 } ipsec { encr_algs blowfish encr_auth_algs sha1 sa shared }

Our first policy explicitly bypasses connections in and out (“dir both”, as in direction) for the local port 22 (SSH). Do I need this here? No, but I include it as an example. You can see the format, the first curly block defines the filter, the second curly block defines parameters, the keyword in between is the action.

The second policy is what we’re interested in, its action is ipsec, so if the filter in the first curly block matches we’ll use IPsec. “raddr” defines a remote address and “rport” defines a remote port, therefore this policy applies only to outbound connections where we’re telnet’ing (port 23) to 8.11.80.5. The second curly block defines parameters for the action, in this case we define the encryption algorithm (Blowfish), encryption authentication algorithm (SHA1), and state that the Security Association is “shared”. This is a full ESP connection, meaning we’re encrypting and encapsulating the full packet, if we were doing AH (authentication only) we would only define “auth_algs”.

Now, on the remote side of the connection (8.11.80.5) we create a similar policy, but rather than “raddr” and “rport” we use “laddr” (local address) and “lport” (local port). We could even go so far as to specify the remote address such that only the specified host would use IPsec to the node. Here’s that configuration:

##  IPsec Policy File:
##

# Ignore SSH
{ lport 22 dir both } bypass { }

# IPsec Encrypt telnet Connections to 8.11.80.5
{ laddr 8.11.80.5 lport 23 } ipsec { encr_algs blowfish encr_auth_algs sha1 sa shared }

To load the new policy file you can refresh the ipsec/policy SMF service like so: svcadm refresh ipsec/policy. I recommend avoiding the ipsecconf command except to (without arguments) display the active policy configuration.

So we’ve defined policies that will encrypt traffic from one node to another, but we’re not done yet! We need to define a Security Association that will associate keys with our policy.

Creating Security Associations

Security Associations (SAs) can be manually created by either using the ipseckey command or directly editing the /etc/inet/secret/ipseckeys file. I recommend the latter; I personally find the ipseckey shell very intimidating.

Let’s look at a sample file and then discuss it:

add esp spi 1000 src 8.15.11.17 dst 8.11.80.5 auth_alg sha1 authkey 6d792073686f72742061682070617373776f7264 encr_alg blowfish encrkey 6d792073686f72742061682070617373
add esp spi 1001 src 8.11.80.5 dst 8.15.11.17 auth_alg sha1 authkey 6d792073686f72742061682070617373776f7264 encr_alg blowfish encrkey 6d792073686f72742061682070617373

It looks more intimidating than it is. Each line is “add”ing a new static Security Association, both are for ESP. The SPI, the “Security Parameters Index”, is a simple numeric value that represents the SA, nothing more; pick any value you like. The src and dst define the addresses to which this SA applies; note that you have two SA’s here, one for each direction. Finally, we define the encryption and authentication algorithms and full keys.

I hope that looking at this makes it more clear how policies and SA’s fit together. If the IP stack matches a datagram against a policy whose action is “ipsec”, it takes the packet and looks for an SA whose address pair matches, and then uses those keys for the actual encryption.
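Since a mistyped key only shows up when traffic silently fails, it can help to sanity check key lengths before loading the file. The following is a rough sketch, assuming the one-SA-per-line layout shown above; sa_key_bits is a hypothetical helper name of my own, not a Solaris command.

```shell
# Read ipseckeys-style lines on stdin and, for each "add" line,
# print the SPI plus the bit length of the authkey and encrkey values.
# Each hex digit carries 4 bits.
sa_key_bits() {
    awk '$1 == "add" {
        spi = ""; auth = ""; encr = ""
        for (i = 2; i < NF; i++) {
            if ($i == "spi")     spi  = $(i + 1)
            if ($i == "authkey") auth = $(i + 1)
            if ($i == "encrkey") encr = $(i + 1)
        }
        printf "spi=%s authkey_bits=%d encrkey_bits=%d\n",
            spi, length(auth) * 4, length(encr) * 4
    }'
}

# Example (run as root, since the file is not world readable):
#   sa_key_bits < /etc/inet/secret/ipseckeys
```

Fed the first sample line above, this reports a 160-bit authkey and a 128-bit encrkey, so the Blowfish key in that sample is a 128-bit value rather than the 256-bit string generated earlier. Either length falls inside Blowfish’s 32 to 448 bit range; the point is simply that both ends must carry the identical key.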

Note that if someone obtains your keys you’re hosed. If you pre-share keys in this way, change the keys from time to time or consider using IKE, which can negotiate keys (and thus SAs) on your behalf.

To apply your new SA’s, flush and then load using the ipseckey command:

$ ipseckey flush
$ ipseckey -f /etc/inet/secret/ipseckeys

Is it working? How to Test

All this is for nothing if you don’t verify that the packets are actually encrypted. Using snoop, you should see packets like this:

$ snoop -d e1000g0
Using device e1000g0 (promiscuous mode)
ETHER:  ----- Ether Header -----
ETHER:
ETHER:  Packet 1 arrived at 9:52:4.58883
ETHER:  Packet size = 90 bytes
ETHER:  Destination = xxxxxxxxxxx,
ETHER:  Source      = xxxxxxxxxx,
ETHER:  Ethertype = 0800 (IP)
ETHER:
IP:   ----- IP Header -----
IP:
IP:   Version = 4
IP:   Header length = 20 bytes
IP:   Type of service = 0x00
IP:         xxx. .... = 0 (precedence)
IP:         ...0 .... = normal delay
IP:         .... 0... = normal throughput
IP:         .... .0.. = normal reliability
IP:         .... ..0. = not ECN capable transport
IP:         .... ...0 = no ECN congestion experienced
IP:   Total length = 72 bytes
IP:   Identification = 36989
IP:   Flags = 0x4
IP:         .1.. .... = do not fragment
IP:         ..0. .... = last fragment
IP:   Fragment offset = 0 bytes
IP:   Time to live = 61 seconds/hops
IP:   Protocol = 50 (ESP)
IP:   Header checksum = ab9c
IP:   Source address = XXXXXXXXX
IP:   Destination address = XXXXXXXXXXXX
IP:   No options
IP:
ESP:  ----- Encapsulating Security Payload -----
ESP:
ESP:  SPI = 0x3e8
ESP:  Replay = 55
ESP:     ....ENCRYPTED DATA....

And there you go. You can now encrypt communication transparently in the IP stack. It’s a little effort to get going, but once it’s running you’re done… just remember to rotate those keys every so often!
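One extra check worth making against the snoop output above: snoop prints the SPI in hex, while the ipseckeys file used decimal. A small sketch that pulls the SPI out of saved snoop text and prints it in decimal follows; spi_from_snoop is a name I made up for illustration, and it assumes the "ESP:  SPI = 0x..." line format shown above.

```shell
# Read verbose snoop output on stdin and print each ESP SPI in decimal.
spi_from_snoop() {
    sed -n 's/^ESP:  *SPI = \(0x[0-9a-fA-F]*\).*/\1/p' |
    while read -r hex; do
        # printf accepts C-style hex constants for %d
        printf '%d\n' "$hex"
    done
}

printf 'ESP:  SPI = 0x3e8\n' | spi_from_snoop
```

Here 0x3e8 comes out as 1000, which is the spi of the first SA in the ipseckeys file, so the packet really is using the SA we defined.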

Why do I care about this again?

In this modern era where SSH is the standard for communication it’s easy to get jaded. Either you can communicate via SSH or easily create a tunnel to get the job done. But let’s face it, SSH is massively overused, and in many cases SSH tunneling is just downright hacky. With IPsec we can as easily encrypt 100 ports as 1, whereas with SSH that’s very ugly. Furthermore, there are many instances in which you want a secure communications channel that’s as transparent as possible, such as a network database connection that doesn’t offer native encryption or perhaps an SCM or even SNMPv1/2.

While most applications today provide some type of encryption capability, it’s surprising how few people leverage them unless they are the default. In situations where it’s difficult or impractical to use encryption in the application, IPsec can be a really sweet solution.

Solaris IPsec: Basics

Saturday, July 19th, 2008

IPsec is a technology widely known. Created for IPv6 and backported to IPv4, it adds a security layer into the IP stack. Prior to IPsec we needed to encrypt data before sending it down the stack and then decrypt it on the other side once it came back out using technologies like SSH or SSL/TLS. IPsec simplifies this by transparently handling encrypt/decrypt as well as header authentication.

But for all its intended simplicity and transparency it is to the administrator anything but. If you google around you find piles of email on various lists of those who tried to get IPsec working and gave up in frustration and instead used OpenVPN, on both Linux and Solaris.

Part of the reason IPsec is so complex to manage is the number of technologies that all must work together properly in order to get something functional. This includes acronyms like AH, ESP, IKE, DH, RSA, SA, SAD, etc. In many respects it’s similar to the frustration encountered the first time you approach LDAP and are overwhelmed with OU=, CN=, DN=, etc. Just like LDAP, with a little practice you get it sorted out and start gaining ground. We’ll start high level and zoom in… please please, read this before just copying examples! (I know you won’t, but it’ll save you time.)

IPsec is fundamentally similar to a firewall, in that you specify IPsec Policies that determine how datagrams are handled. If a policy matches a datagram it’s handled… if no policy matches, the stack acts like normal.

That’s where the firewall analogy stops. Like a firewall, the policy says what to do and when to do it… but for encrypting or decrypting data we also need to know how to do it. The information that explains the how is known as a Security Association (SA). This is maintained in a special database named, unremarkably, the Security Association Database (SAD or SADB, depending on who you ask, same thing though).

It’s the SA that actually contains the keys used for encrypting, decrypting and authenticating IPsec datagrams. So a packet goes through the stack, and a policy says “If the datagram source is 1.2.3.4 and the destination is 5.6.7.8 use IPsec.” Now that the stack knows to use IPsec to encrypt, it goes looking in the SADB for an SA that contains the keys. There is a lot of duplication here because they are independent things, so you specify the authentication algorithm, encryption algorithm, source and destination addresses in both the IPsec policy and the SA. It’s strange at first. :)

Now, just as IPsec is supposed to make life simple and transparent, so is IKE (Internet Key Exchange). An IKE daemon runs on both sides of a connection and negotiates SA’s for you. This provides a variety of benefits, but simplicity isn’t exactly one of them because you still need to configure IKE rules which are similar to SA’s. In other words, you can’t just create an IPsec policy, and then enable IKE and be done.

Before we get into the examples, let’s look at the various files and commands involved:

Files:

    * IPsec:
          o /etc/inet/ipsecinit.conf: IPsec Policy Definitions
          o /etc/inet/secret/ipseckeys: IPsec SA Definitions
    * IKE:
          o /etc/inet/ike/config: IKE Global Configuration and Rules
          o /etc/inet/secret/ike.preshared: File containing Preshared Key Definitions
          o /etc/inet/secret/ike.privatekeys/: Directory containing IKE Private Keys
          o /etc/inet/ike/publickeys/: Directory containing IKE Public Keys
          o /etc/inet/ike/crls/: Directory containing IKE Certificate Revocation Lists (CRLs)

Commands:

    * IPsec:
          o ipsecconf: Load or Display IPsec Policy Configuration
          o ipseckey: Manually manipulate IPsec Security Association Database (SADB)
          o ipsecalgs: Display available IPsec ESP/AH Algorithms
    * IKE:
          o in.iked: IKE Daemon
          o ikeadm: Manipulate IKE parameters and state (flush, add, get, set, ...)
          o ikecert: Manipulate IKE's on-filesystem public-key certificate databases

The simplest configuration would be to “pre-share” keys, meaning create a key and manually put it on both systems. In my next blog entry we’ll step through actually creating an IPsec policy, creating keys, creating Security Associations, and testing an IPsec connection.

Explore Your Storage with FileBench

Friday, July 18th, 2008

FileBench is one of the most powerful and flexible benchmarking tools around. Your typical tool, like Bonnie++ or IOzone, tends to take some discrete operation and do it multiple times at differing block sizes (8K file in 1K blocks, 8K file in 2K blocks, etc). These commonly used benchmarks are known as “micro-benchmarks”. Using them tends to be controversial and can be confusing, leading to claims like “My new Seagate disk gets 800MB/s!!!” In order to make them useful you need to use DirectIO or perform operations that are larger than your installed RAM to avoid “cache effect”.

By contrast, FileBench is better described as an “application simulator” or as I prefer to call it, a “workload generator”. Whereas most benchmarks may use only a single file, FileBench creates filesets prior to actually running a workload. In this way, it can pre-create thousands of files in hundreds of directories, with files of varying sizes (all within given ranges), on which to actually test. This gives you a much more realistic idea of what performance may actually look like in production.

FileBench workloads are actually scripts in the “F” language which define flowops, such as createfile, deletefile, fsync, closefile, etc. This means that you can effectively model bizarre or unusual scenarios, like creating thousands of new files, writing 1 byte, and closing each. Furthermore, because we’re working on a much larger scale, we can leave caching enabled to see how caching helps or hurts a workload.

All that said, FileBench is somewhat non-intuitive and is in a lot of flux right now. These things are slowly being worked out, and I myself am trying to chip in, but there is a lot of polish to be slapped on this thing.

To get started with FileBench, download the latest version. Packages are available for Solaris (X86 & SPARC), as well as source. Please note that if you install on SPARC or in a Zone the “isaexec” links will fail… in that case, just copy the appropriate binaries (amd64/ or sparc/) into /opt/filebench/bin. Furthermore, please note that the “amd64/” is a farce, FileBench is distributed 32bit X86, not 64bit. (That’ll be fixed soon.)

Running a Single Workload

In /opt/filebench/bin/(platform/) you will find a binary called “go_filebench”; this is FileBench itself. Invoking it will start an interactive shell. Here you can load a workload to run. The workloads are found in /opt/filebench/workloads. The workloads are “F” script files that you can look at and modify to fit your specific need. Each workload has variables associated with it that determine where to run the benchmark (the directory or filesystem you wish to test), number of files, filesize, thread count, etc. We can either create a custom workload with the variable values we want, or we can modify them in the interactive shell prior to the run.

Let’s take a simple example. In this case I’ve decided to run the “varmail” workload on “/pool/test” (a ZFS dataset):

root@ultra ~$ /opt/filebench/bin/amd64/go_filebench
FileBench Version 1.3.3
filebench> load varmail
 8429: 4.475: Varmail Version 2.1 personality successfully loaded
 8429: 4.475: Usage: set $dir=
 8429: 4.475:        set $filesize=    defaults to 16384
 8429: 4.476:        set $nfiles=     defaults to 1000
 8429: 4.476:        set $nthreads=   defaults to 16
 8429: 4.476:        set $meaniosize= defaults to 16384
 8429: 4.476:        set $readiosize=  defaults to 1048576
 8429: 4.476:        set $meandirwidth= defaults to 1000000
 8429: 4.476: (sets mean dir width and dir depth is calculated as log (width, nfiles)
 8429: 4.476:  dirdepth therefore defaults to dir depth of 1 as in postmark
 8429: 4.476:  set $meandir lower to increase depth beyond 1 if desired)
 8429: 4.476:
 8429: 4.476:        run runtime (e.g. run 60)
filebench> set $dir=/pool/test
filebench> run 60
 8429: 39.650: Creating/pre-allocating files and filesets
 8429: 39.656: Fileset bigfileset: 1000 files, avg dir = 1000000, avg depth = 0.5, mbytes=15
 8429: 39.657: Creating fileset bigfileset...
 8429: 46.876: Preallocated 812 of 1000 of fileset bigfileset in 8 seconds
 8429: 46.876: waiting for fileset pre-allocation to finish
 8429: 46.876: Starting 1 filereader instances
 8430: 47.883: Starting 16 filereaderthread threads
 8429: 50.893: Running...
 8429: 111.443: Run took 60 seconds...
 8429: 111.445: Per-Operation Breakdown
closefile4                382ops/s   0.0mb/s      0.0ms/op        5us/op-cpu
readfile4                 382ops/s   6.3mb/s      0.0ms/op       28us/op-cpu
openfile4                 382ops/s   0.0mb/s      0.0ms/op       29us/op-cpu
closefile3                382ops/s   0.0mb/s      0.0ms/op        6us/op-cpu
fsyncfile3                382ops/s   0.0mb/s     19.8ms/op       31us/op-cpu
appendfilerand3           382ops/s   3.0mb/s      0.0ms/op       43us/op-cpu
readfile3                 382ops/s   6.3mb/s      0.0ms/op       28us/op-cpu
openfile3                 382ops/s   0.0mb/s      0.0ms/op       29us/op-cpu
closefile2                382ops/s   0.0mb/s      0.0ms/op        6us/op-cpu
fsyncfile2                382ops/s   0.0mb/s     20.8ms/op       34us/op-cpu
appendfilerand2           382ops/s   3.0mb/s      0.0ms/op       32us/op-cpu
createfile2               382ops/s   0.0mb/s      0.1ms/op       71us/op-cpu
deletefile1               382ops/s   0.0mb/s      0.0ms/op       44us/op-cpu

 8429: 111.445:
IO Summary:      300414 ops 4961.5 ops/s, (763/763 r/w)  18.6mb/s,    145us cpu/op,  10.2ms latency
 8429: 111.445: Shutting down processes
filebench> quit

Here I loaded the “varmail” workload (the actual file is “/opt/filebench/workloads/varmail.f”) and set my test directory as the directory to run in. I left the rest of the defaults alone. In the output we see that it first created a fileset: in this case, 1000 files in 1 directory, each file with a random size. Here’s a look:

root@ultra 00000001$ ls -lh | more
total 11M
-rw-r--r-- 1 root root 3.7K Jul 18 15:31 00000001
-rw-r--r-- 1 root root 7.6K Jul 18 15:31 00000002
-rw-r--r-- 1 root root  11K Jul 18 15:31 00000003
-rw-r--r-- 1 root root 5.6K Jul 18 15:31 00000004
-rw-r--r-- 1 root root  13K Jul 18 15:31 00000005
-rw-r--r-- 1 root root 1.4K Jul 18 15:31 00000006
-rw-r--r-- 1 root root  16K Jul 18 15:31 00000007
-rw-r--r-- 1 root root  921 Jul 18 15:31 00000008
-rw-r--r-- 1 root root 8.2K Jul 18 15:31 00000009
-rw-r--r-- 1 root root  15K Jul 18 15:31 00000010
-rw-r--r-- 1 root root  11K Jul 18 15:31 00000011
-rw-r--r-- 1 root root 7.2K Jul 18 15:31 00000012
-rw-r--r-- 1 root root 7.9K Jul 18 15:31 00000013
-rw-r--r-- 1 root root 2.7K Jul 18 15:31 00000014
-rw-r--r-- 1 root root 1.1K Jul 18 15:31 00000015

By tweaking the variables we can increase the spread.

Finally, in the output we see that the workload ran for the time (in seconds) that we specified and then dumped out both op-specific and aggregate stats. For instance, we can easily see in the output that the fsyncs are the most time-consuming operations, at roughly 20ms per op.
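With longer workloads the per-operation table gets harder to eyeball, so here is a minimal sketch of ranking it by latency. It assumes you have saved the run's output to a file (the name varmail.out is made up), and it relies only on the line layout shown in the output above; the function name slowest_ops is mine, not part of FileBench.

```shell
# slowest_ops: read FileBench output on stdin and print "ms/op name"
# for each per-operation line, slowest first.
# Only lines shaped like
#   name  NNNops/s  N.Nmb/s  N.Nms/op  NNus/op-cpu
# are considered; banners and the IO Summary line are skipped.
slowest_ops() {
    awk 'NF == 5 && $2 ~ /ops\/s$/ && $4 ~ /ms\/op$/ {
             ms = $4
             sub(/ms\/op$/, "", ms)      # strip the unit, keep the number
             printf "%s %s\n", ms, $1
         }' | LC_ALL=C sort -k1,1nr -k2,2
}

# Hypothetical usage, with the run output saved as varmail.out:
#   slowest_ops < varmail.out | head -3
```

Fed the varmail run above, the top of the list is the two fsync operations (20.8 and 19.8 ms/op), followed by createfile2 at 0.1 ms/op.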

Running Multiple Workloads with BenchPoint

Instead of running just a single workload, we commonly want to run several. This might be different workloads or even the same workload repeatedly but with different settings. We can do that with BenchPoint (currently named “/opt/filebench/bin/filebench”; confusing, I know).

BenchPoint is actually a Perl framework around the FileBench (“go_filebench”) binary. It utilizes a profile that defines all the workloads we wish to run as well as all the variable settings for those. It then uses some additional scripts to handle special operations (such as exporting/importing ZFS pools after each run) and stats collection (such as watching vmstat during a run).

To get started, go into the /opt/filebench/config/ directory and look at the various *.prof files. When you see one you like, copy it to some other location, such as /tmp. Do not use the profiles in config/ as is!!! Customize them somewhere else!!! Now edit to taste, changing the global $dir to the location you wish to execute the workloads, etc.

Now that we have a customized profile, let’s give it a run:

root@ultra config$ cp filemicro.prof /tmp/benr_filemicro.prof
root@ultra config$ cd /tmp
root@ultra tmp$ vi benr_filemicro.prof
...
root@ultra tmp$ more benr_filemicro.prof
# ident "@(#)filemicro.prof     1.2     08/03/31 SMI"

DEFAULTS {
        runtime = 60;
        dir = /pool/test/;
        stats = /tmp/stats;
        filesystem = nofs;
        description = "FileMicro Testing";
}

Please note, I’m using ZFS but I specified the above filesystem as “nofs”… that’s because if you specify “zfs”, BenchPoint will export/import the zpool prior to each workload. If you do not want this, specify anything other than “zfs”.

Now let’s run this profile. Please note, you need to be in the local directory with your custom profile and you must omit the “.prof” suffix.

root@ultra tmp$ /opt/filebench/bin/filebench benr_filemicro
parsing profile for config: createandalloc
Running /tmp/stats/ultra-nofs-benr_filemicro-Jul_18_2008-15h_50m_11s/createandalloc/thisrun.f
FileBench Version 1.3.3
 8458: 0.021: FileMicro-Create Version 2.1 personality successfully loaded
 8458: 0.021: Creating/pre-allocating files and filesets
 8458: 0.021: File largefile: mbytes=0
 8458: 0.021: Creating file largefile...
 8458: 0.021: Preallocated 1 of 1 of file largefile in 1 seconds
 8458: 0.021: waiting for fileset pre-allocation to finish
 8458: 0.022: Running '/opt/filebench/scripts/fs_flush nofs /pool/test/'
filesystem type is: nofs, no action required, so exiting
 8458: 0.031: Change dir to /tmp/stats/ultra-nofs-benr_filemicro-Jul_18_2008-15h_50m_11s/createandalloc
 8458: 0.031: Starting 1 filecreater instances
 8461: 1.035: Starting 1 filecreaterthread threads
 8458: 4.045: Running...
 8458: 5.055: Run took 1 seconds...
 8458: 5.055: Per-Operation Breakdown
finish                    507ops/s   0.0mb/s      0.0ms/op        2us/op-cpu
append-file               508ops/s 507.0mb/s      1.6ms/op     1615us/op-cpu

 8458: 5.055:
IO Summary:        513 ops 508.0 ops/s, (0/508 r/w) 507.0mb/s,   1666us cpu/op,   1.6ms latency
 8458: 5.055: Stats dump to file 'stats.createandalloc.out'
 8458: 5.055: in statsdump stats.createandalloc.out
 8458: 5.055: Shutting down processes
Generating html for /tmp/stats/ultra-nofs-benr_filemicro-Jul_18_2008-15h_50m_11s
file = /tmp/stats/ultra-nofs-benr_filemicro-Jul_18_2008-15h_50m_11s/createandalloc/stats.createandalloc.out

parsing profile for config: createandallocsync
Running /tmp/stats/ultra-nofs-benr_filemicro-Jul_18_2008-15h_50m_11s/createandallocsync/thisrun.f
FileBench Version 1.3.3
 8469: 0.012: FileMicro-Create Version 2.1 personality successfully loaded
....
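The two rules for launching a profile (be in the profile's directory, and drop the “.prof” suffix) are easy to trip over, so a small wrapper can take care of both. This is only a sketch: run_profile and the FILEBENCH variable are names I made up, and the default path is simply where BenchPoint lives on my system.

```shell
# Path to the BenchPoint wrapper as installed in this post; override if yours differs.
FILEBENCH=${FILEBENCH:-/opt/filebench/bin/filebench}

# run_profile PATH/TO/NAME.prof
# Changes into the profile's directory and runs BenchPoint with the
# ".prof" suffix stripped, which is how it expects to be invoked.
run_profile() {
    profile=$1
    case $profile in
        *.prof) ;;
        *) echo "expected a .prof file: $profile" >&2; return 2 ;;
    esac
    [ -f "$profile" ] || { echo "no such profile: $profile" >&2; return 1; }
    # Subshell, so the caller's working directory is left untouched.
    (
        cd "$(dirname "$profile")" || exit 1
        "$FILEBENCH" "$(basename "$profile" .prof)"
    )
}

# Example, using the profile created above:
#   run_profile /tmp/benr_filemicro.prof
```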

The key to success with FileBench isn’t in the stats that it outputs, but rather in the stats you can gather while it is generating load. Use tools like zpool iostat, iostat, vmstat, or even DTrace.
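One way to make that a habit is to wrap the benchmark so the collectors start and stop with it. The sketch below is an assumption-laden example rather than anything FileBench ships: run_with_stats, the COLLECTORS list, the 5-second interval, and the output directory are all my own choices. DTrace is left out because what you would trace depends entirely on the question you are asking.

```shell
# Collectors to sample while the benchmark runs, one command per line.
# The 5-second interval is arbitrary; adjust to taste.
COLLECTORS='vmstat 5
iostat -xn 5
zpool iostat 5'

# run_with_stats OUTDIR CMD [ARGS...]
# Starts every collector in the background (logging to OUTDIR/<tool>.log),
# runs CMD in the foreground, then stops the collectors.
# Returns CMD's exit status.
run_with_stats() {
    outdir=$1; shift
    mkdir -p "$outdir" || return 1
    pids=
    while IFS= read -r collector; do
        [ -n "$collector" ] || continue
        name=${collector%% *}            # first word names the log file
        # Unquoted on purpose so the command splits into words.
        $collector > "$outdir/$name.log" 2>&1 < /dev/null &
        pids="$pids $!"
    done <<EOF
$COLLECTORS
EOF
    "$@"                                 # the benchmark itself
    status=$?
    for pid in $pids; do
        kill "$pid" 2>/dev/null
    done
    wait $pids 2>/dev/null
    return $status
}

# Example (run from /tmp, where the custom profile lives):
#   run_with_stats /tmp/stats/collectors /opt/filebench/bin/filebench benr_filemicro
```

Afterwards the logs can be lined up against the run's timestamps to see what the disks and CPUs were doing during each workload.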

FileBench also includes several handy utilities to assist in your benchmarking, but I’ll discuss those separately in the future.

For more info on FileBench try these links:

Thoughts on “Open Storage”

Friday, July 11th, 2008

Some marketing terms come along that make you stop and think. Sun is pushing Open Storage, pairing it with terms like “revolution”, and you have to ask: What’s really new here? I suppose you have to step back and consider that not all industries are the same, and what one customer considers “catching up with reality”, another customer considers “a fresh new approach”.

When I think about what Sun’s concept of Open Storage really boils down to, it is this: servers aren’t just storage clients. If you think about the direction Fibre Channel and even iSCSI solutions were going, the drive was to push more and more of the storage management and access into array controllers such that servers are clients only. I think I told the story in this blog some time ago of when I stormed out of HDS’s data center after realizing you required a Windows server to manage the array. Storage should be autonomous!

But things have changed. When I stormed out of HDS I was managing an environment of large SPARC systems that had 1 or 2 internal disks just for the OS, or small 1U x86 servers with just enough local disk for the OS and apps. With the increasing availability of high-performance multi-core CPUs, 2U systems are more attractive, and local disk storage is commonly managed by a dedicated RAID card with onboard cache of up to 512MB. When you have racks full of 2U systems that each have more than 2.5TB of RAID6 and a write-back cache to boot… it’s time to think differently. Filesystems like Lustre or even pNFS (parallel NFS) look very attractive to the enterprise… yet again, HPC technology trickles down to the enterprise market.

While the push from Sun has just started publicly this year, there have been signs of this for a long time, especially when Jonathan declared many moons ago that all proprietary OSes would have to go, which at the time was shocking given that all the storage arrays ran various embedded or specialized OSes. So, it should be noted that this would seem to be the fulfillment of something Sun has been working toward for quite some time, unified under a single banner of “Open Storage”.

The implications could really change the landscape though. Traditionally in large enterprise storage you spend a lot of time working with vendors, testing configurations, listening to presos, etc. It was a very hands-off world. This new push would mean that Storage Administrators are going to spend less time making purchasing decisions and more time learning how to install, manage, and optimize their deployments. When “secure storage” goes from checking a box to configuring IPsec things get sticky. But that also provides new opportunities for administrators and vendors alike. In fact, that reminds me of something…. :)

So the real question is, how will “traditional” storage vendors like HDS and EMC respond? If you don’t have a server business, getting behind the idea of buying servers and JBODs isn’t terribly attractive. That suggests that in 3 years companies like Dell, Sun, IBM, and HP will rule the storage world, leaving EMC to supply a dying market while it continues to cash in on its acquisitions like VMware and RSA.

So, like I said in the beginning…. “Open Storage” is either something mind-numbingly obvious or something radically new, depending on where you sit.

Sun Introduces New “Open” Storage Array Line: J4000

Friday, July 11th, 2008

Sun’s recently been on an “Open Storage” kick. They define this as using “industry standard components” together with “open source software”. Frankly, the pitch sounds pretty similar to the one Sun has had for the last 2 decades of “open standards” products… the new tack is really just pitching the cost savings of specifically depending on open source software, freeing you (potentially) from high licensing costs.

So, there are 3 arrays, we’ll look at them each.

The Sun Storage J4200 array is a single or dual controller external SAS JBOD and offers 3 SAS ports per controller. It supports “Hardware Raid”, but notice that it’s “(with RAID HBA)”, so there is no hardware RAID happening on the controllers (the same goes for similar solutions from Dell). This unit is 2U and features 12 3.5″ drives. While the interconnect is SAS, you can use either SAS or SATA drives. The cheapest setup is single controller with 2 250GB disks, for $3,140.00. If you customize one with a fairly normal config of dual controllers, 12x 500GB SATA drives (7200RPM), plus 1 cable, 1 HBA, and a rail kit, you come up to around $9,000 with a raw capacity of 6TB.

The Sun Storage J4400 Array is the same basic array but with more disk capacity. You still don’t get hardware RAID in the controllers, only on the “RAID HBA”. The chassis accommodates up to 24 drives in 4U, and starts at $7,410.00, although that price is single controller with 12x 73GB SAS drives. A reasonably stocked config with 24x 500GB SATA drives, dual controllers, dual HBAs, cables, and rack kit comes to just over $20,000 with a raw capacity of 12TB.

Lastly, the one many people have been waiting for… the Sun Storage J4500 Array. This is a Thumper, but notice what’s wrong in the picture above? The server component is replaced with SAS controllers. This is a 48-disk, 4U storage array in the traditional sense, not a hybrid server. The unit does not feature independently replaceable controllers, but otherwise shares in the basic vibe of its siblings in the lineup. It has “Four 3 Gb/sec SAS ports per tray”; however, 2 are server ports and 2 are expansion (daisy chain) ports. Prices start at $32,960.00 for a 48x 500GB SATA setup (24TB raw), and go up to $60,960.00 for a 48x 1TB SATA setup (48TB raw). SAS disks are not an option in the unit, and you will have to use a “RAID HBA” for “hardware RAID”.

One thing I’ll note is that they are branded “Sun Storage”, rather than “StorageTek”. I did find one or two “Sun StorageTek J4000” references around but very few and they looked like mistakes. I’m supportive. :)

In general, I think it’s good to see Sun pushing new storage products. Does it differentiate enough from the other offerings in its line? I dunno. Clearly Sun is addressing the demise of Fibre Channel in a large segment of the market, but the competition in that space is with Dell and the like, and it is very competitive. For instance, a Dell MD1000 configured like the J4200 above (12x 500GB SATA, dual controller standard, single HBA, cable, rack kit, etc) is $7,000… versus $9,000 for a J4200 with only a single controller. Which is better? That comes down to the HBA actually, and I’m terrified (RAID 1E? I only see that on Adaptec) that the HBA is an Adaptec controller rebranded as StorageTek. The Dell PERC HBAs (LSI MegaSAS) are the best around, hands down.
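To put the prices quoted so far on a common footing, here is a quick dollars-per-raw-TB calculation. The numbers are just the approximate configured prices and raw capacities mentioned in this post (the J4400 figure is “just over” $20,000, so treat it as a floor); the labels and the price_per_tb helper are mine.

```shell
# price_per_tb: read "label price_usd raw_tb" lines on stdin and print
# "label dollars_per_raw_tb", rounded to the nearest dollar.
price_per_tb() {
    awk 'NF == 3 && $3 > 0 { printf "%s %.0f\n", $1, $2 / $3 }'
}

# Approximate configured prices from this post (USD) and raw capacity (TB).
price_per_tb <<'EOF'
J4200_6TB    9000  6
J4400_12TB  20000 12
J4500_24TB  32960 24
J4500_48TB  60960 48
MD1000_6TB   7000  6
EOF
```

This only compares raw capacity per dollar; it says nothing about controller count, HBA quality, or support, which is where the real differences lie.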

What Sun has that Dell, or anyone else, doesn’t have is the software. Lustre is now Sun, ZFS is Sun, SAMFS/QFS are Sun, not to mention HoneyComb and Sun’s work on pNFS. Sun has storage software unlike any other vendor in the industry. I can only hope that Sun is pushing “Open Storage” to raise awareness, not of something new, but of what it’s been dedicated to for some time as part of OpenSolaris. If that awareness rises and the low barrier to entry gets customers excited, they may, hopefully, be willing to pay more money because they’ll have support for hardware and software from a single vendor. Let’s hope. ;)