In the podcast we’ll use one of two formats: a classic 1-on-1 interview or a round-table discussion. This episode is the latter.
Together with Joe Moore of Siemens and Mark Imbriaco of 37signals we discuss the following questions:
What is the mark of a good SA?
What are the essential qualifications?
Do formal education and/or certs matter?
What’s really new and unique is that Joe, Mark, and I don’t know each other. They both responded to a request for participants on the OpenSolaris SA’s list and matched the qualifications I was aiming for; that’s the extent of it. This is interesting because even though the three of us are in very different circumstances, have different histories, and are geographically separated, we’re not very dissimilar. It amazes me how much unity there is among a group with so few governing institutions.
The podcast is 1hr 6 mins and definitely worth a listen. Feedback is appreciated, but this was the first one, so be kind. (Yes, I know my audio was too low.)
Systems Administration is a lonely job. There is very little support for professionals and we’re all very divided. There is, I find, a common desire to know what “the other guy” is doing or thinking, what tools they use, how they arrange their day, how they deal with stress, how they select one gig over another, how they provide for a wife and family in it all.
A podcast is a great way to do this, but there isn’t much out there. The closest was Kevin Devin’s “From the Trenches”, which appears to be dead and was always focused on the Windows/Cisco network segment, which is frankly uninteresting to me.
This new podcast will be interview and round-table focused, unlike my previous podcast(s) which were all in the “one guy with a mic” format. Even I didn’t want to listen to them unless I was really bored.
I’m organizing several “admin roundtables” for diverse discussion and starting to line up several interviews with industry super-stars. If you want to participate please let me know. I am particularly interested in interviewing vendor backline support engineers, if you are one please contact me.
Keep your eyes open, I plan to roll out episodes soon.
This, naturally, adds another massive blow in the “fibre channel is dead” debate, assuming there is anyone left who will continue to argue the point. 4Gb FC-Fabric clearly has a useful place today, but 10GbE with its various RDMA add-ons will likely do what InfiniBand promised oh so long ago, and once the price point of 10GbE hits the sweet spot in 1-2 years it’s all over for FC-Fabric.
But more interesting than the technology is the business practice of Brocade. McData was an upstart gunning against the more established Brocade… then McData shot like a rocket and made Brocade look like an aging dog… a problem solved by acquisition, bringing McData’s lucrative chassis-based core switch business to Brocade, which was being relegated more and more toward edge switching and competing increasingly with QLogic. They’ve repositioned, trying to appear more like a data management company than a storage networking company, including several attempts to popularize terms like FAN (File Area Network… you can hear the thud echo every time it’s said aloud).
We’ve all seen businesses so stupidly committed to dying products and services that they refused to diversify and grow with the trends for fear of losing face. Brocade is, I think, simultaneously admitting defeat and refusing to die. Awesome business savvy. A look at any business history shows that failure to commit to trends due to “core competency” only keeps you from becoming all that you can be. Steve Jobs said once, many moons ago, that if Xerox had only realized what they had with the Alto computer developed at Xerox PARC, they would have become the giant to rival IBM in the computer market. Xerox thought its core competency was the copier business… in hindsight we see that was absolutely, insanely wrong and short-sighted; Xerox’s core competency was R&D, from which it could have spun off companies or divisions to execute. 3M and Dow are good examples of this model.
With that in mind, I can’t help but wonder if Brocade will be a much bigger player in the future. Right now it looks like desperation to stay afloat, but who knows; in 10 years we may all feel very differently.
About 6 years ago I got tired of fixing problems with Tamarah’s Windows/Linux box and decided to pay the money for a 15″ PowerBook. It was an excellent investment: she could work on the couch, with no more lockups and reboots in Windows or mysterious “Bennnnnnn!” problems in Linux. Since then she’s upgraded to a black MacBook, and when I joined Joyent they provided me with a MacBook Pro (which I’m typing on now). So far each of these 3 laptops has lost at least one drive. Since we’ve fallen in love with iTunes and iPhoto these drive failures have been a major blow, and prior to Leopard’s Time Machine we didn’t do regular backups.
This post refers solely to drives for personal use. In the datacenter you should be using RAID and/or some backup or redundancy method, in which case a single drive failure isn’t something you waste time trying to analyze or fix.
I’ve run into 3 major types of drive failure:
PCB Failure: A case in which the PCB has been “fried”. This happened dramatically once when I connected an IDE drive to a system and let the disk rest, upside down, on top of the case. It ran fine for a minute and then, with a pop and a spark, a hole was burned in a chip on the PCB. In this case the only solution is to go to eBay, buy an identical drive, and swap the PCB.
Click of Death: This means catastrophic damage to a drive. The head is unable to position itself or read data and sweeps the platters in a sort of seizure. This is the sort of problem that likely requires you to open the drive or spend big bucks.
Damaged Cylinders: This is the kind of problem where the drive seems fine, mounts up, and you can read for a bit, but then it hits some area of the platters where it freaks out and eventually spins up and down. This is most clearly seen when you image the drive with dd and it hits some point and exits on max retries.
Information on drive forensics and recovery is sparse. You tend to get one of three answers:
“d00d, totally put it in the freezer and then try it!” Variations come based on how you should protect against condensation, the best I’ve heard is to pack the drive in minute-rice.
“Send it to DriveSavers” (or another recovery company). This is super expensive, anywhere from $600 to beyond $2,000. You send them the failed disk and optionally a new drive to restore to. This can take weeks and is only for super extreme cases.
“Just download tool xyz…” There are a variety of software solutions for do-it-yourself drive recovery, most of them old DOS-based programs recommended on forums populated largely by Windows users.
In my most recent failure, the drive died one day for seemingly no reason. There was no impact or horror story; the OS just locked up, I rebooted, and the OS would start to load and then drift into an infinite slumber. I went through the painstaking process of replacing the drive in my MacBook Pro and re-installed everything from scratch. Once back up and running I put the old drive in a USB enclosure and attempted to image it using dd. On every attempt it would get 19GB into the drive and then give up.
This kind of problem is the easiest to deal with. There are special versions of dd, namely GNU ddrescue, which works just like dd but, instead of failing on bad blocks, tracks forward after a number of retries until it has read the whole disk, for better or worse.
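A typical ddrescue run is done in two passes: a fast pass that grabs everything readable, then a retry pass over the areas logged as bad. The device and path names below are hypothetical stand-ins; substitute the raw device of your failing drive (e.g. something like /dev/rdsk/c2t0d0p0 on Solaris).

```shell
#!/bin/sh
# Sketch of a two-pass GNU ddrescue recovery (device/path names are hypothetical).
DEV=/dev/rdsk/c2t0d0p0        # raw device of the failing drive
IMG=/tank/recovery/disk.img   # destination image file
LOG=/tank/recovery/disk.log   # ddrescue map/log of good and bad regions

# Pass 1 (-n): copy every readable block quickly, skipping bad areas
# without retrying, and record the skipped regions in the log file.
ddrescue -n "$DEV" "$IMG" "$LOG"

# Pass 2 (-r 3): return to the regions marked bad in the log and retry
# each up to three times, filling in whatever can still be read.
ddrescue -r 3 "$DEV" "$IMG" "$LOG"
```

The log file is what makes this safe on a dying drive: you can interrupt the run at any point and resume later without re-reading the good regions.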
In the case of my MacBook Pro drive I attached the USB enclosure to my OpenSolaris box, installed ddrescue, and imaged the drive to a file. Of the 80GB drive the tool reported that I lost about 250MB. I then created a ZFS zvol of 80GB, used traditional dd to copy the image file into the volume, and then exported it as an iSCSI target using iscsitadm. Using the globalSAN iSCSI Initiator for OS X I mounted the iSCSI target, and used OS X Disk Utility to verify and repair the HFS+ volume. All went well and I could then mount the volume and extract data. w00t!!! iSCSI Rules!
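The zvol/iSCSI round trip above can be sketched in three commands (pool, path, and target names here are hypothetical; iscsitadm is the legacy Solaris iSCSI target CLI):

```shell
#!/bin/sh
# Sketch: present a ddrescue image back to a Mac as an iSCSI disk.
# Pool/target names are made up for illustration.

# Create an 80GB ZFS volume (zvol) to hold the recovered image.
zfs create -V 80G tank/mbp-recovery

# Copy the ddrescue image into the raw zvol device with plain dd.
dd if=/tank/recovery/disk.img of=/dev/zvol/rdsk/tank/mbp-recovery bs=1048576

# Export the zvol as an iSCSI target for the Mac's initiator to attach to.
iscsitadm create target -b /dev/zvol/rdsk/tank/mbp-recovery mbp-recovery
```

Once the target is up, the globalSAN initiator on the Mac sees it as an ordinary disk, so Disk Utility can verify and repair the HFS+ volume exactly as if it were local hardware.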
The tale of Tamarah’s MacBook drive didn’t end so happily. I had a backup of her laptop but it was really old. Glenn, our son, grabbed the laptop on the table, sending it crashing to the tile floor below, where it hit on the corner where the drive sits. The laptop was fine, but the drive was toast. After a Mac Genius showed us how to replace the drive, I bought a new disk at Fry’s and got things installed and running again. But the drive contained a lot of projects she wanted, and as is commonly the case, when I showed her the data from the old backup she was uncertain whether it was enough. This is the big problem of the “unknown”: when all your stuff is in one place you commonly forget what exactly is there.
I tried the USB enclosure trick but the drive wouldn’t even spin up… click of death. Given the sensitivity of the data I didn’t want to go Rambo on the disk, so we sat down and had a serious discussion about whether it was worth sending to a drive recovery company. The look on her face was enough to tell me what to do, and despite her guilt over the cost I sent it in. After a week and a half the answer came back: “nothing we can do”. The tech was friendly and we had a good discussion about drive recovery, but long story short there was no hope and we were out $800. Frankly, for a lot of people that money is well spent because at least you’ve exhausted all avenues; mourn and get on with it.
When it comes to hardcore “swap the platters” style repair, things get dicey. As simplistic as hard drives seem, there are a lot of gotchas that you won’t be aware of until it’s too late. This is where Scott Moulton of MyHardDriveDied.com comes in. Scott has done two presentations, both found on YouTube, that provide a solid background on the black art of hardcore drive recovery used by most of the big-bucks recovery companies.
Scott Moulton has done an amazing service to the community by providing detailed and experienced information regarding hard drive tinkering and recovery, including things you would never otherwise consider such as “Live PCB Swap”… watch and learn.
Of course, an ounce of prevention is blah blah blah. Technologies like OS X Time Machine and ZFS make backup easier and more realistic than ever before and, most importantly, reduce data duplication significantly. Online backup solutions are good, but frankly they are only feasible on very high speed lines in an era where a trip to the beach can result in 2GB of new pictures. What I like best is the emergence of Wi-Fi USB drive solutions that let tools like Time Machine back up whenever they like without the machine being physically hooked up to a drive… the more you back up, the less there is to back up and the less hassle it is.
As a closing note… recovering the data is only part of the solution. I’ve found that some Apple apps like iPhoto and iTunes can be very unhappy when you attempt to import into a new system install. For instance, attempts to open my old iPhoto library have been unsuccessful. Thankfully I found iPhotoExtractor. As for iTunes, sadly iPods are not a backup solution… when attaching to a new system, even after authorizing, you may be told you need to delete and resync. In those cases, Sci-Fi Hi-Fi’s PodWorks can come to the rescue allowing you to extract and import music from otherwise unusable iPods.
I am in no way qualified to speak with regard to what’s going on around Blastwave, but given that there are no official statements and lots of rumors and speculation, I’ll share some info. Please note that I consider both Phil Brown and Dennis Clark friends and am completely impartial, at least for this post.
If you’ve visited Blastwave.org recently you’ll notice that parts of the site are “missing”; you may also have noticed that Genunix was down for a bit. Threads regarding this issue have appeared both on OpenSolaris Discuss and comp.unix.solaris. A nice lengthy discussion was had with Dennis Clark today in #opensolaris; however, I shall not quote that conversation directly, as over 250 people were in the room at the time.
In the simplest possible terms, Dennis Clark and Phil Brown are in the process of parting ways. Phil Brown is the creator of the CSW (Community SoftWare) effort, including the popular apt-get-like pkg-get, which has been the center of what people think of as Blastwave. Dennis Clark is the man behind Blastwave, which has provided the large infrastructure and pool of resources around CSW. These two have been intertwined for a very long time, and differences over future direction have escalated into separation. Because of this intimate relationship between CSW and Blastwave over so many years, the disconnect does, in many ways, resemble a messy divorce.
As seen in Dennis Clark’s email to OpenSolaris Discuss, lawyers are involved and action is reportedly being taken against Phil Brown. Furthermore, Blastwave Inc is in the process of incorporating in Canada. CSW now has a new home at suncsw.de in Germany.
Blastwave is still in operation for now, and thanks to the various Blastwave/CSW package mirrors users should have been minimally impacted, if at all.
For more information please read through the 3 message threads linked above so you can hear from the participants themselves. Phil and Dennis are both active members of the community; if you have questions or concerns I encourage you to respectfully and courteously contact them directly via email, lists, or IRC.
In closing, I will remind everyone that both Dennis and Phil have done amazing things for the Solaris community for many years, and they both have intentions (it would appear) to continue doing so; albeit separately. For years and years they provided the Solaris community, unrecognized by Sun at the time, with high quality software on both SPARC and X86. Regardless of which side you take or which story you read never forget that both Dennis Clark and Phil Brown, together with countless maintainers and volunteers, deserve our lasting respect and appreciation. I’m personally saddened to see the division, but thankful for what they accomplished together.
The first annual OpenSolaris Storage Summit is coming on Sunday, September 21st, to San Jose, to be followed by SNIA’s Storage Developer Conference. This is really exciting. SNIA SDC is one of the best storage conferences in the world (along with USENIX FAST), and OpenSolaris is undoubtedly the most powerful storage platform on earth… this is an excellent opportunity to pack a lot of rewarding information into a week and meet the minds behind the technology in OpenSolaris.
Among the speaker lineup, Mike Shapiro and I are the keynote speakers. If you have any interest in the future of storage you do not want to miss Mike Shapiro! Trust me. You don’t.
As for my keynote, I would love to get feedback from everyone on what you’d like me to talk about. I have some ideas on what I could speak about, but I really want to provide as much value to the attendees as possible, so if you have topics you want discussed please send them my way.
Among the topics I will address are the impact of “cloud computing” on the modern storage infrastructure, how it relates to “Open Storage”, and where things are heading in the future. Storage is, hands down, the most complex aspect of the cloud infrastructure and we’ll dig into it together.
The event is free, so you have no excuse not to attend! Just add your name to the registration list and you’re in! Pricing for SDC to follow is very reasonable and a very good investment in yourself and your company. If you’re in the Bay Area you’ve got to attend… if you’re outside the area, tell your boss that Ben Rockwood said you need to be there; they’ll give in right away.