Monday, December 05, 2005

Long time no chat

It's been a sickeningly full several months. I was working on an article on Love, Like, and In Love, but have been side-tracked.

ARG! A few months ago, a customer began having odd issues with a product we designed and sold them. The appliance we developed has done rather nicely in the space, but this particular box seems abused. Such a turn of events may I never witness again.

At the end of August I got a note saying that they had to reboot the box, and when it came back up the past week's data was missing. Restoring from backup resolved the immediate concern, but the underlying issue was strangely elusive. It turns out one of a mirrored pair of hard drives had failed, and the mirror had been out of sync for about a week.
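
For anyone wondering how you even catch a mirror drifting out of sync: the state of a Linux software RAID set lives in /proc/mdstat, and a quick check along these lines (device names and the notification address are illustrative, not the actual box) would have flagged the degraded mirror a week earlier:

    cat /proc/mdstat                 # a healthy mirror shows [UU]; a degraded one shows [U_]
    mdadm --detail /dev/md0          # per-disk state: active sync, faulty, removed, etc.
    mdadm --monitor --scan --daemonise --mail=admin@example.com   # email me when a disk drops out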

So I decide to buy a complete new system, so I'd have all known-good parts to swap in... After dorking around with my supplier for a couple of weeks trying to help them figure out their 455 from a hole in the ground, I get the dreaded call.... My customer's other drive failed. I found myself doing a complete restore from backup, onto a new drive I had to purchase from BestBuy. :(

So I install the new drive while awaiting a new pair of drives to mirror. Unfortunately I created the RAID and LVM constructs using newer tools (a Kubuntu live CD), which default to creating metadata versions not supported by the kernel on the appliance... DOH! Starting over with the original install CD and recovering that way ended up giving me the environment I wanted. Once I nailed that down, the restore process came off fine, except for a few file-permission issues I ran into, probably due to a command-line switch I didn't include... and this turned into a revamping of the documentation and scripts for recovering from backup. All is well on that front now, and I've vetted the new process several more times to make sure. Several unbillable hours burned.
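
For the search engines: the escape hatch here is that newer mdadm and LVM2 tools can be told to write the older on-disk formats an older kernel understands. Something roughly like this (device names are illustrative, and these may not be the exact versions the appliance needed):

    mdadm --create /dev/md0 --metadata=0.90 --level=1 --raid-devices=2 /dev/hda1 /dev/hdc1   # old-style md superblock
    pvcreate -M1 /dev/md0        # -M1 writes LVM1-format metadata; LVM2 tools default to -M2
    vgcreate -M1 vg0 /dev/md0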

So the drives come in, and I get to nervously rerun the complete restore process from scratch. I did this partly to test the process further, but also to make sure no hidden permission issues remained. More unbillable hours burned.

So I start thinking we're out of the woods... until I realize that the system doesn't work with all four drives installed and plugged in! What!? Troubleshooting turned up that if I disconnect power from any of them, the other three work great. With all four, I get strange errors in the logs. Fine, I have to get the unit back onsite and operational immediately, so we run on 3 drives for a while. A couple days later I replace the power supply. Everything works great. MANY unbilled hours.

So I still attempt to purchase a complete new system, and that process has its own headaches. Between the supplier being nonresponsive and not being able to figure out how much to charge me, I decide to go with another vendor, one I've had some history with and whose prices are comparable. I've learned something from all this, and that is that 3-year warranties are a "Good Thing". Unfortunately, at this point in the process, I haven't yet learned that spending the additional $$$ to have the system assembled and tested is as well. The first unit is DOA. Tech support thinks the issue is the BIOS not recognizing the processor. I send back the memory, processor, and mobo for a BIOS upgrade and testing. Because I didn't check the "Assembly and Testing" box originally when I ordered, I pay shipping. I receive the parts back in about a week. They say it works for them; it no worky for me. I swap in a known-good power supply. Still no worky. Now I'm significantly confused, and about to send the whole thing back and order another unit with assembly and testing. ARG! None of this is billable, just plain customer service. And not very timely, at that!

So then I get a call from the customer: the machine locked up during the night. A reboot doesn't fix it, because the BIOS, having seen the primary drive disappear once, decides it isn't there and won't boot from it. One simple fix later, I start to realize that the system locks up if I leave both backup drives plugged in (this is a second RAID array, used for live backups). New problem! Drive-controller issue. At this point I disconnected the primary mirror drive (in lead-climbing terms, I was setting an anchor to catch me if I fell). I then proceeded to disconnect the backup mirror drive as well. One primary drive, one backup drive. And things are happy... for the moment. And only a couple of unbillable hours burned.

So I then get a phone call that the email subsystem is not delivering new mail! Apparently the hard lockup had interrupted some indexing process and corrupted a database in the Cyrus mail store. Troubleshooting this proved problematic, as the documentation doesn't delve into those depths of the inner workings of Cyrus, and I was under the gun. After many hours of looking, finally finding a good bunch o' Cyrus hackers on freenode IRC (#cyrus, very respectable), and getting feedback about some IRC naughties I was doing... ;) I had learned a great deal about the inner workings of the Cyrus IMAP server. Maybe more on that later. I ended up whacking the meta-data and restoring from backup (funny how that works... backup locked it up, backup restored it... ;) Several unbillable hours burned.
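
The short version of the fix, for anyone who lands here from a search: stop Cyrus, get the corrupt meta files for the affected mailboxes out of the way, let reconstruct rebuild them from the message files on disk, then restore whatever is still missing from backup. Roughly like this (paths are Debian-ish and the mailbox name is made up):

    /etc/init.d/cyrus21 stop
    mv /var/spool/cyrus/mail/user/jdoe/cyrus.* /root/cyrus-salvage/    # stash the suspect meta-data
    su - cyrus -c "/usr/lib/cyrus/bin/reconstruct -r user.jdoe"        # rebuild cyrus.index & friends from the spool
    /etc/init.d/cyrus21 start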

So the end of the month came, and with it the full monthly backup, which is an intense one. Thus, even though I was going to replace the entire system, I installed a new SATA drive controller and monitored the progress. December 1st came and went and the box continued to tick. The monthly backup came off without a hitch. Happiness.

So then on December 3rd I became rather flustered to find the box was no longer sending backup-status emails. Further investigation showed that the backup drive was no longer responding. 5**7! Thankfully, with all this work going on, the customer had temporarily given me a key, so I visited the site after church on Sunday with two new backup drives. I was able to bring the drive back online. I immediately installed one of the replacement drives as a mirror, booted into maintenance mode, and started the syncing process, hoping to salvage the historical backups. After a while spent making sure the sync process wouldn't lock up the box (as it had so consistently done recently), I started the rest of the services (except cron) and left. A few unbillable hours lost.
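
For the record, the add-and-resync dance itself is only a couple of commands; roughly (device names illustrative):

    sfdisk -d /dev/hdc | sfdisk /dev/hdd    # clone the partition table from the surviving drive
    mdadm /dev/md1 --add /dev/hdd1          # the kernel starts rebuilding the mirror immediately
    watch cat /proc/mdstat                  # resync progress shows up as a percentage and an ETA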

So I get home, and the box is still responding (ssh is a wonderful thing), but slow as all get-out. I had forgotten I left cron turned off (to keep the timed events from interfering with the recovery, etc...), and found that the backups didn't happen. I started the DAILY backup about 7 this morning, and got a call from the customer that things were slow and emails weren't coming in... DOH! cron is responsible for kicking off the fetchmail process. I turn cron back on, but both the backups and fetchmail are still not operational... The [raid1] and [mdrecoveryd] processes aren't eating all the CPU, the load average is still at 5.xx, and I can't yet get a status on the RAID array. I'm back to seeing the "hda: lost interrupt" messages. 5**7! I should've used rsync. That drive has got to go...
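
For context, cron drives everything on this box; the relevant crontab entries look something like this (times and script names are illustrative, not the actual ones):

    */10 * * * *  /usr/bin/fetchmail -f /etc/fetchmailrc --silent   # pull the customer's mail every 10 minutes
    0 2 * * *     /usr/local/sbin/daily-backup.sh                   # kick off the nightly backup

And the rsync I'm kicking myself over would have been a plain "rsync -a /backup/ /mnt/newdrive/backup/" onto the fresh drive, instead of letting md re-mirror everything through a flaky controller and a dying disk.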

So you see I've been using an excruciatingly annoying literary faux pas, beginning each paragraph with "So...". This is intentional, to help you get the feeling of frustration and annoyance I've been struggling with for the past three months. Nuf sed.

And this is mixed with family time, day-job, reverse-engineering, creating version 2.0 of this appliance (based on Ubuntu), and prepping/running a Local Mentor Program class for the almighty SANS. I used to remember a very depressing time of being bored... now I only remember that it happened. I'd still take now over then. This may be a post of frustration, but I am also thankful for the life I've been blessed with. Great family, great friends, amazing hobbies, relative business success, and the many faculties I cannot claim responsibility for. I thank the creator for them, and am thankful in all circumstances.

Kurios