Anatomy of an Outage – Part 2

From 3:15 AM CDT to 6:05 PM CDT on Friday, September 30th, 2011, Weblogs.us experienced an outage on one of its servers. This is part two of two of the story behind the outage, and it covers the events of Friday (i.e. fixing the mess).

Luckily, as I have installed Gentoo several times in the past, and fixed borked installs just as often, I knew a path forward: boot into a live CD, mount the partitions, download a new base system, and chroot. This essentially amounts to a “full reinstall”, though I didn’t have to delete the /etc/ directory or the location where all of the users’ data is stored.
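For readers who have not done this before, the path above (live CD, mount, fresh base system, chroot) might look roughly like the sketch below. It only prints the commands rather than running them. The device name, mount point, and tarball name are hypothetical placeholders, not the values used on the Weblogs.us server, and the exact steps I ran may have differed.

```shell
#!/bin/sh
# Dry-run sketch of the rescue path: it prints each step, it does not run it.
# ROOT_DEV, MNT and STAGE3 are hypothetical placeholders (assumptions),
# not values taken from the actual server.
ROOT_DEV="${ROOT_DEV:-/dev/sda3}"
MNT="${MNT:-/mnt/gentoo}"
STAGE3="${STAGE3:-stage3.tar.bz2}"

rescue_plan() {
    # 1. From the live CD, mount the existing root partition.
    echo "mount $ROOT_DEV $MNT"
    # 2. Unpack a fresh base system over the old one, preserving permissions.
    #    Files shipped in the tarball are overwritten, so back up first.
    echo "tar xpf $STAGE3 -C $MNT"
    # 3. Make the kernel filesystems visible inside the chroot.
    echo "mount -t proc proc $MNT/proc"
    echo "mount --rbind /dev $MNT/dev"
    echo "mount --rbind /sys $MNT/sys"
    # 4. Switch into the repaired system.
    echo "chroot $MNT /bin/bash"
}

rescue_plan
```

Printing the plan first is a deliberate choice: it lets you check the device and paths before anything on disk is touched.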

Unfortunately, since the VM was now living on ESXi (really cool software by the way), I was not familiar with mounting ISO images. It turned out to be incredibly easy to do, once I had the proper credentials. The next roadblock came from trying to RDC (Remote Desktop Connection) from one server into another, which causes the console in vSphere to remain blank (in some circumstances). Connecting directly to the server with vSphere installed was the easy fix for this.

Once in the live CD, and before overwriting anything, I felt it was a good time to back up the important configuration files and user data. The configuration files didn’t take long to back up. The user data, on the other hand, took forever. By the time it was done backing up, it was mid-afternoon. Now is a good time to mention that I was supposed to be traveling that evening. An hour later, all of the base software had been compiled, the boot loader was configured, and the system was ready to boot into the ‘fixed’ OS.

Once booted into the real install, getting Apache and PHP back up and running was quick and fairly painless. Just a little merging of config files and we were good to go. After going through all of this, it is easy to forget the initial issue. Luckily, this entire ordeal did resolve the issue where the WordPress updater couldn’t contact the FTP server. So, in a roundabout way, this was “Mission Accomplished”.

Lessons Learned

Lessons to take away from this ordeal are: update frequently, don’t use cron jobs to do a periodic reboot, and don’t begin a project when you are getting sick. All three are really obvious.

By updating frequently, I mean that servers should be updated monthly. Taking a little time to keep up with updates each month will save a 14-hour ordeal sometime in the future.

Cron jobs are very cool things. However, using them as a preemptive watchdog timer is not appropriate for a server. Rebooting daily is not appropriate for a server either. Both are hacky “fixes” for problems that are much deeper and should be fixed properly. Otherwise, those problems will come back and bite you at a later time.

As for the sick thing, due to high winds in my area that Thursday, I had initially attributed my feeling under the weather those days to allergies. It ended up being an actual cold. I always find colds distracting. The symptoms make it difficult to focus on the task at hand, resulting in poor decisions being made. Getting extra sleep that Thursday night is what I should have done.

-John Havlik

[end of transmission, stay tuned]
