Anatomy of an Outage – Part 1

From 3:15 AM CDT to 6:05 PM CDT on Friday, September 30th, 2011, we experienced an outage on one of our servers. It all stemmed from a simple support request and a very hacky fix implemented back in July (while I was on vacation). This is part one of the story behind the outage, and you can blame me for it.

This story begins on Wednesday, September 28th, 2011, when we had issues with a database server. That particular server had caused headaches earlier in the year and prompted the purchase of some newer hardware (which happens to be hosting the Apache VM at the moment). JD moved the affected databases to a different database server. Around the same time, WordPress’ built-in plugin updater stopped working, failing with a “Failed to connect to FTP Server” message. A user first brought this issue forward on the 29th.

Since I found myself with a little free time on the 29th, I decided to look into the issue. Almost immediately, I could reproduce it on this site. As this site uses a different database server, yet experienced the same problem, it appeared to be an FTP server issue. According to the logs, the FTP daemon was never even receiving a connection attempt from WordPress.
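When a daemon's logs show no connection attempt at all, one quick way to narrow things down is to test from the web server whether the FTP port can be reached in the first place. The following is only a minimal sketch of that kind of check, not what was actually run during this incident; the function name `port_is_reachable` and the host `ftp.example.com` are made up for illustration.

```python
import socket


def port_is_reachable(host, port=21, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical host name; substitute the FTP server WordPress points at.
    print(port_is_reachable("ftp.example.com", 21))
```

A `True` result only says that something is listening on the port; it does not prove that the FTP daemon itself is healthy or that logins work.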

The easy thing to do was update the FTP server. That went well until it came to merging configuration scripts and startup scripts. I foolishly deleted the old startup script. We had not yet migrated that VM to OpenRC, but the ebuild had. Thus, the new startup script was not compatible with our server, and I could not start up the FTP daemon.

Migrating from sysinit to OpenRC does require updating quite a few packages. The kernel had to be updated to install a new udev required for OpenRC. Plus, there were some circular dependencies that had to be resolved. Staying up late to babysit this process was not an option due to an oncoming illness. Idiotically, I uninstalled sysinit before getting a new kernel installed (uninstalling sysinit should have been the last thing I did, right before installing OpenRC). This left the system in a state such that it could not come back up on reboot.

Due to an issue with VMware Server and the host hardware back in late July, a cron job was set up to reboot the VM at 3:15 AM every day. Yes, this is something you should never do (just like you should never rely on watchdog timers for normal device operation). When the VM was moved to the new host hardware (and over to ESXi), the cron job should have been removed. However, I hadn’t had the time to do that yet (blame the project that ate my summer).

So, leaving the server in a state where it could not come back up after a reboot, with a cron job still set to reboot it at 3:15 AM, was a recipe for disaster. When I woke up on Friday, I went to check on the server. It had rebooted and attempted to start up, but without an init system, it couldn’t start any of the userland programs. Crap.
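In hindsight, a quick scan of the crontab for scheduled reboots before leaving a half-migrated system overnight would have flagged the danger. Below is a minimal sketch of such a check, assuming the crontab contents are available as text (for example, from `crontab -l`). The function name, the list of commands, and the sample entries are all illustrative and not taken from the actual server.

```python
import re

# Commands that restart or power off the machine (illustrative, not exhaustive).
# The leading group requires start of line, whitespace, a path separator, or a
# shell separator, so cron's "@reboot" keyword (run at startup) is not matched.
REBOOT_PATTERN = re.compile(r"(?:^|[\s/;&|])(?:reboot|shutdown|halt|poweroff)\b")


def find_scheduled_reboots(crontab_text):
    """Return (line_number, line) pairs for active entries that run a reboot-style command."""
    hits = []
    for number, line in enumerate(crontab_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and commented-out entries.
        if not stripped or stripped.startswith("#"):
            continue
        if REBOOT_PATTERN.search(stripped):
            hits.append((number, stripped))
    return hits


# Made-up crontab resembling the situation described above.
SAMPLE = """\
# m h dom mon dow command
15 3 * * * /sbin/reboot
0 4 * * * /usr/local/bin/backup.sh
@reboot /usr/local/bin/start-agent.sh
"""

if __name__ == "__main__":
    for number, entry in find_scheduled_reboots(SAMPLE):
        print(f"line {number}: {entry}")
```

This is a simple text match, so it can miss reboots hidden inside wrapper scripts and may flag commands that merely contain one of the words; it is a reminder to go look, not a guarantee.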

-John Havlik

[end of transmission, stay tuned]