Monday, November 19, 2012

Ubuntu 12.04 server crash and recovery

Wow, my last blog entry was over a year ago. I've been real busy. Anyway, I had a server crash last Friday night and I had to work hard and carefully (yea, right!) to recover it to a good state. I found a few useful resources on the web for this task, so I thought I'd blog about it so that the method is handy for my future use as well as for others. Let's look, first, at what the problem was.

The server is used for course labs, assignments and exams. Students write, compile and run C++ code on this machine. The server had accumulated some updates and needed a restart to apply them. Friday night, we had a midterm exam where students worked live and interactively on the server to solve the exam questions. After the exam was over, we decided to shut the server down for a few minutes so that the exam end time could be policed and students wouldn't be able to continue working on the problems. After shutting the server down, I came back to my office and turned it back on. But the server halted during the boot process. I had to recover the server to a working state, and I had to do it without losing data. It would also be nice if the user account information (passwords) remained intact. So, those are the parameters of the job. Now let's see what went wrong and then how I fixed it.

When the server rebooted, it got stuck during the boot process, complaining about something like "cifs_mount failed with exit code -101". I knew that there is a NAS that I am mounting over SMB through /etc/fstab, so that must've been the culprit. But how do I remove that line from /etc/fstab when the server won't boot up? After googling, I determined that one can get the system to drop to a root shell prompt by editing the kernel arguments at the GRUB boot prompt. So, I hit 'e' to edit the kernel parameters and typed "init=/bin/bash" at the end of the kernel line. The machine started and dropped to the root shell. But when I tried to edit the /etc/fstab file, it was read-only. Of course, the hard drive hadn't been mounted in read-write mode yet. Some more googling revealed that I should do a "mount -o remount,rw /". Having done that, I was able to comment out the line in /etc/fstab that mounts the NAS. I rebooted, but the boot still didn't complete, and the server was stuck at an error message which complained of a pre-start process terminating with some error code. I did some more googling, which led me to a bug report about upstart in a past version of Ubuntu. But apparently, that bug didn't apply to the version I am using (12.04).
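For next time, the fstab edit at the rescue prompt can be sketched as a small shell function. This is only a sketch under assumptions: the function name comment_out_cifs is made up for illustration, it assumes awk is reachable from the init=/bin/bash shell, and it comments out every cifs/smbfs entry rather than one specific line.

```shell
#!/bin/sh
# Comment out every CIFS/SMB entry in an fstab-style file.
# comment_out_cifs is a hypothetical helper name, not a standard tool.
# Usage: comment_out_cifs /etc/fstab
comment_out_cifs() {
    fstab="$1"
    # Keep a copy of the original next to the file before touching it.
    cp "$fstab" "$fstab.bak" || return 1
    # Field 3 of an fstab line is the filesystem type. Prefix active
    # cifs/smbfs lines with '#'; print every other line unchanged.
    awk '$1 !~ /^#/ && ($3 == "cifs" || $3 == "smbfs") { print "#" $0; next }
         { print }' "$fstab.bak" > "$fstab"
}

# At the init=/bin/bash prompt (already root, so no sudo is needed):
#   mount -o remount,rw /
#   comment_out_cifs /etc/fstab
#   sync
#   mount -o remount,ro /
```

The untouched copy stays in /etc/fstab.bak, so the NAS line can be restored once the server boots normally again.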

So, it seemed that recovering without a reinstall was out of the question. To back up the data, I booted with an Ubuntu desktop live CD. But how do I mount the server's hard drive? For that, I followed the steps on this page. Then, I connected my NAS on a USB port and did the following:

sudo mkdir /mnt/lg
sudo mount /dev/sdb1 /mnt/lg
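As for mounting the server's own disk from the live CD: the server used LVM, so the usual sequence is to install the LVM tools, activate the volume groups and mount the logical volume. The sketch below is not the linked page's procedure, just the commonly used commands. The volume group name, logical volume name and mount point are placeholders I made up, and by default the function only prints the commands instead of running them.

```shell
#!/bin/sh
# Placeholder names: check the real ones with "sudo lvs" on your machine.
VG="${VG:-server-vg}"
LV="${LV:-root}"
MNT="${MNT:-/mnt/server}"

# Print each command; actually run it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

mount_server_lvm() {
    # The desktop live CD may not include the LVM tools.
    run sudo apt-get install -y lvm2 &&
    # Look for volume groups on the attached disks.
    run sudo vgscan &&
    # Activate every volume group that was found.
    run sudo vgchange -ay &&
    # List logical volumes so the real VG/LV names can be confirmed.
    run sudo lvs &&
    run sudo mkdir -p "$MNT" &&
    # Read-only is enough for taking a backup and avoids accidents.
    run sudo mount -o ro "/dev/$VG/$LV" "$MNT"
}
```

Calling mount_server_lvm on its own shows the plan; DRY_RUN=0 VG=... LV=... mount_server_lvm would execute it with the real names filled in.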

Then, I copied all user folders to the NAS:

cd /home
sudo tar -czf /mnt/lg/homes.tar.gz *
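A slightly more careful version of the archiving step might look like the function below. It is a sketch, not what I ran at the time: backup_dir is a made-up name, and /mnt/server/home in the usage comment is a hypothetical location for the server's home folders as seen from the live CD. It uses tar's -C option instead of cd, and reads the archive back once to catch a truncated or corrupt file before the disk gets reinstalled.

```shell
#!/bin/sh
# backup_dir SRC_DIR ARCHIVE
# Archive the contents of SRC_DIR into ARCHIVE, then verify the archive.
backup_dir() {
    src="$1"
    archive="$2"
    # -C changes into SRC_DIR first, so paths in the archive are relative.
    # -c create, -z gzip, -p record permissions, -f output file.
    tar -C "$src" -czpf "$archive" . &&
    # Listing forces tar to read the whole archive; it fails on corruption.
    tar -tzf "$archive" > /dev/null
}

# Hypothetical usage, run as root so every user's files are readable:
#   sudo sh -c '. ./backup.sh; backup_dir /mnt/server/home /mnt/lg/homes.tar.gz'
```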

Having done that, I also backed up all contents of the /var/log and /var/www folders onto the NAS. To back up the user accounts, I followed the steps outlined on this page. Note that you'd have to intersperse sudo with most commands on that page. Also, where the page talks about piping one command's output to another, I had to use a sudo at the beginning of the command as well as somewhere in between. I guess we can use trial and error next time, too, until we stop getting a permission denied.
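Since the account backup is the part I'm most likely to forget, here is a rough sketch of the idea: pull the regular users (UID 1000 and up, which is where Ubuntu starts numbering them) out of passwd, shadow and group into separate files. This is an assumption about what such a procedure usually looks like, not a copy of the linked page; export_accounts and the .mig file names are invented for the example.

```shell
#!/bin/sh
# export_accounts ETC_DIR OUT_DIR [MIN_UID]
# Copy regular user accounts out of ETC_DIR/{passwd,shadow,group}.
export_accounts() {
    etc="$1"
    out="$2"
    min_uid="${3:-1000}"
    mkdir -p "$out" || return 1
    # Regular users: UID >= min_uid, skipping "nobody" (65534).
    awk -F: -v min="$min_uid" '$3 >= min && $3 != 65534' \
        "$etc/passwd" > "$out/passwd.mig" || return 1
    # Password hashes: shadow lines whose user name is in passwd.mig.
    # The subshell's umask keeps the new file unreadable to other users.
    ( umask 077
      awk -F: 'NR == FNR { keep[$1] = 1; next } ($1 in keep)' \
          "$out/passwd.mig" "$etc/shadow" > "$out/shadow.mig" ) || return 1
    # Groups in the same ID range, skipping "nogroup" (65534).
    awk -F: -v min="$min_uid" '$3 >= min && $3 != 65534' \
        "$etc/group" > "$out/group.mig"
}
```

This also explains the sudo puzzle: in a pipeline or a redirect, sudo only applies to the one command it prefixes, so reading /etc/shadow and writing the output each need their own privileges. Running the whole function as root (for example inside sudo sh -c '...') avoids sprinkling sudo through every pipe.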

Finally, I unmounted the NAS

sudo umount /mnt/lg

and rebooted the server with the 12.04 server install CD in the drive. I re-installed the server and chose to use the existing LVM partition on the hard drive without formatting it, so all data was safe anyway. I decided to turn off automatic updates this time. Then, after the server booted up successfully, I connected my NAS and mounted it. Then, I followed the restore instructions on this page.
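The restore direction is roughly the reverse: append the saved account lines to the fresh install's passwd, shadow and group files. The sketch below assumes the saved files are named passwd.mig, shadow.mig and group.mig (names I picked for illustration) and import_accounts is likewise a made-up helper. It keeps a copy of each original file and skips names that already exist, so running it twice does no harm.

```shell
#!/bin/sh
# import_accounts MIG_DIR ETC_DIR
# Append saved account entries to ETC_DIR/{passwd,shadow,group}.
import_accounts() {
    mig="$1"
    etc="$2"
    for f in passwd shadow group; do
        # Save the freshly installed file once, before the first change.
        [ -e "$etc/$f.pre-restore" ] || cp -p "$etc/$f" "$etc/$f.pre-restore" || return 1
        # Collect saved entries whose name is not already in the target.
        new=$(awk -F: 'NR == FNR { seen[$1] = 1; next } !($1 in seen)' \
            "$etc/$f" "$mig/$f.mig") || return 1
        # Append only if there is something to add.
        [ -z "$new" ] || printf '%s\n' "$new" >> "$etc/$f" || return 1
    done
}

# Hypothetical usage on the reinstalled server, as root:
#   sudo sh -c '. ./restore.sh; import_accounts /mnt/lg/accounts /etc'
```

One thing this sketch does not handle: it only compares names. The installer normally gives the first account it creates UID 1000, which may collide with the UID of an old user, so the IDs are worth checking by hand before restoring the home folders.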

That's it! The server was back online, with all passwords the same as before and no data loss. Of course, I had to install g++ and ncurses again. Plus, I made a few mistakes along the way, so I had to follow the same steps as above one more time, almost from scratch. Also, by the time the server had crashed and I had traced out a recovery path, it was 11:30 pm, too late to do anything useful, and I had a class early the next morning for which I hadn't even started preparing. So I left for home and only started working on the server recovery at about 2 pm the next day. In all, the server was offline for about a full day.