Learning from Broken Equipment and Minor Mistakes

A few days ago, I noticed that websites were loading very slowly, particularly at the start of each page load.  It looked like there was a problem with the DNS service provided by my internal storage server.  I tried to SSH into the machine to investigate and to access the Webmin web interface; neither option worked.  However, I was still receiving replies to pings sent to the server.  I knew something was up, but I would have to dig in to figure out exactly what.
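For anyone who wants to reproduce the triage, these are roughly the checks involved (the address here is a placeholder for my server’s internal IP, not the real one):

    ping -c 4 192.168.1.10          # basic reachability: this still got replies
    ssh root@192.168.1.10           # interactive login: hung with no response
    dig @192.168.1.10 example.com   # query the internal DNS server directly: very slow / timing out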

After seeing these symptoms on a machine that was otherwise totally unresponsive, I chose to blindly reboot it, since I was running the server headless.  Not something I would ever do in a production environment, but it’s my home server.  After the reboot, the server wouldn’t even respond to ping requests, and I saw in my router that it had not registered a DHCP lease on boot.  The obvious next step was to connect a monitor to the machine and actually see what was going on.  As I moved the machine around to plug in the monitor cable, I noticed that the steel case was warm (borderline hot, in fact) to the touch.  Once the monitor was connected, I immediately saw the root of the problem: a BIOS screen telling me the boot drive had failed.

Knowing what the problem was made for a clear recovery path.  I located a replacement IDE hard drive, swapped it into the case, and reloaded a stock installation of Debian 6 (Squeeze).  Recreating the basic installation was no problem, and I added Webmin back onto the system.  Fortunately, I had no user data on the original hard drive, and I had kept the data volume in its own volume group specifically to avoid problems like this, should they ever come up.  The only parts of the configuration on the original drive were the mount, NFS, Samba, and DNS configurations.  On the other hand, I had never gotten around to backing up that configuration to another machine to guard against exactly this situation.  ;-)
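The benefit of keeping the data in its own volume group is that it lives entirely on the RAID array and has no dependency on the OS disk.  As a rough sketch, that kind of layout is created along these lines (the device name and volume names are illustrative, not my actual ones):

    pvcreate /dev/md0                     # the RAID1 array becomes the physical volume
    vgcreate datavg /dev/md0              # data-only volume group, separate from the OS disk
    lvcreate -n data -l 100%FREE datavg   # one logical volume spanning the group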

After getting a new boot drive with an OS on it, I moved on to getting the data volume up and running.  As described in my previous post, the data lives inside an LVM logical volume on top of a RAID1 array.  Since I was simply trying to locate and re-enable an existing array, I ran “mdadm --assemble --scan”, which found the two partitions in the RAID1 and activated the array.  Since they had not been touched in quite some time, no resync was even necessary.
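The sequence looks roughly like this; the last step, persisting the array definition so it assembles automatically on future boots, is something I’d suggest rather than something I described above:

    mdadm --assemble --scan                           # find and start arrays from existing superblocks
    cat /proc/mdstat                                  # confirm the array is active and in sync
    mdadm --detail --scan >> /etc/mdadm/mdadm.conf    # record the array so it comes back at boot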

The next step was to locate and reactivate the logical volume so that I could actually mount and use the data stored on it.  I ran each of the LVM scan commands, “vgscan”, “pvscan”, and “lvscan”, to confirm the metadata had been preserved.  Everything looked good, so I just had to reactivate the volume with “vgchange -a y”, which tells LVM to activate all available volume groups and their logical volumes.  I was finally able to mount the data drives and verify that all of the expected data was there.
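Put together, the LVM side of the recovery amounts to a few commands; the volume group, logical volume, and mount point names below are placeholders for my own:

    pvscan                               # physical volumes found on the RAID device
    vgscan                               # volume groups and their metadata
    lvscan                               # logical volumes (shown as inactive at this point)
    vgchange -a y                        # activate all available volume groups
    mount /dev/datavg/data /srv/data     # mount the reactivated logical volume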

Now that the data drives were healthy, I had to finish recreating the lost configuration.  I added a permanent entry in /etc/fstab to mount the volume at boot, along with my NFS exports.  I also added the local users back to the system and set their Samba passwords.  The only time-consuming task was re-entering the forward and reverse DNS entries for my internal machines and setting the DNS forwarders.
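The corresponding configuration snippets look roughly like this; the paths, names, subnet, and filesystem type are placeholders rather than my exact setup:

    # /etc/fstab - mount the data volume at boot
    /dev/mapper/datavg-data  /srv/data  ext3  defaults  0  2

    # /etc/exports - NFS export for the internal network
    /srv/data  192.168.1.0/24(rw,sync,no_subtree_check)

    # recreate a local user and set a Samba password
    adduser alice
    smbpasswd -a alice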

Now that the configuration was back to where I wanted it, I took the extra step of using Webmin’s backup module to export all of the configuration information from the system to my laptop.

Once I really knew what was happening, the fix didn’t take that long.  I spent about four hours between loading the OS, reactivating the data drives, and getting the local services set back up.  With the knowledge I now have of RAID and LVM recovery on Linux, plus configuration backups, it would probably take only about an hour, and most of that would be waiting for the OS to install to the new drive.
