This post requires two major disclaimers:
- I am not an engineer. I am a relatively technical sales & marketing guy. I have my own Small Business Server-based network at home, and I know enough about Microsoft Operating Systems to be able to muddle through most of what gets thrown at me. And, although I’ve done my share of friends-and-family-tech-support, you do not want me working on your critical business systems.
- I am not, by any stretch of the imagination, a Linux guru. However, I’ve come to appreciate the “LAMP” (Linux/Apache/MySQL/PHP) platform for Web hosting. With apologies to my Microsoft friends, there are some things that are quite easy to do on a LAMP platform that are not easy at all on a Windows Web server. (Just try, for example, to create a file called “.htaccess” on a Windows file system.)
Some months ago, I got my hands on an old Dell PowerEdge SC420. It happened to be a twin of the system I’m running SBS on, but didn’t have quite as much RAM or as much disk space. I decided to install CentOS v5.4 on it, turn it into a LAMP server, and move the four or five Web sites I was running on my Small Business Server over my new LAMP server instead. I even found an open source utility called “ISP Config” that is a reasonable alternative – at least for my limited needs – to the Parallels Plesk control panel that most commercial Web hosts offer.
Things went along swimmingly until last weekend, when I noticed a strange, rhythmic clicking and beeping coming from my Web server. Everything seemed to be working – Web sites were all up – I logged on and didn’t see anything odd in the system log files (aside from the fact that a number of people out there seemed to be trying to use FTP to hack my administrative password). So I decided to restart the system, on the off chance that it would clear whatever error was occurring.
Those of you who are Linux gurus probably just did a double facepalm…because, in retrospect, I should have checked the health of my disk array before shutting down. The server didn’t have a hardware RAID controller, so I had built my system with a software RAID1 array – which several sources suggest is both safer and better performing than the “fake RAID” that’s built into the motherboard. Turns out that the first disk in my array (/dev/sda for those who know the lingo) had died, and for some reason, the system wouldn’t boot from the other drive.
This is the point where I did a double facepalm, and muttered a few choice words under my breath. Not that it was a tragedy – all that server did was host my Web sites, and my Web site data was backed up in a couple of places. So I wouldn’t have lost any data if I had rebuilt the server…just several hours of my life that I didn’t really have to spare. So I did what any of you would have done in my place – I started searching the Web.
The first advice I found suggested that I should completely remove the bad drive from the system, and connect the good drive as drive “0.” Tried it, no change. The next advice I found suggested that I boot my system from the Linux CD or DVD, and try the “Linux rescue” function. That sounded like a good idea, so I tried it – but when the rescue utility examined my disk, it claimed that there were no Linux partitions present, despite evidence to the contrary: I could run fdisk -l and see that there were two Linux partitions on the disk, one of which was marked as a boot partition, but the rescue utility still couldn’t detect them, and the system still wouldn’t boot.
I finally stumbled across a reference to something called “SuperGRUB.” “GRUB,” for those of you who know as much about Linux as I did before this happened to me, is the “GNU GRand Unified Bootloader,” from the GNU Project. It’s apparently the bootloader that CentOS uses, and it was apparently missing from the disk I was trying to boot from. But that’s precisely the problem that SuperGRUB was designed to fix!
And fix it it did! I downloaded the SuperGRUB ISO, burned it to a CD, booted my Linux server from it, navigated through a quite intuitive menu structure, told it what partition I wanted to fix, and PRESTO! My disk was now bootable, and my Web server was back (albeit running on only one disk). But that can be fixed as well. I found a new 80 Gb SATA drive (which was all the space I needed) on eBay for $25, installed it, cruised a couple of Linux forums to learn how to (1) use sfdisk to copy the partition structure of my existing disk to the new disk, and (2) use mdadm to add the new disk to my RAID1 array, and about 15 minutes later, my array was rebuilt and my Web server was healthy again.
There are two takeaways from this story:
First, the Internet is a wonderful thing, with amazing resources that can help even a neophyte like me to find enough information to pull my ample backside out of the fire and get my system running again.
Second, all those folks out there whom we sometimes make fun of and accuse of not having a life are actually producing some amazing stuff. I don’t know the guys behind the SuperGRUB project. They may or may not be stereotypical geeks. I don’t know how many late hours were burned, nor how many Twinkies or Diet Cokes were consumed (if any) in the production of the SuperGRUB utility. I do know that it was magical, and saved me many hours of work, and for that, I am grateful. (I’d even ship them a case of Twinkies if I knew who to send it to.) If you ever find yourself in a similar situation, it may save your, um, bacon as well.