I’ve just come back from a two-day fiasco that is every IT department’s worst-case scenario: a data server crashing with 100% loss of files. If it wasn’t so serious, it would be funny.
On Wednesday morning, I was about to leave the Bristol office to go to our Reading office when I received a phone call from my boss. “Birmingham’s server isn’t booting up,” he told me, so I diverted myself north to go fix it.
The server in question was a Hewlett Packard DL360 G5 with two quad-core Xeons and six 300GB SAS drives configured as a RAID-5 logical drive of about 1.4TB – a 20GB system partition and the rest as a data partition – with an LTO-4 tape drive hanging off the external SAS port, running Windows Server 2003. The server was being used primarily to host data for large construction projects. A sledgehammer to crack a walnut – eight cores on Windows Server 2003 serving Office files across a LAN. Personally, I don’t like HP’s servers, simply because their integrated SAS controllers are generally junk. We had a number of ML110s that developed faulty SAS controllers. To rectify this, HP decided to send me wave after wave of tape drives, P212 controllers, and various random cables, none of which worked.
I get to the Birmingham office to witness their server constantly rebooting; it would hit the Windows 2003 startup splash screen, then five seconds later, reboot. Immediately, I notice that two of the SAS drives are flashing amber – usually meaning SMART monitoring has something to say, at the very least. The integrated controller card, a P400i, is telling me that the array is no longer valid, and that THREE drives are offline due to damage. RAID-5 stores parity across the disks, so the array can rebuild itself if one disk dies; it is therefore possible to hot-swap the dead drive out, replace it with a good one, and the controller will rebuild the array without affecting uptime. Very resilient. With three disks gone, though, it’s pretty much a write-off. I don’t believe that three disks would die at once, so I power off the server, pull the drives out a little (I don’t want to remove them fully in case I mix them up), then put it all back and reboot. All the drives come back online, and the POST tells me that the array is back online, but also warns that there is data loss. Still, Windows is not booting up. After a couple of restarts, the array fails again, but with only two disks offline this time. Strange.
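To see why one dead disk is survivable but three is fatal, here’s a toy sketch of the parity arithmetic behind RAID-5 – just the XOR idea, nothing like HP’s actual controller firmware:

```python
# Toy illustration of RAID-5 resilience: each stripe stores a parity
# block that is the XOR of the data blocks, so any ONE missing block
# can be rebuilt from the survivors.
def parity(blocks):
    """XOR a list of equal-length byte strings together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

data = [b"AAAA", b"BBBB", b"CCCC"]  # data blocks on three disks
p = parity(data)                    # parity block on a fourth disk

# Disk 2 dies: rebuild its block from the two survivors plus parity.
rebuilt = parity([data[0], data[2], p])
assert rebuilt == data[1]

# Lose TWO (or three) blocks from the stripe, and XOR of what's left
# no longer equals any single lost block -- the array is a write-off.
```

One parity block buys you exactly one disk’s worth of redundancy, which is why the controller declaring three drives offline at once was game over.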
At this moment, I’m convinced that the controller card has had a funny five minutes, if it isn’t outright faulty, so I reluctantly call HP. Crisis One appears – the server ran out of warranty in Sept ’11. Keeping a server within vendor warranty, especially one hosting user files in a production environment, is pretty much the first rule of client-server environments. I get no joy with HP, so I start considering my options. My first priority is to gain access to the partition containing the user files, so I get SmartStart booted up. Hmm – no existing partitions, according to the software. I scavenge a USB floppy drive, get the P400i drivers onto a floppy and boot from a 2003 disk. Windows reports two partitions, but both heavily emptied of data (we’re talking a couple of megabytes in each). The same with 2008 R2 – both partitions virtually flattened. I’m starting to worry now. I try Acronis 10 and Echo – they can read the disk, but as soon as I try to pull a list of files from either partition, the server hangs for 10 minutes and then reboots.
Anyway, I try every combination of recovery methods and tools, but it does look like the partitions have been completely gutted and the SAS disk in bay 1 is damaged. Our server guy and my boss tell me that the way forward is to take bay 1 offline, install Windows again using the SmartStart software (not good), load the OS up with updates, install BackupExec 12.5 and SP4, then start restoring the Tuesday-night backup tape, which I have in my possession. It’s important to note that prior to my involvement, our department spoke to the backup rep (the guy responsible for doing the backups each night) and the most recent backup tape was discussed. It was, apparently, a day old, so the users would only be re-doing Wednesday’s work. Still, could be a lot worse, right? Ha – hold that thought.
Now, I don’t like trusting backups as the primary method of recovering data, mainly because I don’t know whether the backup was successful, whether the media is undamaged, whether the job even completed, etc. I stick to my guns and try to recover the data from the damaged array, but at 5pm I concede defeat and go to reinstall Windows. Firstly, SmartStart is shit. It is supposed to make OS installations seamless and painless, injecting all the necessary drivers into the installation. It does – but it is still shit. After blue-screening three times on the controller driver, I install 2003 the raw way. In the end, it takes me until 9pm to get to the stage of restoring the backup. Crisis Two appears – the backup tape is locked with an AD account I know nothing about. Oh well – at least I can catalogue the tape.
Crisis Three makes itself known at the point when I am weak from hunger and dizzy from staring at a server all day. The date on the backup says 3rd April ’12. Three weeks ago. Maybe this is the wrong tape? Since it is now 9.15pm and I am 105 minutes away from food and bed, AND I’ll need to come back to this office the next day as early as possible, I whip off a quick email detailing my findings to the relevant people, then go home.
The next morning, I bump into the backup rep in the car park. I give him a quick run-down of what’s happened; he pales noticeably, and tells me that the “Tuesday” tape I’ve got is indeed about three weeks old. No-one has been doing the backups. In effect, all the work, projects, data – EVERYTHING – done from the 4th of April to the 27th was for nothing. Strangely, the users I tell this to aren’t that concerned. Previous experience with similar (but more successful) scenarios would suggest the users should be constantly giving the department grief about timescales, loss of productivity and such, but they are already resigned to the fact that data is going to be missing. The ONE SAVING GRACE is that some of the major projects in this region were set up on our central storage location, so I actually thanked the Lord for that, else we might as well have shut up shop! To put this in perspective, imagine if the previous three weeks of your emails and Facebook updates were suddenly deleted. Now imagine that happening to a few hundred people.
These are the big lessons to be learnt:
1. Always ensure backups are done. Hardware is incidental and can be swapped or replaced at will, but data and information take time and resources to create. A 20-second repetitive action at the end of every day, using relatively cheap tapes, is a no-brainer.
2. Monitor backups, and act on any missed events immediately. Although the backups in the branch were missed for three weeks for reasons unknown (I expect this to come out in the post-mortem), the email alerts sent to the user AND to our department weren’t acted upon. Even a quick email to me to chase up the backup rep – or to ask me to put a tape in when I was visiting the branch – would have reduced the impact considerably, if not eliminated it.
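That second lesson boils down to one check: how old is the last successful backup? Here’s a hypothetical sketch of the kind of freshness check that would have caught this weeks earlier – the threshold and dates are illustrative, not from any real monitoring setup we run:

```python
# Hypothetical backup-freshness check: compare the date of the last
# successful backup against a maximum allowed age, and raise the
# alarm as soon as a nightly run is missed.
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=1)  # nightly schedule: anything older is a missed run


def backup_is_stale(last_success: datetime, now: datetime) -> bool:
    """True if the most recent successful backup is older than MAX_AGE."""
    return (now - last_success) > MAX_AGE


# In this story, the "Tuesday" tape was actually dated 3rd April,
# discovered on 26th April -- over three weeks stale.
last = datetime(2012, 4, 3)
now = datetime(2012, 4, 26)
assert backup_is_stale(last, now)  # should have been screaming for weeks

# A backup from last night passes the check.
assert not backup_is_stale(now - timedelta(hours=12), now)
```

The point isn’t the code – BackupExec was already emailing alerts – it’s that a staleness check is only worth anything if someone is on the hook to act when it fires.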
There is a little side-note to this. Remember I mentioned earlier that some of the big projects were working from a central location? The above scenario can happen to ANY company that has traditional server/client setups across multiple locations, and we decided to eliminate this risk. We are in the final stages of migrating all server data in all our offices into a central SAN store, and installing Riverbed Steelhead network accelerators in place of these local storage servers. The idea is that the data is safe in one location which is heavily backed up, the users still get the “data in the local LAN” experience, and we don’t have to rely on users remembering to do tape backups. Sadly, the Birmingham office migration was delayed due to a larger London-based project sucking up all IT resource at the start of the month. Ironic, eh? If the project plan for these data migrations had been adhered to, this entire problem would never have happened. Also, the guy who is a key component of these data migrations handed his notice in at the end of March, so he’s obviously on a “wind-down” period. Interestingly, he’s ALSO the chap that monitors the backup alerts.