In theory, it should be simple. One of the disks, /dev/hdd, in the RAID5 array on my Linux machine (an old Dell PowerEdge server) had failed. As far as I understood, all I had to do was turn off the machine (since I don’t have hot-swappable hardware), take out the old disk, insert a new one of the same size or bigger, and reboot… But then disaster struck… According to SMART, one of the disks had a real good ol’ hardware error, but then the other disk on the same controller was also marked as defective, or in some other way taken out of the array, so my nice three-disk RAID5 was suddenly down to one disk, which left it unusable.
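For the record, a few commands along these lines help confirm that kind of diagnosis; smartctl is part of the smartmontools package, and the device names are just the ones from my setup:
# smartctl -H /dev/hdd        # quick overall SMART health verdict
# smartctl -a /dev/hdd        # full SMART attributes and the drive's error log
# mdadm --detail /dev/md0     # which members md itself considers faulty or removed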
Looking into the S.M.A.R.T. info on each of the failed disks, it seemed that one of them really had failed, while the other at least showed no SMART errors. But then, how do you get a “failed” disk back into the array? I found quite a few tutorials telling you how to “fail a disk” and then take it out of the array, but not how to get that disk back in again…
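One thing that does help here is looking at the RAID superblock that md keeps on every member disk; something along these lines:
# mdadm --examine /dev/hdc1   # the member that was kicked out
# mdadm --examine /dev/hdb1   # a member that is still good, for comparison
The output records the RAID level, chunk size, each device’s slot and an event counter; a disk that was merely dropped from the array typically just lags behind on the event count, and the geometry lines are exactly what has to match if you later resort to an mdadm --create over the same disks.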
That turned out to be as easy as re-creating the array:
# /sbin/mdadm --create --verbose /dev/md0 --level=5 --raid-devices=3 /dev/hdb1 /dev/hda1 /dev/hdf1
For a while I was reluctant to do that, as I was afraid it would erase the array, and I had some data on it that was not backed up, but there was no danger. In fact, I did it after I had already removed the failed disk and put in a new one. What I probably should have done was just a
# mdadm /dev/md0 -a /dev/hdc1
But the re-create also did the trick of getting back /dev/hdc1, which had not failed but had been thrown out anyway. However, a
mdadm --detail /dev/md0
now showed that /dev/hdb1 and /dev/hdc1 were working fine in the array, but disk number 2 was missing and my brand new /dev/hdd1 was still just a spare, with no attempt being made to pull it into the array. A lot of googling did not turn up any way to promote a spare to an active disk in the array. Probably the array would eventually notice that a disk is missing and a spare is waiting, but to me it looked as if the example in the man page of combining several operations in one mdadm command might do what I needed, so I did:
mdadm /dev/md0 -f /dev/hdd1 -r /dev/hdd1 -a /dev/hdd1
and got an error message connected to the last -a. After a separate mdadm /dev/md0 -a /dev/hdd1, everything looked fine and the array started to sync /dev/hdd1 with the rest.
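The progress of that sync can be followed in /proc/mdstat, for example:
# cat /proc/mdstat               # shows rebuild progress and an estimated finish time
# watch -n 10 cat /proc/mdstat   # same thing, refreshed every 10 seconds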
Then the next problem was to find the file system again. After a lot of googling and looking around, and after restarting the md and LVM subsystems, nothing turned up, so I gave in and restarted the machine. 🙁 Then, for the first time since the crash, it found all the volumes and came up without a hitch.
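For reference, the manual route would be roughly: rescan for volume groups on the newly assembled /dev/md0, activate the logical volumes, and mount them again. The names myvg and mylv below are made up for the example, standing in for whatever the volume group and logical volume are actually called:
# vgscan                      # rescan block devices for volume groups
# vgchange -ay                # activate all logical volumes that were found
# lvscan                      # list the logical volumes and their state
# mount /dev/myvg/mylv /mnt   # myvg/mylv are example names, use your own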
Looking back, the next time a drive fails I will do as follows (assuming /dev/hdc1 is the one that fails):
# if a healthy disk such as /dev/hdd1 was also taken offline without reason, re-add it
mdadm /dev/md0 -a /dev/hdd1
mdadm /dev/md0 -r /dev/hdc1
# Halt machine, remove the failed disk, put in the new disk
mdadm /dev/md0 -a /dev/hdc1
and then, hopefully, the file system is still there; if not, I will have to look a bit more into how to resurrect an LVM file system, or just reboot the system.
Disclaimer: this worked great for me; there is no guarantee it will work for anyone else. If the other disk really had been dead as well, I would have lost all data since my last backup…
My main sources of information for this were Managing RAID and LVM with Linux (v0.5) and the mdadm man page. I was using mdadm – v2.6.7.1 – 15th October 2008 and Debian’s kernel 2.6.26-1-686.