mdadm unable to stop RAID device after disk failure

From: Adam Nielsen
Date: Sat Sep 12 2015 - 21:49:42 EST


Hi all,

I'm having some problems trying to work out how to get mdadm to restart
a RAID array after a disk failure. It refuses to stop the array, saying
it's in use, and it refuses to assemble the array again, saying the
devices are already part of another array:

$ mdadm --manage /dev/md10 --stop
mdadm: Cannot get exclusive access to /dev/md10:Perhaps a running
process, mounted filesystem or active volume group?
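
I would have expected the holders that message is hinting at (a process,
a mount, or an LVM volume group sitting on top of md10) to show up with
something like the following; these are just the standard tools, nothing
mdadm-specific, so apologies if I'm missing a better way to check:

$ fuser -vm /dev/md10
$ lsof /dev/md10
$ grep md10 /proc/mounts
$ pvs | grep md10
$ ls /sys/block/md10/holders/

Marking the array failed and then trying to stop it again doesn't make
any difference: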

$ mdadm --manage /dev/md10 --fail

$ mdadm --manage /dev/md10 --stop
mdadm: Cannot get exclusive access to /dev/md10:Perhaps a running
process, mounted filesystem or active volume group?

$ cat /proc/mdstat
Personalities : [raid0]
md10 : active raid0 sde1[0] sdd1[1]
5860268032 blocks super 1.2 512k chunks

Why is it still telling me the array is active after I have tried to
mark it failed? If I try to specifically list one of the devices that
make up the array, that doesn't work either:

$ mdadm --manage /dev/md10 --fail /dev/sdd1
mdadm: Cannot find /dev/sdd1: No such file or directory

This is because /dev/sdd doesn't exist anymore: it's an external drive,
so when I replugged it, it came back as /dev/sdf. The manpage says you
can use the special word "detached" for this situation, but that doesn't
work either:

$ mdadm --manage /dev/md10 --fail detached
mdadm: set device faulty failed for 8:65: Device or resource busy

8:65 corresponds to /dev/sde1, so it appears to pick the right device,
but why is it busy? Isn't the point of --fail to simulate a drive
failure, which could occur at any time, even while the drive is busy?
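
For reference, the 8:65 mapping can be checked with ls -l or lsblk, and
I'd expect the per-member state to be visible in sysfs as well (the
dev-sde1 path is my guess at the name md uses for that member):

$ ls -l /dev/sde1
$ lsblk -o NAME,MAJ:MIN
$ cat /sys/block/md10/md/dev-sde1/state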

The two disks (sdd and sde) reappeared as sdf and sdg after replugging,
so I thought I could just assemble them as a new array and ignore the
old, failed one:

$ mdadm --assemble /dev/md11 /dev/sdf1 /dev/sdg1
mdadm: Found some drive for an array that is already
active: /dev/md/10
mdadm: giving up.

I'm not sure how it decides the drive is part of an active array when
it's a different device node now. Presumably it's matching the array
UUID stored in the superblock rather than the device name, and the stale
md10 still claims that UUID. Either way, it wouldn't be a problem if
there were some way to remove the old array that is refusing to die!
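
If it helps to see what it's actually matching on, I'd expect the UUIDs
reported by these two commands to line up (--examine reads the
superblock on the replugged member, --detail queries the stale array):

$ mdadm --examine /dev/sdf1
$ mdadm --detail /dev/md10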

Is there any way to solve this problem, or do you just have to reboot a
machine after a disk failure?
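
The only other thing I can think of is forcing the array state through
sysfs, along the lines of the interface described in Documentation/md.txt,
but I don't know whether that's safe or whether it will just hit the
same "busy" check:

$ echo inactive > /sys/block/md10/md/array_state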

Thanks,
Adam.

