Re: Lockup of (raid5 or raid6) + vdo after taking out a disk under load

From: Yu Kuai
Date: Sun Jul 14 2024 - 21:56:30 EST


Hi,

On 2024/07/13 21:50, Konstantin Kharlamov wrote:
On Sat, 2024-07-13 at 19:06 +0800, Yu Kuai wrote:
Hi,

On 2024/07/12 20:11, Konstantin Kharlamov wrote:
Good news: your diff seems to have fixed the problem! I would have to
test more extensively in another environment to be completely sure, but
by following the minimal steps-to-reproduce I can no longer reproduce
the problem, so the fix seems to work.

That's good. :)

Bad news: there's a new lockup now 😄 This one seems to happen after
the disk is returned; unless the act of returning it just happens to
coincide with the stacktraces appearing, which is still possible even
though I re-tested multiple times, because the traces (below) don't
always appear. However, even when the traces don't appear, IO load on
the fio running in the background drops to zero, so something is
definitely wrong.

Ok, I need to investigate this more. The call stack is not very
helpful.

Is it not helpful because of the missing line numbers or in general? If
it's the missing line numbers, I'll try to fix that. We're using some
Debian scripts that create deb packages, and they don't work well with
debug information (it's put into a separate package, but even with that
installed the kernel traces still don't have line numbers). I haven't
investigated it, but I can if that would help.

Line numbers will be helpful. Meanwhile, can you check whether the
underlying disks have IO while raid5 is stuck, via
/sys/block/[device]/inflight?
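
For example, something like this should show whether IO is still
pending on the member disks (sdX is just a placeholder for one raid5
member; the two numbers are in-flight reads and writes):

cat /sys/block/sdX/inflight # prints the reads and writes currently in flight
grep . /sys/block/sd*/inflight # or check every disk at once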

First of all, can the problem be reproduced with raid1/raid10? If not,
this is probably a raid5 bug.

This is not reproducible with raid1 (i.e. no lockups for raid1), I
tested that. I didn't test raid10; if you want I can try (but probably
only after the weekend, because today I was asked to give the nodes
away to someone else, at least for the weekend).

Yes, please try raid10 as well. For now I'll say this is a raid5
problem.
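
If it helps, creating a raid10 array for the same test could look
roughly like this (a sketch only; the device names and member count are
placeholders, and if your array is lvm dm-raid rather than mdadm the
equivalent lvcreate --type raid10 would apply):

mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde # placeholder devices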

The best would be if I could reproduce this problem myself. The problem
is that I don't understand step 4, turning off the jbod slot's power:
is this only possible on a real machine, or can I do this in my VM?

Well, let's say that if it is possible, I don't know of a way to do it.
The `sg_ses` commands that I used

sg_ses --dev-slot-num=9 --set=3:4:1 /dev/sg26 # turning off
sg_ses --dev-slot-num=9 --clear=3:4:1 /dev/sg26 # turning on

…set and clear the 3:4:1 bit, where the bit is defined by the JBOD
manufacturer's datasheet. The 3:4:1 value specifically is defined by
the manufacturer "AIC". That means the commands as-is are unlikely to
work on different hardware.

I never did this before, I'll try.

Well, while we're at it, do you have any thoughts on why just using
`echo 1 > /sys/block/sdX/device/delete` doesn't reproduce it? Does the
kernel perhaps not emulate device disappearance well enough?

echo 1 > delete just deletes the disk from the kernel, so scsi/dm-raid
will know that this disk is deleted. With the other way, however, the
disk stays in the kernel: dm-raid is not aware that the underlying
disks are problematic, and IO will still be generated and issued.
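
To illustrate the difference (sdX and hostN are placeholders; the
rescan line is just the usual way to rediscover a deleted scsi disk):

echo 1 > /sys/block/sdX/device/delete # graceful removal, scsi/dm-raid are notified
echo "- - -" > /sys/class/scsi_host/hostN/scan # rescan the controller to bring the disk back

Pulling the slot's power, by contrast, leaves the device registered in
the kernel and IO keeps being issued to it, as described above.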

Thanks,
Kuai
