Re: regression caused by block: freeze the queue earlier in del_gendisk

From: Thorsten Leemhuis
Date: Tue Sep 20 2022 - 05:13:22 EST


Hi, this is your Linux kernel regression tracker.

On 13.09.22 04:36, Dusty Mabe wrote:
> On 9/12/22 21:55, Ming Lei wrote:
>> On Mon, Sep 12, 2022 at 09:16:18AM +0200, Christoph Hellwig wrote:
>>> On Fri, Sep 09, 2022 at 04:24:40PM +0800, Ming Lei wrote:
>>>> On Wed, Sep 07, 2022 at 09:33:24AM +0200, Christoph Hellwig wrote:
>>>>> On Thu, Sep 01, 2022 at 03:06:08PM +0800, Ming Lei wrote:
>>>>>> It is a bit hard to associate the above commit with reported issue.
>>>>>
>>>>> So the messages clearly are about something trying to open a device
>>>>> that went away at the block layer, but somehow does not get removed
>>>>> in time by udev (which seems to be a userspace bug in CoreOS). But
>>>>> even with that we really should not hang.
>>>>
>>>> Xiao Ni provides one script[1] which can reproduce the issue more or less.
>>>
>>> I've run the reproduced 10000 times on current mainline, and while
>>> it prints one of the autoloading messages per run, I've not actually
>>> seen any kind of hang.
>>
>> I can't reproduce the hang too.
>
> I obviously can reproduce the issue with the test in our Fedora CoreOS
> test suite. It's part of a framework (i.e. it's not simple some script
> you can run) but it is very reproducible so one can add some instrumentation
> to the kernel and feed it through a build/test cycle to see different
> results or logs.
>
> I'm willing to share this with other people (maybe a screen share or
> some written down instructions) if anyone would be interested.

This thread looked stalled, or was there any progress in the past week?
If not: Fedora apparently removed the patch in their kernels a while
ago, as quite a few users where hitting it. What is preventing us from
doing the same in mainline and 5.19.y until the issue can be resolved?
The description of a09b314005f3 ("block: freeze the queue earlier in
del_gendisk") doesn't sound like the change does something crucial that
can't wait a bit. I might be totally wrong with that, but I think it's
my duty to ask that question at this point.

>> What I meant is that new raid disk can be added by mdadm after stopping
>> the imsm container and raid disk with the autoloading messages printed,
>> I understand this behavior isn't correct, but I am not familiar with
>> raid enough.
>>
>> It might be related with the delay deleting gendisk from wq & md kobj
>> release handler.
>>
>> During reboot, if mdadm does this stupid thing without stopping, the hang
>> could be caused.
>>
>> I think the root cause is that why mdadm tries to open/add new raid bdev
>> crazily during reboot.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.