Re: regression caused by block: freeze the queue earlier in del_gendisk
From: Jens Axboe
Date: Tue Sep 20 2022 - 10:05:49 EST

On 9/20/22 3:11 AM, Thorsten Leemhuis wrote:
> Hi, this is your Linux kernel regression tracker.
>
> On 13.09.22 04:36, Dusty Mabe wrote:
>> On 9/12/22 21:55, Ming Lei wrote:
>>> On Mon, Sep 12, 2022 at 09:16:18AM +0200, Christoph Hellwig wrote:
>>>> On Fri, Sep 09, 2022 at 04:24:40PM +0800, Ming Lei wrote:
>>>>> On Wed, Sep 07, 2022 at 09:33:24AM +0200, Christoph Hellwig wrote:
>>>>>> On Thu, Sep 01, 2022 at 03:06:08PM +0800, Ming Lei wrote:
>>>>>>> It is a bit hard to associate the above commit with the reported issue.
>>>>>>
>>>>>> So the messages clearly are about something trying to open a device
>>>>>> that went away at the block layer, but somehow does not get removed
>>>>>> in time by udev (which seems to be a userspace bug in CoreOS). But
>>>>>> even with that we really should not hang.
>>>>>
>>>>> Xiao Ni provides one script[1] which can reproduce the issue more or less.
>>>>
>>>> I've run the reproducer 10000 times on current mainline, and while
>>>> it prints one of the autoloading messages per run, I've not actually
>>>> seen any kind of hang.
>>>
>>> I can't reproduce the hang either.
>>
>> I obviously can reproduce the issue with the test in our Fedora CoreOS
>> test suite. It's part of a framework (i.e. it's not simply some script
>> you can run) but it is very reproducible so one can add some instrumentation
>> to the kernel and feed it through a build/test cycle to see different
>> results or logs.
>>
>> I'm willing to share this with other people (maybe a screen share or
>> some written down instructions) if anyone would be interested.
>
> This thread looked stalled, or was there any progress in the past week?
> If not: Fedora apparently removed the patch in their kernels a while
> ago, as quite a few users were hitting it. What is preventing us from
> doing the same in mainline and 5.19.y until the issue can be resolved?
> The description of a09b314005f3 ("block: freeze the queue earlier in
> del_gendisk") doesn't sound like the change does something crucial that
> can't wait a bit. I might be totally wrong with that, but I think it's
> my duty to ask that question at this point.

Christoph and I discussed this one last week, and he has a plan to try
a flag approach. Christoph, did you get a chance to bang that out? Would
be nice to get this one wrapped up.
--
Jens Axboe