Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected
From: Chris Clayton
Date: Sat Feb 18 2023 - 07:22:34 EST
On 15/02/2023 11:09, Karol Herbst wrote:
> On Wed, Feb 15, 2023 at 11:36 AM Linux regression tracking #update
> (Thorsten Leemhuis) <regressions@xxxxxxxxxxxxx> wrote:
>>
>> On 13.02.23 10:14, Chris Clayton wrote:
>>> On 13/02/2023 02:57, Dave Airlie wrote:
>>>> On Sun, 12 Feb 2023 at 00:43, Chris Clayton <chris2553@xxxxxxxxxxxxxx> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 10/02/2023 19:33, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>>>> On 10.02.23 20:01, Karol Herbst wrote:
>>>>>>> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
>>>>>>> Leemhuis) <regressions@xxxxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> On 08.02.23 09:48, Chris Clayton wrote:
>>>>>>>>>
>>>>>>>>> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
>>>>>>>>
>>>>>>>> Yeah, looks like it. That's unfortunate, but happens. But there is still
>>>>>>>> time to fix it and there is one thing I wonder:
>>>>>>>>
>>>>>>>> Did any of the nouveau developers look at the netconsole captures Chris
>>>>>>>> posted more than a week ago to check if they somehow help to track down
>>>>>>>> the root of this problem?
>>>>>>>
>>>>>>> I did now and I can't spot anything. I think at this point it would
>>>>>>> make sense to dump the active tasks/threads via sysrq keys to see if
>>>>>>> any is in a weird state preventing the machine from shutting down.
>>>>>>
>>>>>> Many thx for looking into it!
>>>>>
>>>>> Yes, thanks Karol.
>>>>>
>>>>> Attached is the output from dmesg when this block of code:
>>>>>
>>>>> /bin/mount /dev/sda7 /mnt/sda7
>>>>> /bin/mountpoint /proc || /bin/mount /proc
>>>>> /bin/dmesg -w > /mnt/sda7/sysrq.dmesg.log &
>>>>> /bin/echo t > /proc/sysrq-trigger
>>>>> /bin/sleep 1
>>>>> /bin/sync
>>>>> /bin/sleep 1
>>>>> kill $(pidof dmesg)
>>>>> /bin/umount /mnt/sda7
>>>>>
>>>>> is executed immediately before /sbin/reboot is called as the final step of rebooting my system.
>>>>>
>>>>> I hope this is what you were looking for, but if not, please let me know what you need
>>>
>>> Thanks Dave. [...]
>> FWIW, in case anyone strands here in the archives: the msg was
>> truncated. The full post can be found in a new thread:
>>
>> https://lore.kernel.org/lkml/e0b80506-b3cf-315b-4327-1b988d86031e@xxxxxxxxxxxxxx/
>>
>> Sadly it seems the info "With runpm=0, both reboot and poweroff work on
>> my laptop." didn't bring us much further to a solution. :-/ I don't
>> really like it, but for regression tracking I'm now putting this on the
>> back-burner, as a fix is not in sight.
>>
>> #regzbot monitor:
>> https://lore.kernel.org/lkml/e0b80506-b3cf-315b-4327-1b988d86031e@xxxxxxxxxxxxxx/
>> #regzbot backburner: hard to debug and apparently rare
>> #regzbot ignore-activity
>>
>
> yeah.. this bug looks a little annoying. Sadly the only Turing based
> laptop I got doesn't work on Nouveau because of firmware related
> issues and we probably need to get updated ones from Nvidia here :(
>
> But it's a bit weird that the kernel doesn't shut down, because I don't
> see anything in the logs which would prevent that from happening.
> Unless it's waiting on one of the tasks to complete, but none of them
> looked in any way nouveau related.
>
> If somebody else has any fancy kernel debugging tips here to figure
> out why it hangs, that would be very helpful...
>
I think I've figured this out. It's to do with how my system is configured. I do have an initrd, but the only thing on
it is the CPU microcode, which, it is recommended, should be loaded early. The absence of the NVidia firmware from the
initrd doesn't matter at boot because the drivers for the hardware that needs to load firmware are all built as modules.
So, by the time the devices are configured via udev, the root partition is mounted and the drivers can get at the
firmware.

I've found, by turning on nouveau debug and taking a video of the screen as the system shuts down, that nouveau seems to
be trying to run the scrubber very late in the shutdown process. The problem is that, by this time, I think the root
partition, and thus the scrubber binary, has become inaccessible.

I seem to have two choices: either make the firmware accessible on an initrd, or unload the module in a shutdown script
before the scrubber binary becomes inaccessible. The latter is the workaround I have been using whilst the problem I
reported has been under investigation. For simplicity, I think I'll promote my workaround to being the permanent
solution.
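
For anyone landing here from the archives, a minimal sketch of what such a shutdown-script step could look like follows.
It is not the script described above. The function names and the MODULES_FILE and MODPROBE variables are hypothetical,
and it assumes the step runs while the root partition is still mounted and that nothing (a display server, a framebuffer
console) still holds the module, since modprobe -r fails on a module that is in use.

```shell
#!/bin/sh
# Hypothetical shutdown-script helper; names are illustrative assumptions.
# Intended to run BEFORE the root filesystem is unmounted or remounted
# read-only, so that firmware files are still reachable at unload time.

MODULES_FILE="${MODULES_FILE:-/proc/modules}"   # list of loaded modules
MODPROBE="${MODPROBE:-/sbin/modprobe}"          # module (un)loader

# Return 0 if the named module appears in the loaded-module list.
# Each line of /proc/modules starts with the module name and a space.
module_loaded() {
    grep -q "^$1 " "$MODULES_FILE" 2>/dev/null
}

# Unload the module if it is loaded; otherwise report and do nothing.
unload_if_loaded() {
    if module_loaded "$1"; then
        "$MODPROBE" -r "$1"
    else
        echo "$1 not loaded, nothing to do"
    fi
}
```

A shutdown script would then call `unload_if_loaded nouveau` at a point before the step that takes the root partition
away. Where exactly that point is depends on the init system in use.
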
So, apologies (and thanks) to everyone whose time I have taken up with this non-bug.
Chris
>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>> --
>> Everything you wanna know about Linux kernel regression tracking:
>> https://linux-regtracking.leemhuis.info/about/#tldr
>> That page also explains what to do if mails like this annoy you.
>>
>> #regzbot ignore-activity
>>
>