Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

From: Chris Clayton
Date: Mon Feb 20 2023 - 05:51:59 EST




On 20/02/2023 05:35, Ben Skeggs wrote:
> On Sun, 19 Feb 2023 at 04:55, Chris Clayton <chris2553@xxxxxxxxxxxxxx> wrote:
>>
>>
>>
>> On 18/02/2023 15:19, Chris Clayton wrote:
>>>
>>>
>>> On 18/02/2023 12:25, Karol Herbst wrote:
>>>> On Sat, Feb 18, 2023 at 1:22 PM Chris Clayton <chris2553@xxxxxxxxxxxxxx> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 15/02/2023 11:09, Karol Herbst wrote:
>>>>>> On Wed, Feb 15, 2023 at 11:36 AM Linux regression tracking #update
>>>>>> (Thorsten Leemhuis) <regressions@xxxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> On 13.02.23 10:14, Chris Clayton wrote:
>>>>>>>> On 13/02/2023 02:57, Dave Airlie wrote:
>>>>>>>>> On Sun, 12 Feb 2023 at 00:43, Chris Clayton <chris2553@xxxxxxxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 10/02/2023 19:33, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>>>>>>>>> On 10.02.23 20:01, Karol Herbst wrote:
>>>>>>>>>>>> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
>>>>>>>>>>>> Leemhuis) <regressions@xxxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 08.02.23 09:48, Chris Clayton wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yeah, looks like it. That's unfortunate, but happens. But there is still
>>>>>>>>>>>>> time to fix it and there is one thing I wonder:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Did any of the nouveau developers look at the netconsole captures Chris
>>>>>>>>>>>>> posted more than a week ago to check if they somehow help to track down
>>>>>>>>>>>>> the root of this problem?
>>>>>>>>>>>>
>>>>>>>>>>>> I did now and I can't spot anything. I think at this point it would
>>>>>>>>>>>> make sense to dump the active tasks/threads via sysrq keys to see if
>>>>>>>>>>>> any is in a weird state preventing the machine from shutting down.
>>>>>>>>>>>
>>>>>>>>>>> Many thx for looking into it!
>>>>>>>>>>
>>>>>>>>>> Yes, thanks Karol.
>>>>>>>>>>
>>>>>>>>>> Attached is the output from dmesg when this block of code:
>>>>>>>>>>
>>>>>>>>>> /bin/mount /dev/sda7 /mnt/sda7
>>>>>>>>>> /bin/mountpoint /proc || /bin/mount /proc
>>>>>>>>>> /bin/dmesg -w > /mnt/sda7/sysrq.dmesg.log &
>>>>>>>>>> /bin/echo t > /proc/sysrq-trigger
>>>>>>>>>> /bin/sleep 1
>>>>>>>>>> /bin/sync
>>>>>>>>>> /bin/sleep 1
>>>>>>>>>> kill $(pidof dmesg)
>>>>>>>>>> /bin/umount /mnt/sda7
>>>>>>>>>>
>>>>>>>>>> is executed immediately before /sbin/reboot is called as the final step of rebooting my system.
>>>>>>>>>>
>>>>>>>>>> I hope this is what you were looking for, but if not, please let me know what you need.
>>>>>>>>
>>>>>>>> Thanks Dave. [...]
>>>>>>> FWIW, in case anyone strands here in the archives: the msg was
>>>>>>> truncated. The full post can be found in a new thread:
>>>>>>>
>>>>>>> https://lore.kernel.org/lkml/e0b80506-b3cf-315b-4327-1b988d86031e@xxxxxxxxxxxxxx/
>>>>>>>
>>>>>>> Sadly it seems the info "With runpm=0, both reboot and poweroff work on
>>>>>>> my laptop." didn't bring us much further to a solution. :-/ I don't
>>>>>>> really like it, but for regression tracking I'm now putting this on the
>>>>>>> back-burner, as a fix is not in sight.
>>>>>>>
>>>>>>> #regzbot monitor:
>>>>>>> https://lore.kernel.org/lkml/e0b80506-b3cf-315b-4327-1b988d86031e@xxxxxxxxxxxxxx/
>>>>>>> #regzbot backburner: hard to debug and apparently rare
>>>>>>> #regzbot ignore-activity
>>>>>>>
>>>>>>
>>>>>> yeah.. this bug looks a little annoying. Sadly the only Turing based
>>>>>> laptop I got doesn't work on Nouveau because of firmware related
>>>>>> issues and we probably need to get updated ones from Nvidia here :(
>>>>>>
>>>>>> But it's a bit weird that the kernel doesn't shut down, because I don't
>>>>>> see anything in the logs which would prevent that from happening.
>>>>>> Unless it's waiting on one of the tasks to complete, but none of them
>>>>>> looked in any way nouveau related.
>>>>>>
>>>>>> If somebody else has any fancy kernel debugging tips here to figure
>>>>>> out why it hangs, that would be very helpful...
>>>>>>
>>>>>
>>>>> I think I've figured this out. It's to do with how my system is configured. I do have an initrd, but the only thing on
>>>>> it is the cpu microcode which, it is recommended, should be loaded early. The absence of the NVidia firmware from an
>>>>> initrd doesn't matter because the drivers for the hardware that need to load firmware are all built as modules. So, by
>>>>> the time the devices are configured via udev, the root partition is mounted and the drivers can get at the firmware.
>>>>>
>>>>> I've found, by turning on nouveau debug and taking a video of the screen as the system shuts down, that nouveau seems to
>>>>> be trying to run the scrubber very very late in the shutdown process. The problem is that by this time, I think the root
>>>>> partition, and thus the scrubber binary, have become inaccessible.
>>>>>
>>>>> I seem to have two choices - either make the firmware accessible on an initrd or unload the module in a shutdown script
>>>>> before the scrubber binary becomes inaccessible. The latter of these is the workaround I have implemented whilst the
>>>>> problem I reported has been under investigation. For simplicity, I think I'll promote my workaround to being the
>>>>> permanent solution.
>>>>>
>>>>> So, apologies (and thanks) to everyone whose time I have taken up with this non-bug.
>>>>>
>>>>
>>>> Well.. nouveau shouldn't prevent the system from shutting down if the
>>>> firmware file isn't available. Or at least it should print a
>>>> warning/error. Mind messing with the code a little to see if skipping
>>>> it kind of works? I probably can also come up with a patch by next
>>>> week.
>>>>
>>> Well, I'd love to but a quick glance at the code caused me to bump into this obscenity:
>>>
>>> int
>>> gm200_flcn_reset_wait_mem_scrubbing(struct nvkm_falcon *falcon)
>>> {
>>>         nvkm_falcon_mask(falcon, 0x040, 0x00000000, 0x00000000);
>>>
>>>         if (nvkm_msec(falcon->owner->device, 10,
>>>                 if (!(nvkm_falcon_rd32(falcon, 0x10c) & 0x00000006))
>>>                         break;
>>>         ) < 0)
>>>                 return -ETIMEDOUT;
>>>
>>>         return 0;
>>> }
>>>
>>> nvkm_msec is #defined to nvkm_usec, which in turn is #defined to nvkm_nsec, which is where the loop that the break
>>> belongs to appears.
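
For anyone reading this in the archives, here is a minimal user-space sketch of
the macro pattern described above: the statements passed at the call site are
pasted into a do/while loop inside the macro, which is why a bare "break" can
appear in what looks like a function argument. This is an illustration only and
not the nouveau code. The names poll_nsec, poll_usec, poll_msec, elapsed_or_timeout
and wait_bits_clear are made up for the example, the register is an ordinary
variable, and it relies on the GNU C statement-expression extension (GCC/Clang).

```c
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Current time in nanoseconds. TIME_UTC is wall-clock time; it is used here
 * only to keep the sketch in standard C11. */
static int64_t now_ns(void)
{
	struct timespec ts;

	timespec_get(&ts, TIME_UTC);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Elapsed time since 'start', or -1 once more than 'limit' ns have passed. */
static int64_t elapsed_or_timeout(int64_t start, int64_t limit)
{
	int64_t taken = now_ns() - start;

	return taken <= limit ? taken : -1;
}

/*
 * Hypothetical polling macro. The caller's statements become the body of the
 * do/while loop, so a "break" written at the call site leaves this loop.
 * The whole expression evaluates to a value >= 0 if the body broke out, or
 * to -1 if the time limit expired first.
 */
#define poll_nsec(ns, ...) ({                                          \
	const int64_t _limit = (ns);                                   \
	const int64_t _start = now_ns();                               \
	int64_t _taken = 0;                                            \
	do {                                                           \
		__VA_ARGS__                                            \
	} while ((_taken = elapsed_or_timeout(_start, _limit)) >= 0);  \
	_taken;                                                        \
})
#define poll_usec(us, ...) poll_nsec((us) * 1000LL, __VA_ARGS__)
#define poll_msec(ms, ...) poll_usec((ms) * 1000LL, __VA_ARGS__)

/* Same shape as the quoted function: wait up to 10 ms for bits to clear. */
static int wait_bits_clear(const volatile uint32_t *reg, uint32_t mask)
{
	if (poll_msec(10,
		if (!(*reg & mask))
			break;
	) < 0)
		return -ETIMEDOUT;

	return 0;
}
```

In this sketch the time limit is only checked each time the loop condition is
evaluated, so it depends on the pasted-in body returning promptly.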
>>
>> I think someone who knows the code needs to look at this. What I can confirm is that after a freeze, I waited for 90
>> seconds for a timeout to occur, but it didn't.
> Hey,
>
> Are you able to try the attached patch for me please?
>
> Thanks,
> Ben.
>

Thanks Ben.

Yes, this patch fixes the lockup on reboot and poweroff that I've been seeing on my laptop. As you would expect,
offloaded rendering is still working and the discrete GPU is being powered on and off as required.

Thanks.

Reported-by: Chris Clayton <chris2553@xxxxxxxxxxxxxx>
Tested-by: Chris Clayton <chris2553@xxxxxxxxxxxxxx>

>>
>>
>>> Chris
>>>>>
>>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>>>>> --
>>>>>>> Everything you wanna know about Linux kernel regression tracking:
>>>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>>>>> That page also explains what to do if mails like this annoy you.
>>>>>>>
>>>>>>> #regzbot ignore-activity
>>>>>>>
>>>>>>
>>>>>
>>>>