Re: power-off delay/hang due to commit 6d25be57 (mainline)

From: Stephen Berman
Date: Tue Jun 16 2020 - 16:29:04 EST


On Tue, 16 Jun 2020 17:55:01 +0200 Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx> wrote:

> On 2020-06-16 10:13:27 [+0200], Stephen Berman wrote:
>> Yes, thanks, that did it. Trace attached.
>
> So TZ10 is a temperature sensor of some kind on your motherboard. In
> your v5.6 dmesg there is:
> | thermal LNXTHERM:00: registered as thermal_zone0
> | ACPI: Thermal Zone [TZ10] (17 C)
>
> So. In /sys/class/thermal/thermal_zone0/device/path you should also see
> TZ10. And /sys/class/thermal/thermal_zone0/temp should show the actual
> value.
> This comes from the "thermal" module.

Yes, TZ10 was in the thermal_zone0/device/path and the value in
thermal_zone0/temp was 16800.
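
For anyone repeating this check, here is a rough sketch of how the two
files can be read for every zone at once. The helper name millideg_to_c
is only illustrative; the sketch relies on sysfs reporting temp in
millidegrees Celsius, so 16800 corresponds to 16.8 C, in line with the
"(17 C)" in dmesg.

```shell
#!/bin/sh
# Sketch: list each thermal zone with its ACPI path and temperature.
# sysfs reports "temp" in millidegrees Celsius.

# Hypothetical helper: convert millidegrees Celsius to degrees, one decimal.
millideg_to_c() {
    awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 1000 }'
}

for zone in /sys/class/thermal/thermal_zone*; do
    # Skip if the glob did not match or temp is not readable.
    [ -r "$zone/temp" ] || continue
    # device/path only exists for ACPI-backed zones; fall back to "?".
    path=$(cat "$zone/device/path" 2>/dev/null || echo "?")
    raw=$(cat "$zone/temp" 2>/dev/null || echo 0)
    printf '%s %s %s C\n' "${zone##*/}" "$path" "$(millideg_to_c "$raw")"
done
```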

> Looking at the trace, it might query the temperature every second which
> somehow results in "Dispatching Notify on". I don't understand how it
> gets from reading of the temperature to the notify part, maybe it is
> part of the ACPI...
>
> However. Could you please make sure that the thermal module is not
> loaded at system startup? Adding
>     thermal.off=1
>
> to the kernel commandline should do the trick. And you should see
>     thermal control disabled
>
> in dmesg.

Confirmed. And the value in thermal_zone0/temp was now 33000.
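
A minimal sketch of how this can be confirmed from a shell, assuming
/proc/cmdline is readable and dmesg is available (reading the kernel log
may need root on some systems). The has_param helper is only an
illustrative name, not an existing tool.

```shell
#!/bin/sh
# Sketch: confirm that thermal.off=1 reached the kernel.

# Hypothetical helper: succeed if PARAM ($1) appears as a whole
# space-separated word in CMDLINE ($2).
has_param() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

cmdline=$(cat /proc/cmdline 2>/dev/null || true)
if has_param "thermal.off=1" "$cmdline"; then
    echo "thermal.off=1 is on the kernel command line"
else
    echo "thermal.off=1 is not on the kernel command line"
fi

# The message quoted above should then show up in the kernel log.
dmesg 2>/dev/null | grep -i "thermal control disabled" || true
```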

> That means your thermal_zone0 with TZ10 does not show up in
> /sys and nothing should schedule the work-items. This in turn should
> allow you to shut down your system without the delay.

It did!

> If this works, could you please try to load the module with tzp=300?
> If you add this
>     thermal.tzp=300
>
> to the kernel commandline then it should do the trick. You can verify it
> by
>     cat /sys/module/thermal/parameters/tzp
>
> This should change the polling interval from what ACPI says to 30 secs.
> This should ensure that you don't have so many workers waiting. So you
> should also be able to shut down the system.

Your assessment and predictions are right on the mark!
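
For the record, a small sketch of the verification step. tzp is given in
tenths of a second, which is why 300 means 30 seconds; the helper name
tzp_to_seconds is just for illustration.

```shell
#!/bin/sh
# Sketch: read the thermal module's tzp parameter and show it in seconds.

# Hypothetical helper: tzp is in tenths of a second; integer division
# rounds down to whole seconds.
tzp_to_seconds() {
    echo $(( $1 / 10 ))
}

tzp_file=/sys/module/thermal/parameters/tzp
if [ -r "$tzp_file" ]; then
    tzp=$(cat "$tzp_file")
    echo "tzp=$tzp (polling every $(tzp_to_seconds "$tzp") s)"
else
    echo "$tzp_file not found (thermal module not loaded?)"
fi
```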

I'm fine with the thermal.tzp=300 workaround, but it would be good to
find out why this problem started with commit 6d25be57, if my git
bisection was correct, or if it wasn't, then at least somewhere between
5.1.0 and 5.2.0. Or can you already deduce why? If not, I'd be more
than happy to continue applying any patches or trying any suggestions
you have, if you want to continue debugging this issue. In any case,
thanks for pursuing it to this point.

Steve Berman