Re: power-off delay/hang due to commit 6d25be57 (mainline)

From: Sebastian Andrzej Siewior
Date: Tue Aug 11 2020 - 11:25:56 EST


On 2020-08-11 16:34:09 [+0200], Rafael J. Wysocki wrote:
> On Tue, Aug 11, 2020 at 3:29 PM Sebastian Andrzej Siewior
> <bigeasy@xxxxxxxxxxxxx> wrote:
> >
> > On 2020-08-11 13:58:39 [+0200], Stephen Berman wrote:
> > > him about your workaround of adding 'thermal.tzp=300' to the kernel
> > > commandline, and he replied that this works for him too. And it turns
> > > out we have similar motherboards: I have a Gigabyte Z390 M Gaming
> > > Rev. 1001 board and he has Gigabyte Z390 Designare rev 1.0.
> >
> > Yes. Based on latest dmesg, the ACPI tables contain code which schedules
> > the worker and takes so long. It is possible / likely that his board
> > contains the same tables which leads to the same effect. After all those
> > two boards are very similar from the naming part :)
> > Would you mind to dump the ACPI tables and send them? There might be
> > some hints.
>
> Do we have a BZ for this? It would be useful to open one if not.

No, it came in via lkml and I looked at it because it was bisected to a
workqueue commit with my sign-off…
Stephen, can you open a bug on https://bugzilla.kernel.org/?

> > It might be possible that a BIOS update fixes the problem but I would
> > prefer very much to fix this in kernel to ensure that such a BIOS does
> > not lead to this problem again.
>
> I agree.
>
> It looks like one way to address this issue might be to add a rate
> limit for thermal notifications on a given zone.

So one thing is that ACPI says to poll every second and the driver is
doing it. This could be raised to something like 15 or 30 seconds as a
sane lower limit. I don't think there is much value in polling this
sensor every second. As a workaround, Stephen is using `thermal.tzp=300'
now.
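
To make the lower-limit idea concrete, here is a rough userspace sketch,
not a patch. It assumes the polling interval is kept in tenths of a
second (the unit thermal.tzp uses, so 300 means 30 seconds) and that 0
means "polling disabled". The names MIN_POLL_DS and clamp_polling_ds()
are made up for illustration and do not exist in the driver; the 15
second floor is just the lower of the two values suggested above.

```c
/* Hypothetical floor: 15 s, expressed in tenths of a second. */
#define MIN_POLL_DS 150u

/*
 * Clamp a requested polling interval (in tenths of a second) to the
 * floor. A value of 0 is treated as "polling disabled" and is passed
 * through unchanged, so the clamp never turns polling on by itself.
 */
static unsigned int clamp_polling_ds(unsigned int requested_ds)
{
	if (requested_ds == 0)
		return 0;

	return requested_ds < MIN_POLL_DS ? MIN_POLL_DS : requested_ds;
}
```

With this, a firmware request of 10 (one second) would become 150,
while an explicit 300 from the command line would stay as it is.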

Would it make sense to flush the workqueue before checking the
temperature? I have no idea what the ACPI is doing there but there is no
upper limit on how long it may take, right? Doing this inline (and
avoiding the worker) would probably cause other trouble, right?

Sebastian