Re: [PATCH v1] thermal: gov_step_wise: Go straight to instance->lower when mitigation is over

From: Rafael J. Wysocki
Date: Tue Jun 25 2024 - 08:33:20 EST


On Tue, Jun 25, 2024 at 9:34 AM Johan Hovold <johan@xxxxxxxxxx> wrote:
>
> On Sat, Jun 22, 2024 at 02:26:33PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> >
> > Commit b6846826982b ("thermal: gov_step_wise: Restore passive polling
> > management") attempted to fix a Step-Wise thermal governor issue
> > introduced by commit 042a3d80f118 ("thermal: core: Move passive polling
> > management to the core"), which caused the governor to leave cooling
> > devices in high states, by partially revering that commit.
>
> typo: reverting

Thanks!

> > However, this turns out to be insufficient on some systems due to
> > interactions between the governor code restored by commit b6846826982b
> > and the passive polling management in the thermal core.
>
> Care to elaborate on what went wrong here? In my test of the previous
> fix I saw the frequency ramping up in steps as expected when the
> temperature dropped. Under what circumstances would that fail to happen?

System suspend-resume would interfere with that as it would call
thermal_zone_device_init().

> > For this reason, revert commit b6846826982b and make the governor set
> > the target cooling device state to the "lower" one as soon as the zone
> > temperature falls below the threshold of the trip point corresponding
> > to the given thermal instance, which means that thermal mitigation is
> > not necessary any more.
> >
> > Before this change the "lower" cooling device state would be reached in
> > steps through the passive polling mechanism which was questionable for
> > three reasons: (1) cooling device were kept in high states when that was
> > not necessary (and it could adversely impact performance), (2) it only
> > worked for thermal zones with nonzero passive_delay_jiffies value, and
> > (3) passive polling belongs to the core and should not be hijacked by
> > governors for their internal purposes.
>
> I've tested this patch on the Lenovo ThinkPad X13s, where I could
> reproduce the rc1 regression, and things works as intended with the
> fix applied to rc5:
>
> Tested-by: Johan Hovold <johan+linaro@xxxxxxxxxx>
>
> The CPU frequency still oscillates heavily but now with a more
> sawtoothed curve.
>
> Not sure if it helps with performance, though, as running the CPU at
> full speed as soon as we drop below the threshold (with hysteresis)
> also means that we get back to running at the lowest frequency even
> faster.

True, but assuming that the reason for the mitigation is still there.
If it's actually gone, it's better to stop cooling as soon as it can
be done.

Thanks!