Re: [PATCH v3] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid

From: Daniel Lezcano
Date: Mon Jul 15 2024 - 05:09:57 EST


On 15/07/2024 06:45, Eric Biggers wrote:
Hello,

On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>

Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
if zone temperature is invalid") caused __thermal_zone_device_update()
to return early if the current thermal zone temperature was invalid.

This was done to avoid running handle_thermal_trip() and governor
callbacks in that case which led to confusion. However, it went too
far because monitor_thermal_zone() still needs to be called even when
the zone temperature is invalid to ensure that it will be updated
eventually in case thermal polling is enabled and the driver has no
other means to notify the core of zone temperature changes (for example,
it does not register an interrupt handler or ACPI notifier).

Also if the .set_trips() zone callback is expected to set up monitoring
interrupts for a thermal zone, it needs to be provided with valid
boundaries and that can only be done if the zone temperature is known.

Accordingly, to ensure that __thermal_zone_device_update() will
run again after a failing zone temperature check, make it call
monitor_thermal_zone() regardless of whether or not the zone
temperature is valid and make the latter schedule a thermal zone
temperature update if the zone temperature is invalid even if
polling is not enabled for the thermal zone (however, if this
continues to fail, give up after some time).

Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
Reported-by: Daniel Lezcano <daniel.lezcano@xxxxxxxxxx>
Link: https://lore.kernel.org/linux-pm/dc1e6cba-352b-4c78-93b5-94dd033fca16@xxxxxxxxxx
Link: https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@xxxxxxxxxxxxx
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>

On v6.10 I'm seeing the following messages spammed to the kernel log endlessly,
and reverting this commit fixes it.

[ 156.410567] thermal thermal_zone0: failed to read out thermal zone (-61)
[ 156.666583] thermal thermal_zone0: failed to read out thermal zone (-61)
[ 156.922598] thermal thermal_zone0: failed to read out thermal zone (-61)
[ 157.178613] thermal thermal_zone0: failed to read out thermal zone (-61)
[ 157.434636] thermal thermal_zone0: failed to read out thermal zone (-61)
[ 157.690774] thermal thermal_zone0: failed to read out thermal zone (-61)
[ 157.946659] thermal thermal_zone0: failed to read out thermal zone (-61)
[ 158.202717] thermal thermal_zone0: failed to read out thermal zone (-61)
[ 158.458697] thermal thermal_zone0: failed to read out thermal zone (-61)

/sys/class/thermal/thermal_zone0/type contains "iwlwifi_1".

Does the following change fix the messages?

diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/tt.c b/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
index 61a4638d1be2..b519db76d402 100644
--- a/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
+++ b/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
@@ -622,7 +622,7 @@ static int iwl_mvm_tzone_get_temp(struct thermal_zone_device *device,

 	if (!iwl_mvm_firmware_running(mvm) ||
 	    mvm->fwrt.cur_fw_img != IWL_UCODE_REGULAR) {
-		ret = -ENODATA;
+		ret = -EAGAIN;
 		goto out;
 	}
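
To illustrate why the error code matters, here is a minimal userspace
sketch, not kernel code. It assumes the thermal core prints the "failed to
read out thermal zone" warning for every .get_temp() failure except -EAGAIN,
which it treats as "no data yet, try again later". All names in it
(fake_zone, fake_get_temp_old, fake_get_temp_new, fake_update_temperature)
are hypothetical stand-ins, not real kernel symbols.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the state the iwlwifi callback looks at. */
struct fake_zone {
	bool firmware_running;
	int temp_mc;	/* temperature in millidegrees Celsius */
	int warnings;	/* warnings printed by the simulated core */
};

typedef int (*fake_get_temp_fn)(const struct fake_zone *z, int *temp);

/* Models the callback before the change: "not ready" is -ENODATA. */
static int fake_get_temp_old(const struct fake_zone *z, int *temp)
{
	if (!z->firmware_running)
		return -ENODATA;
	*temp = z->temp_mc;
	return 0;
}

/* Models the callback after the change: "not ready" is -EAGAIN. */
static int fake_get_temp_new(const struct fake_zone *z, int *temp)
{
	if (!z->firmware_running)
		return -EAGAIN;
	*temp = z->temp_mc;
	return 0;
}

/*
 * Assumed core behaviour (an assumption of this sketch): any failure is
 * reported with a warning, except -EAGAIN, which is silently retried.
 */
static int fake_update_temperature(struct fake_zone *z,
				   fake_get_temp_fn get_temp, int *temp)
{
	int ret = get_temp(z, temp);

	if (ret && ret != -EAGAIN) {
		fprintf(stderr, "failed to read out thermal zone (%d)\n", ret);
		z->warnings++;
	}
	return ret;
}
```

With the firmware not running, calling fake_update_temperature() with
fake_get_temp_old bumps the warning counter on every poll, while
fake_get_temp_new leaves it untouched. If the core behaves as assumed, that
is the effect the diff above is after.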


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
