Re: [PATCH v3] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid

From: Daniel Lezcano
Date: Tue Jul 16 2024 - 08:10:35 EST


On 16/07/2024 13:36, Rafael J. Wysocki wrote:
On Tue, Jul 16, 2024 at 1:15 PM Stefan Lippers-Hollmann <s.l-h@xxxxxx> wrote:

Hi

On 2024-07-16, Stefan Lippers-Hollmann wrote:
On 2024-07-16, Rafael J. Wysocki wrote:
On Tue, Jul 16, 2024 at 1:48 AM Stefan Lippers-Hollmann <s.l-h@xxxxxx> wrote:
On 2024-07-15, Rafael J. Wysocki wrote:
On Mon, Jul 15, 2024 at 2:54 PM Stefan Lippers-Hollmann <s.l-h@xxxxxx> wrote:
On 2024-07-15, Rafael J. Wysocki wrote:
On Mon, Jul 15, 2024 at 11:09 AM Daniel Lezcano
<daniel.lezcano@xxxxxxxxxx> wrote:
On 15/07/2024 06:45, Eric Biggers wrote:
On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
[...]
Silencing the warnings is already a big improvement - and that patch
works to this extent for me with an ax200, thanks.

So attached is a patch that should avoid enabling the thermal zone
when it is not ready for use in the first place, so it should address
both the message and the useless polling.

I would appreciate giving it a go (please note that it hasn't received
much testing so far, though).

Sadly this patch doesn't seem to help:

This is likely because it is missing checks for firmware image type.
I've added them to the attached new version. Please try it.

I've also added two pr_info() messages to get a better idea of what's
going on, so please grep dmesg for "Thermal zone not ready" and
"Enabling thermal zone".

This is the output with the patch applied:

The ax200 wlan interface is currently not up/ configured (system
using its wired ethernet cards instead), the thermal_zone1 stops
if I manually enable the interface (ip link set dev wlp4s0 up)
after booting up:

This explains it, thanks!

The enabling of the thermal zone in iwl_mvm_load_ucode_wait_alive() is
premature or it should get disabled in the other two places that clear
the IWL_MVM_STATUS_FIRMWARE_RUNNING bit.

I'm not sure why the thermal zone depends on whether or not this bit
is set, though. Is it really a good idea to return errors from it if
the interface is not up?

$ dmesg | grep -i -e iwlwifi -e thermal
[ 0.080899] CPU0: Thermal monitoring enabled (TM1)
[ 0.113768] thermal_sys: Registered thermal governor 'fair_share'
[ 0.113770] thermal_sys: Registered thermal governor 'bang_bang'
[ 0.113771] thermal_sys: Registered thermal governor 'step_wise'
[ 0.113772] thermal_sys: Registered thermal governor 'user_space'
[ 0.113773] thermal_sys: Registered thermal governor 'power_allocator'
[ 3.759673] iwlwifi 0000:04:00.0: enabling device (0140 -> 0142)
[ 3.764918] iwlwifi 0000:04:00.0: Detected crf-id 0x3617, cnv-id 0x100530 wfpm id 0x80000000
[ 3.764974] iwlwifi 0000:04:00.0: PCI dev 2723/0084, rev=0x340, rfid=0x10a100
[ 3.769432] iwlwifi 0000:04:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.37
[ 3.873466] iwlwifi 0000:04:00.0: loaded firmware version 77.a20fb07d.0 cc-a0-77.ucode op_mode iwlmvm
[ 3.907122] iwlwifi 0000:04:00.0: Detected Intel(R) Wi-Fi 6 AX200 160MHz, REV=0x340
[ 3.907886] iwl_mvm_thermal_zone_register: Thermal zone not ready
[ 4.032380] iwlwifi 0000:04:00.0: Detected RF HR B3, rfid=0x10a100
[ 4.032392] thermal thermal_zone1: Enabling thermal zone
[ 4.098308] iwlwifi 0000:04:00.0: base HW address: 94:e6:f7:XX:XX:XX
[ 4.112535] thermal thermal_zone1: failed to read out thermal zone (-61)
[ 4.128306] iwlwifi 0000:04:00.0 wlp4s0: renamed from wlan0
[ 4.369396] thermal thermal_zone1: failed to read out thermal zone (-61)
[ 4.625385] thermal thermal_zone1: failed to read out thermal zone (-61)
[ 4.881416] thermal thermal_zone1: failed to read out thermal zone (-61)
[ 5.137377] thermal thermal_zone1: failed to read out thermal zone (-61)
[ 5.394377] thermal thermal_zone1: failed to read out thermal zone (-61)
[ 5.649412] thermal thermal_zone1: failed to read out thermal zone (-61)
[ 5.905379] thermal thermal_zone1: failed to read out thermal zone (-61)
[ 6.161380] thermal thermal_zone1: failed to read out thermal zone (-61)
[ 6.418381] thermal thermal_zone1: failed to read out thermal zone (-61)
[ 6.673381] thermal thermal_zone1: failed to read out thermal zone (-61)
[ 6.929377] thermal thermal_zone1: failed to read out thermal zone (-61)
[...]
[ 21.009413] thermal thermal_zone1: failed to read out thermal zone (-61)
[ 21.265496] thermal thermal_zone1: failed to read out thermal zone (-61)
[ 21.521462] thermal thermal_zone1: failed to read out thermal zone (-61)
[ 21.777481] thermal thermal_zone1: failed to read out thermal zone (-61)
[ 22.033468] thermal thermal_zone1: failed to read out thermal zone (-61)
[ 22.213120] thermal thermal_zone1: Enabling thermal zone
[ 22.283954] iwlwifi 0000:04:00.0: Registered PHC clock: iwlwifi-PTP, with index: 0

Thanks for this data point!

AFAICS the thermal zone in iwlwifi is always enabled, but only valid
if the interface is up. It looks to me like the thermal core needs a
special "don't poll me" error code to be returned in such cases.

From my POV, it is not up to the thermal core to adapt to the driver.

Usually network devices have ops when they are transitioning to up or down, would it make sense to move enable / disable the thermal zone in these ops ?




--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog