Re: [PATCH v3] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid

From: Rafael J. Wysocki
Date: Tue Jul 16 2024 - 08:18:55 EST


On Tue, Jul 16, 2024 at 2:10 PM Daniel Lezcano
<daniel.lezcano@xxxxxxxxxx> wrote:
>
> On 16/07/2024 13:36, Rafael J. Wysocki wrote:
> > On Tue, Jul 16, 2024 at 1:15 PM Stefan Lippers-Hollmann <s.l-h@xxxxxx> wrote:
> >>
> >> Hi
> >>
> >> On 2024-07-16, Stefan Lippers-Hollmann wrote:
> >>> On 2024-07-16, Rafael J. Wysocki wrote:
> >>>> On Tue, Jul 16, 2024 at 1:48 AM Stefan Lippers-Hollmann <s.l-h@xxxxxx> wrote:
> >>>>> On 2024-07-15, Rafael J. Wysocki wrote:
> >>>>>> On Mon, Jul 15, 2024 at 2:54 PM Stefan Lippers-Hollmann <s.l-h@xxxxxx> wrote:
> >>>>>>> On 2024-07-15, Rafael J. Wysocki wrote:
> >>>>>>>> On Mon, Jul 15, 2024 at 11:09 AM Daniel Lezcano
> >>>>>>>> <daniel.lezcano@xxxxxxxxxx> wrote:
> >>>>>>>>> On 15/07/2024 06:45, Eric Biggers wrote:
> >>>>>>>>>> On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
> >>>>>>>>>>> From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> >>>>> [...]
> >>>>>>> Silencing the warnings is already a big improvement - and that patch
> >>>>>>> works to this extent for me with an ax200, thanks.
> >>>>>>
> >>>>>> So attached is a patch that should avoid enabling the thermal zone
> >>>>>> when it is not ready for use in the first place, so it should address
> >>>>>> both the message and the useless polling.
> >>>>>>
> >>>>>> I would appreciate giving it a go (please note that it hasn't received
> >>>>>> much testing so far, though).
> >>>>>
> >>>>> Sadly this patch doesn't seem to help:
> >>>>
> >>>> This is likely because it is missing checks for firmware image type.
> >>>> I've added them to the attached new version. Please try it.
> >>>>
> >>>> I've also added two pr_info() messages to get a better idea of what's
> >>>> going on, so please grep dmesg for "Thermal zone not ready" and
> >>>> "Enabling thermal zone".
> >>>
> >>> This is the output with the patch applied:
> >>
> >> The ax200 wlan interface is currently not up/ configured (system
> >> using its wired ethernet cards instead), the thermal_zone1 stops
> >> if I manually enable the interface (ip link set dev wlp4s0 up)
> >> after booting up:
> >
> > This explains it, thanks!
> >
> > The enabling of the thermal zone in iwl_mvm_load_ucode_wait_alive() is
> > premature or it should get disabled in the other two places that clear
> > the IWL_MVM_STATUS_FIRMWARE_RUNNING bit.
> >
> > I'm not sure why the thermal zone depends on whether or not this bit
> > is set, though. Is it really a good idea to return errors from it if
> > the interface is not up?
> >
> >> $ dmesg | grep -i -e iwlwifi -e thermal
> >> [ 0.080899] CPU0: Thermal monitoring enabled (TM1)
> >> [ 0.113768] thermal_sys: Registered thermal governor 'fair_share'
> >> [ 0.113770] thermal_sys: Registered thermal governor 'bang_bang'
> >> [ 0.113771] thermal_sys: Registered thermal governor 'step_wise'
> >> [ 0.113772] thermal_sys: Registered thermal governor 'user_space'
> >> [ 0.113773] thermal_sys: Registered thermal governor 'power_allocator'
> >> [ 3.759673] iwlwifi 0000:04:00.0: enabling device (0140 -> 0142)
> >> [ 3.764918] iwlwifi 0000:04:00.0: Detected crf-id 0x3617, cnv-id 0x100530 wfpm id 0x80000000
> >> [ 3.764974] iwlwifi 0000:04:00.0: PCI dev 2723/0084, rev=0x340, rfid=0x10a100
> >> [ 3.769432] iwlwifi 0000:04:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.37
> >> [ 3.873466] iwlwifi 0000:04:00.0: loaded firmware version 77.a20fb07d.0 cc-a0-77.ucode op_mode iwlmvm
> >> [ 3.907122] iwlwifi 0000:04:00.0: Detected Intel(R) Wi-Fi 6 AX200 160MHz, REV=0x340
> >> [ 3.907886] iwl_mvm_thermal_zone_register: Thermal zone not ready
> >> [ 4.032380] iwlwifi 0000:04:00.0: Detected RF HR B3, rfid=0x10a100
> >> [ 4.032392] thermal thermal_zone1: Enabling thermal zone
> >> [ 4.098308] iwlwifi 0000:04:00.0: base HW address: 94:e6:f7:XX:XX:XX
> >> [ 4.112535] thermal thermal_zone1: failed to read out thermal zone (-61)
> >> [ 4.128306] iwlwifi 0000:04:00.0 wlp4s0: renamed from wlan0
> >> [ 4.369396] thermal thermal_zone1: failed to read out thermal zone (-61)
> >> [ 4.625385] thermal thermal_zone1: failed to read out thermal zone (-61)
> >> [ 4.881416] thermal thermal_zone1: failed to read out thermal zone (-61)
> >> [ 5.137377] thermal thermal_zone1: failed to read out thermal zone (-61)
> >> [ 5.394377] thermal thermal_zone1: failed to read out thermal zone (-61)
> >> [ 5.649412] thermal thermal_zone1: failed to read out thermal zone (-61)
> >> [ 5.905379] thermal thermal_zone1: failed to read out thermal zone (-61)
> >> [ 6.161380] thermal thermal_zone1: failed to read out thermal zone (-61)
> >> [ 6.418381] thermal thermal_zone1: failed to read out thermal zone (-61)
> >> [ 6.673381] thermal thermal_zone1: failed to read out thermal zone (-61)
> >> [ 6.929377] thermal thermal_zone1: failed to read out thermal zone (-61)
> >> [...]
> >> [ 21.009413] thermal thermal_zone1: failed to read out thermal zone (-61)
> >> [ 21.265496] thermal thermal_zone1: failed to read out thermal zone (-61)
> >> [ 21.521462] thermal thermal_zone1: failed to read out thermal zone (-61)
> >> [ 21.777481] thermal thermal_zone1: failed to read out thermal zone (-61)
> >> [ 22.033468] thermal thermal_zone1: failed to read out thermal zone (-61)
> >> [ 22.213120] thermal thermal_zone1: Enabling thermal zone
> >> [ 22.283954] iwlwifi 0000:04:00.0: Registered PHC clock: iwlwifi-PTP, with index: 0
> >
> > Thanks for this data point!
> >
> > AFAICS the thermal zone in iwlwifi is always enabled, but only valid
> > if the interface is up. It looks to me like the thermal core needs a
> > special "don't poll me" error code to be returned in such cases.
>
> From my POV, it is not up to the thermal core to adapt to the driver.

The core provides a service to its users, not the other way around,
and this is a valid use case.

The owner of the thermal zone knows that it is only useful when the
interface is up and so it should be possible for them to indicate to
the core that, for the time being, nothing needs to be done.

> Usually network devices have ops when they are transitioning to up or
> down, would it make sense to move enable / disable the thermal zone in
> these ops ?

Not really, because it can be enabled and disabled via sysfs in the meantime.