Re: [PATCH v3] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
From: Stefan Lippers-Hollmann
Date: Mon Jul 15 2024 - 05:07:56 EST
Hi
On 2024-07-14, Eric Biggers wrote:
> On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> >
> > Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
> > if zone temperature is invalid") caused __thermal_zone_device_update()
> > to return early if the current thermal zone temperature was invalid.
> >
> > This was done to avoid running handle_thermal_trip() and governor
> > callbacks in that case which led to confusion. However, it went too
> > far because monitor_thermal_zone() still needs to be called even when
> > the zone temperature is invalid to ensure that it will be updated
> > eventually in case thermal polling is enabled and the driver has no
> > other means to notify the core of zone temperature changes (for example,
> > it does not register an interrupt handler or ACPI notifier).
> >
> > Also if the .set_trips() zone callback is expected to set up monitoring
> > interrupts for a thermal zone, it needs to be provided with valid
> > boundaries and that can only be done if the zone temperature is known.
> >
> > Accordingly, to ensure that __thermal_zone_device_update() will
> > run again after a failing zone temperature check, make it call
> > monitor_thermal_zone() regardless of whether or not the zone
> > temperature is valid and make the latter schedule a thermal zone
> > temperature update if the zone temperature is invalid even if
> > polling is not enabled for the thermal zone (however, if this
> > continues to fail, give up after some time).
> >
> > Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
> > Reported-by: Daniel Lezcano <daniel.lezcano@xxxxxxxxxx>
> > Link: https://lore.kernel.org/linux-pm/dc1e6cba-352b-4c78-93b5-94dd033fca16@xxxxxxxxxx
> > Link: https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@xxxxxxxxxxxxx
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
>
> On v6.10 I'm seeing the following messages spammed to the kernel log endlessly,
> and reverting this commit fixes it.
>
> [ 156.410567] thermal thermal_zone0: failed to read out thermal zone (-61)
[...]
> [ 158.458697] thermal thermal_zone0: failed to read out thermal zone (-61)
>
> /sys/class/thermal/thermal_zone0/type contains "iwlwifi_1".
I am observing the same issue on v6.10 as well, with an Intel AX200 WLAN
card in a Kaby Lake (i5-7400) system with a Fujitsu D3400-B22
mainboard and the 'newest' BIOS (V5.0.0.12 R1.29.0):
$ dmesg | grep -i -e iwlwifi -e thermal_zone2
[ 3.692433] iwlwifi 0000:04:00.0: enabling device (0140 -> 0142)
[ 3.698547] iwlwifi 0000:04:00.0: Detected crf-id 0x3617, cnv-id 0x100530 wfpm id 0x80000000
[ 3.698556] iwlwifi 0000:04:00.0: PCI dev 2723/0084, rev=0x340, rfid=0x10a100
[ 3.703292] iwlwifi 0000:04:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.37
[ 3.797296] iwlwifi 0000:04:00.0: loaded firmware version 77.a20fb07d.0 cc-a0-77.ucode op_mode iwlmvm
[ 4.090341] iwlwifi 0000:04:00.0: Detected Intel(R) Wi-Fi 6 AX200 160MHz, REV=0x340
[ 4.090524] thermal thermal_zone2: failed to read out thermal zone (-61)
[ 4.218496] iwlwifi 0000:04:00.0: Detected RF HR B3, rfid=0x10a100
[ 4.285399] iwlwifi 0000:04:00.0: base HW address: 94:e6:f7:XX:XX:XX
[ 4.341754] iwlwifi 0000:04:00.0 wlp4s0: renamed from wlan0
[ 4.345445] thermal thermal_zone2: failed to read out thermal zone (-61)
[ 4.601400] thermal thermal_zone2: failed to read out thermal zone (-61)
[ 4.857372] thermal thermal_zone2: failed to read out thermal zone (-61)
[ 5.114387] thermal thermal_zone2: failed to read out thermal zone (-61)
[...]
[ 143.643801] thermal thermal_zone2: failed to read out thermal zone (-61)
[ 143.899818] thermal thermal_zone2: failed to read out thermal zone (-61)
[ 144.155813] thermal thermal_zone2: failed to read out thermal zone (-61)
[ 144.411815] thermal thermal_zone2: failed to read out thermal zone (-61)
[ 144.667828] thermal thermal_zone2: failed to read out thermal zone (-61)
[ 144.923801] thermal thermal_zone2: failed to read out thermal zone (-61)
[ 145.179822] thermal thermal_zone2: failed to read out thermal zone (-61)
[...]
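To put a number on how often the message repeats, the timestamps in the output above can be diffed. The following is only a minimal sketch: the function name failure_intervals and the SAMPLE constant are made up for illustration, and SAMPLE just repeats four consecutive lines from the dmesg output above.

```python
import re
from statistics import mean

# Matches lines such as
# "[ 4.345445] thermal thermal_zone2: failed to read out thermal zone (-61)"
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+thermal (?P<zone>thermal_zone\d+): "
    r"failed to read out thermal zone \((?P<err>-?\d+)\)"
)


def failure_intervals(log_text, zone):
    """Return the gaps (in seconds) between consecutive failures for one zone."""
    stamps = [
        float(m.group("ts"))
        for m in LINE_RE.finditer(log_text)
        if m.group("zone") == zone
    ]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


# Four consecutive lines copied from the dmesg output in this mail.
SAMPLE = """\
[ 4.345445] thermal thermal_zone2: failed to read out thermal zone (-61)
[ 4.601400] thermal thermal_zone2: failed to read out thermal zone (-61)
[ 4.857372] thermal thermal_zone2: failed to read out thermal zone (-61)
[ 5.114387] thermal thermal_zone2: failed to read out thermal zone (-61)
"""

if __name__ == "__main__":
    gaps = failure_intervals(SAMPLE, "thermal_zone2")
    print(f"{len(gaps)} gaps, mean {mean(gaps) * 1000:.0f} ms")
```

For these four lines the gaps come out at roughly 256 ms each, i.e. about four failed reads per second for as long as the zone cannot be read.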
$ cat /sys/class/thermal/thermal_zone2/type
iwlwifi_1
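For reference, the (-61) in the messages above is a negated errno value. A small sketch for decoding it follows; the helper name describe_kernel_error is hypothetical, and the lookup assumes it is run on Linux, because errno numbering is platform-specific (61 is ENODATA on Linux).

```python
import errno
import os


def describe_kernel_error(code):
    """Map a negative kernel-style return value to (errno name, message)."""
    number = abs(code)
    # errno.errorcode reflects the numbering of the platform running this
    # script; the result only matches the kernel log when run on Linux.
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    print(describe_kernel_error(-61))
```

On Linux this reports ENODATA, which fits a sensor that has no temperature reading to return.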
38cba05a86d157685d930a4400022eb4 /lib/firmware/iwlwifi-cc-a0-77.ucode
ce9c6e3bda22003f9a9b97cbca94b8215911b7a146c0f4f017963dbb1a233351 /lib/firmware/iwlwifi-cc-a0-77.ucode
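To compare a local firmware file against the two digests above, something like the following sketch could be used. It assumes the 32-character value is an MD5 sum and the 64-character value a SHA-256 sum (judging by their lengths only, since the commands that produced them are not shown); the function name file_digests is made up for illustration.

```python
import hashlib
import sys


def file_digests(path, chunk_size=1 << 20):
    """Return (md5_hex, sha256_hex) for the file at path, read in chunks."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()


if __name__ == "__main__":
    # Usage: python3 digests.py /lib/firmware/iwlwifi-cc-a0-77.ucode
    for firmware_path in sys.argv[1:]:
        md5_hex, sha256_hex = file_digests(firmware_path)
        print(md5_hex, firmware_path)
        print(sha256_hex, firmware_path)
```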
git bisect led me to this commit as part of kernel v6.10:
$ LANG= git bisect log
git bisect start
# status: waiting for both good and bad commits
# bad: [0c3836482481200ead7b416ca80c68a29cfdaabd] Linux 6.10
git bisect bad 0c3836482481200ead7b416ca80c68a29cfdaabd
# status: waiting for good commit(s), bad commit known
# good: [a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6] Linux 6.9
git bisect good a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6
# good: [33e02dc69afbd8f1b85a51d74d72f139ba4ca623] Merge tag 'sound-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect good 33e02dc69afbd8f1b85a51d74d72f139ba4ca623
# good: [29c73fc794c83505066ee6db893b2a83ac5fac63] Merge tag 'perf-tools-for-v6.10-1-2024-05-21' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
git bisect good 29c73fc794c83505066ee6db893b2a83ac5fac63
# good: [e159d63e6940a2a16bb73616d8c528e93b84a6bb] Merge tag 'kvm-riscv-fixes-6.10-2' of https://github.com/kvm-riscv/linux into HEAD
git bisect good e159d63e6940a2a16bb73616d8c528e93b84a6bb
# good: [d1505b5cd0426bbddbbc99f10e3ae0b52aaa1d1f] Merge tag 'powerpc-6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
git bisect good d1505b5cd0426bbddbbc99f10e3ae0b52aaa1d1f
# good: [4a0929b0062a6b04207a414be9be97eb22965bc1] Merge tag 'media/v6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect good 4a0929b0062a6b04207a414be9be97eb22965bc1
# bad: [ef2b7eb55e10294f4f384f21506ef20a6184128c] Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
git bisect bad ef2b7eb55e10294f4f384f21506ef20a6184128c
# good: [968460731f95be9977bc59a513acbc5afc71117d] Merge tag 'gpio-fixes-for-v6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
git bisect good 968460731f95be9977bc59a513acbc5afc71117d
# good: [5a4bd506ddad75f1f2711cfbcf7551a5504e3f1e] Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
git bisect good 5a4bd506ddad75f1f2711cfbcf7551a5504e3f1e
# bad: [a19ea421490dcc45c9f78145bb2703ac5d373b28] Merge tag 'platform-drivers-x86-v6.10-6' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
git bisect bad a19ea421490dcc45c9f78145bb2703ac5d373b28
# good: [34afb82a3c67f869267a26f593b6f8fc6bf35905] Merge tag '6.10-rc6-smb3-server-fixes' of git://git.samba.org/ksmbd
git bisect good 34afb82a3c67f869267a26f593b6f8fc6bf35905
# bad: [d045c46c52740b0d5e92d376f0b7843b0c0d935a] Merge tag 'thermal-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect bad d045c46c52740b0d5e92d376f0b7843b0c0d935a
# bad: [94eacc1c583dd2ba51a2158fb13285f5dc42714b] thermal: core: Fix list sorting in __thermal_zone_device_update()
git bisect bad 94eacc1c583dd2ba51a2158fb13285f5dc42714b
# bad: [a8a261774466d8691e555ea674c193bb1b09edab] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
git bisect bad a8a261774466d8691e555ea674c193bb1b09edab
# good: [aaa18ff54b97706b84306b6613630262706b1f6b] thermal: gov_power_allocator: Return early in manage if trip_max is NULL
git bisect good aaa18ff54b97706b84306b6613630262706b1f6b
# first bad commit: [a8a261774466d8691e555ea674c193bb1b09edab] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
Reverting 202aa0d4bb532338cd27bcc64c60abc2987a2be7 on top of v6.10 avoids
the issue for me.
$ lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers [8086:591f] (rev 05)
00:01.0 PCI bridge [0604]: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) [8086:1901] (rev 05)
00:02.0 VGA compatible controller [0300]: Intel Corporation HD Graphics 630 [8086:5912] (rev 04)
00:14.0 USB controller [0c03]: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller [8086:a12f] (rev 31)
00:14.2 Signal processing controller [1180]: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem [8086:a131] (rev 31)
00:16.0 Communication controller [0780]: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 [8086:a13a] (rev 31)
00:17.0 SATA controller [0106]: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] [8086:a102] (rev 31)
00:1c.0 PCI bridge [0604]: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #5 [8086:a114] (rev f1)
00:1c.6 PCI bridge [0604]: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #7 [8086:a116] (rev f1)
00:1c.7 PCI bridge [0604]: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #8 [8086:a117] (rev f1)
00:1f.0 ISA bridge [0601]: Intel Corporation H110 Chipset LPC/eSPI Controller [8086:a143] (rev 31)
00:1f.2 Memory controller [0580]: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller [8086:a121] (rev 31)
00:1f.3 Audio device [0403]: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller [8086:a170] (rev 31)
00:1f.4 SMBus [0c05]: Intel Corporation 100 Series/C230 Series Chipset Family SMBus [8086:a123] (rev 31)
01:00.0 Non-Volatile memory controller [0108]: SK hynix BC901 NVMe Solid State Drive (DRAM-less) [1c5c:1d59] (rev 03)
02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller [10ec:8125] (rev 05)
03:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 0c)
04:00.0 Network controller [0280]: Intel Corporation Wi-Fi 6 AX200 [8086:2723] (rev 1a)
04:00.0 Network controller: Intel Corporation Wi-Fi 6 AX200 (rev 1a)
Subsystem: Intel Corporation Wi-Fi 6 AX200NGW
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 19
IOMMU group: 12
Region 0: Memory at efb00000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [c8] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [40] Express (v2) Endpoint, IntMsgNum 0
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W TEE-IO-
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr+ NoSnoop+ FLReset-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L1, Exit Latency L1 <8us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x1
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range B, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp- 10BitTagReq- OBFF Via WAKE#, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 16ms to 55ms, TimeoutDis-
AtomicOpsCtl: ReqEn-
IDOReq- IDOCompl- LTR+ EmergencyPowerReductionReq-
10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
LnkCap2: Supported Link Speeds: 2.5-5GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [80] MSI-X: Enable+ Count=16 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CorrIntErr- HeaderOF-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF-
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [14c v1] Latency Tolerance Reporting
Max snoop latency: 3145728ns
Max no snoop latency: 3145728ns
Capabilities: [154 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=30us PortTPowerOnTime=18us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=0us LTR1.2_Threshold=0ns
L1SubCtl2: T_PwrOn=44us
Kernel driver in use: iwlwifi
Kernel modules: iwlwifi
Regards
Stefan Lippers-Hollmann