Re: [bug/6.3-rc4/bisected] WARNING at cooling_device_stats_setup+0xac caused by commit 790930f44289c8209c57461b2db499fcc702e0b3

From: Rafael J. Wysocki
Date: Thu Mar 30 2023 - 06:07:29 EST


On Thu, Mar 30, 2023 at 9:52 AM Mikhail Gavrilov
<mikhail.v.gavrilov@xxxxxxxxx> wrote:
>
> Hi,
> The 6.3-rc4 release brings new warning messages to the log:

Thanks for the report. Please see this patch:

https://patchwork.kernel.org/project/linux-pm/patch/2681615.mvXUDI8C0e@kreacher/

> [ 4.590775] ------------[ cut here ]------------
> [ 4.590783] WARNING: CPU: 2 PID: 1 at
> drivers/thermal/thermal_sysfs.c:879
> cooling_device_stats_setup+0xac/0xc0
> [ 4.590799] Modules linked in:
> [ 4.590806] CPU: 2 PID: 1 Comm: swapper/0 Not tainted
> 6.3.0-rc3-08-790930f44289c8209c57461b2db499fcc702e0b3+ #87
> [ 4.590819] Hardware name: ASUSTeK COMPUTER INC. ROG Strix
> G513QY_G513QY/G513QY, BIOS G513QY.320 09/07/2022
> [ 4.590832] RIP: 0010:cooling_device_stats_setup+0xac/0xc0
> [ 4.590841] Code: ff 48 89 1d 9e 27 9f 01 5b 5d 41 5c c3 cc cc cc
> cc 48 8d bf 60 05 00 00 be ff ff ff ff e8 5c 16 3b 00 85 c0 0f 85 72
> ff ff ff <0f> 0b e9 6b ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90
> 90 90
> [ 4.590863] RSP: 0018:ffffa5a080107c60 EFLAGS: 00010246
> [ 4.590871] RAX: 0000000000000000 RBX: ffff96fc51f6d800 RCX: 0000000000000001
> [ 4.590880] RDX: 0000000000000000 RSI: ffffffffb9a7f591 RDI: ffffffffb9b325ce
> [ 4.590889] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000001
> [ 4.590898] R10: 0000000000000001 R11: 0000000000000001 R12: ffff96fc51f6d800
> [ 4.590907] R13: ffff96fc51f6d818 R14: ffff96fc4b450000 R15: 0000000000000000
> [ 4.590916] FS: 0000000000000000(0000) GS:ffff970b16a00000(0000)
> knlGS:0000000000000000
> [ 4.590927] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 4.590934] CR2: 0000000000000000 CR3: 000000034643c000 CR4: 0000000000750ee0
> [ 4.590944] PKRU: 55555554
> [ 4.590948] Call Trace:
> [ 4.590953] <TASK>
> [ 4.590958] thermal_cooling_device_setup_sysfs+0xe/0x20
> [ 4.590967] __thermal_cooling_device_register.part.0+0x13c/0x3d0
> [ 4.590977] acpi_processor_thermal_init+0x22/0x100
> [ 4.590987] __acpi_processor_start+0x7f/0xf0
> [ 4.590995] acpi_processor_start+0x2c/0x50
> [ 4.591002] really_probe+0x19e/0x3e0
> [ 4.591010] ? __pfx___driver_attach+0x10/0x10
> [ 4.591017] __driver_probe_device+0x78/0x160
> [ 4.591025] driver_probe_device+0x1f/0x90
> [ 4.591032] __driver_attach+0xd2/0x1c0
> [ 4.591039] bus_for_each_dev+0x8b/0xe0
> [ 4.591047] bus_add_driver+0x115/0x210
> [ 4.591055] driver_register+0x55/0x100
> [ 4.591062] ? __pfx_acpi_processor_driver_init+0x10/0x10
> [ 4.591072] acpi_processor_driver_init+0x3b/0xc0
> [ 4.591080] ? __pfx_acpi_processor_driver_init+0x10/0x10
> [ 4.591088] do_one_initcall+0x70/0x290
> [ 4.591101] kernel_init_freeable+0x3c5/0x580
> [ 4.591112] ? __pfx_kernel_init+0x10/0x10
> [ 4.591122] kernel_init+0x16/0x1c0
> [ 4.591128] ret_from_fork+0x2c/0x50
> [ 4.591139] </TASK>
>
> This message appears after each boot.
>
> Bisection blames this commit:
>
> commit 790930f44289c8209c57461b2db499fcc702e0b3
> Author: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> Date: Fri Mar 17 18:01:26 2023 +0100
>
> thermal: core: Introduce thermal_cooling_device_update()
>
> Introduce a core thermal API function, thermal_cooling_device_update(),
> for updating the max_state value for a cooling device and rearranging
> its statistics in sysfs after a possible change of its ->get_max_state()
> callback return value.
>
> That callback is now invoked only once, during cooling device
> registration, to populate the max_state field in the cooling device
> object, so if its return value changes, it needs to be invoked again
> and the new return value needs to be stored as max_state. Moreover,
> the statistics presented in sysfs need to be rearranged in general,
> because there may not be enough room in them to store data for all
> of the possible states (in the case when max_state grows).
>
> The new function takes care of that (and some other minor things
> related to it), but some extra locking and lockdep annotations are
> added in several places too to protect against crashes in the cases
> when the statistics are not present or when a stale max_state value
> might be used by sysfs attributes.
>
> Note that the actual user of the new function will be added separately.
>
> Link: https://lore.kernel.org/linux-pm/53ec1f06f61c984100868926f282647e57ecfb2d.camel@xxxxxxxxx/
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> Tested-by: Zhang Rui <rui.zhang@xxxxxxxxx>
> Reviewed-by: Zhang Rui <rui.zhang@xxxxxxxxx>
>
> drivers/thermal/thermal_core.c | 83 ++++++++++++++++++++++++++++++++++++++++-
> drivers/thermal/thermal_core.h | 2 +
> drivers/thermal/thermal_sysfs.c | 74 +++++++++++++++++++++++++++++++-----
> include/linux/thermal.h | 1 +
> 4 files changed, 150 insertions(+), 10 deletions(-)
>
> All of my PCs turned out to be affected by this issue:
> - CPU: Ryzen 3950X / MB: ROG Strix X570-I
> - CPU: Ryzen 7950X / MB: MPG B650I EDGE WIFI
> - Laptop: ASUS ROG Strix G15 G513QY-HF001 (CPU: 5900HX)
>
> Unfortunately, I couldn't test a revert of this commit, because with
> it reverted the kernel does not build:
>
> drivers/acpi/processor_thermal.c: In function ‘acpi_thermal_cpufreq_init’:
> drivers/acpi/processor_thermal.c:149:17: error: implicit declaration
> of function ‘thermal_cooling_device_update’; did you mean
> ‘thermal_zone_device_update’? [-Werror=implicit-function-declaration]
> 149 | thermal_cooling_device_update(pr->cdev);
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> | thermal_zone_device_update
>
>
> Anyone who wants to see the full kernel log can find it in the attached archive (for the laptop).
>
> --
> Best Regards,
> Mike Gavrilov.
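
To illustrate the point in the quoted changelog that the statistics need to be rearranged when max_state grows, here is a minimal userspace sketch. It is not the kernel code: struct cdev_stats and the stats_*() helpers are hypothetical names, and the layout (one time counter per state plus a square transition table) is an assumption made for the example. It only shows why buffers sized for the old max_state cannot be reused once more states become possible.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names are the kernel's. */
struct cdev_stats {
	unsigned long max_state;           /* highest valid state index */
	unsigned long long *time_in_state; /* max_state + 1 entries */
	unsigned int *trans_table;         /* (max_state + 1)^2 entries */
};

/* Free a statistics object; accepts NULL. */
static void stats_destroy(struct cdev_stats *s)
{
	if (!s)
		return;
	free(s->time_in_state);
	free(s->trans_table);
	free(s);
}

/* Allocate zeroed statistics sized for states 0..max_state. */
static struct cdev_stats *stats_setup(unsigned long max_state)
{
	unsigned long states = max_state + 1;
	struct cdev_stats *s = calloc(1, sizeof(*s));

	if (!s)
		return NULL;
	s->max_state = max_state;
	s->time_in_state = calloc(states, sizeof(*s->time_in_state));
	s->trans_table = calloc(states * states, sizeof(*s->trans_table));
	if (!s->time_in_state || !s->trans_table) {
		stats_destroy(s);
		return NULL;
	}
	return s;
}

/* Number of entries in the transition table. */
static size_t stats_table_entries(const struct cdev_stats *s)
{
	size_t states = s->max_state + 1;

	return states * states;
}

/* Count one transition; a stale max_state would index out of bounds here. */
static void stats_record_transition(struct cdev_stats *s,
				    unsigned long from, unsigned long to)
{
	assert(from <= s->max_state && to <= s->max_state);
	s->trans_table[from * (s->max_state + 1) + to]++;
}

/* Model of an update: drop the old statistics, allocate for the new size. */
static struct cdev_stats *stats_update(struct cdev_stats *old,
				       unsigned long new_max_state)
{
	stats_destroy(old);
	return stats_setup(new_max_state);
}
```

In this sketch, a device set up with max_state 3 gets a 16-entry transition table; after stats_update(s, 7) the table has 64 entries and the previously recorded counts are discarded.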