Re: [BUG v6.3-rc4+] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_sysfs.c:879 cooling_device_stats_setup+0xac/0xc0

From: Wysocki, Rafael J
Date: Thu Mar 30 2023 - 15:08:32 EST


On 3/30/2023 11:57 AM, Zhang, Rui wrote:
Hi, Linus,

On Wed, 2023-03-29 at 15:52 -0700, Linus Torvalds wrote:
On Wed, Mar 29, 2023 at 1:58 PM Steven Rostedt <rostedt@xxxxxxxxxxx>
wrote:
In preparation to adding my patch that checks for some kinds of
bugs in
trace events, I decided to run it on the Linus's latest branch, to
see if
there's any other trace events that may cause issues. But instead I
hit
this unrelated bug. Looks to be triggering an added
lockdep_assert() on
boot up.
So I think that lockdep assert is likely bogus.

It is, sorry about that.

I have a fix queued up, https://patchwork.kernel.org/project/linux-pm/patch/2681615.mvXUDI8C0e@kreacher


It was added in commit 790930f44289 ("thermal: core: Introduce
thermal_cooling_device_update()") but the reason I say it's bogus is
that I don't think it has ever been tested:

static void cooling_device_stats_setup(struct
thermal_cooling_device *cdev)
{
lockdep_assert_held(&cdev->lock); <<<---- line 879
Yeah, so cooling_device_stats_setup() is called from two places:

- thermal_cooling_device_setup_sysfs()

- thermal_cooling_device_stats_reinit()

and that first place is when that cdev is created, before it's
registered anywhere. It's not locked in that case, and yes, the
lockdep_assert_held() will trigger.

As far as I can tell it will always trigger, and this
lockdep_assert()
has thus never been tested with lockdep enabled.

The "stats_reinit" case seems to also be called from only one place
(thermal_cooling_device_update()), and that path does indeed hold the
cdev->lock.

That lockdep could be made happy by having
thermal_cooling_device_setup_sysfs() create that device with the cdev
lock held. I guess that's easy enough, although somewhat annoyingly
there is no "mutex_init_locked()", you have to actually do
"mutex_init()" followed by a "mutex_lock()". And obviously unlock it
after doing the setup_sysfs().

But I question whether the lockdep test should be done at all. I find
it distasteful that it was added with absolutely zero testing.


I just realized why I cannot reproduce this problem on my testbox.

In order to test the original patch with ACPI passive cooling enabled,
I rebuild the kernel with customized DSDT.
This taints the kernel, and clears the debug_locks, thus I didn't get
any lockdep warnings...

So I was happy when I got a Tested-by from Rui on this and I didn't double check.  Turns out this was a mistake.

Cheers,

Rafael