Re: [PATCH v1] thermal: core: Do not fail cdev registration because of invalid initial state

From: Daniel Lezcano
Date: Thu Jun 06 2024 - 11:12:02 EST


On 06/06/2024 16:18, Rafael J. Wysocki wrote:
On Thu, Jun 6, 2024 at 3:42 PM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:

On Thu, Jun 6, 2024 at 3:07 PM Daniel Lezcano <daniel.lezcano@xxxxxxxxxx> wrote:

On 05/06/2024 21:17, Rafael J. Wysocki wrote:
From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>

It is reported that commit 31a0fa0019b0 ("thermal/debugfs: Pass cooling
device state to thermal_debug_cdev_add()") causes the ACPI fan driver
to fail probing on some systems, which turns out to be due to the _FST
control method returning an invalid value until _FSL is first evaluated
for the given fan. If this happens, the .get_cur_state() cooling device
callback returns an error and __thermal_cooling_device_register() fails
as it uses that callback after commit 31a0fa0019b0.

Arguably, _FST should not return an invalid value even if it is
evaluated before _FSL, so this may be regarded as a platform firmware
issue, but at the same time it is not a good enough reason for failing
the cooling device registration where the initial cooling device state
is only needed to initialize a thermal debug facility.

Accordingly, modify __thermal_cooling_device_register() to pass a
negative state value to thermal_debug_cdev_add() instead of failing
if the initial .get_cur_state() callback invocation fails and adjust
the thermal debug code to ignore negative cooling device state values.
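
The behavior described above can be illustrated with a small userspace
sketch. This is not the kernel code from the patch: every name below
(struct mock_cdev, mock_get_cur_state(), mock_debug_cdev_add(),
mock_cdev_register()) is a hypothetical stand-in, and the sketch only
assumes that a failing state read is mapped to a negative value which
the debug side then ignores.

```c
#include <errno.h>

/* Hypothetical stand-in for a cooling device; not the kernel's struct. */
struct mock_cdev {
	int fw_ready;              /* 0 while the firmware reports garbage */
	unsigned long fw_state;    /* state the firmware would report */
	int debug_initialized;     /* set once the debug side has been set up */
	int debug_has_state;       /* 1 if an initial state was recorded */
	unsigned long debug_state; /* recorded initial state, if any */
};

/* Mimics a .get_cur_state() callback that is allowed to fail. */
static int mock_get_cur_state(struct mock_cdev *cdev, unsigned long *state)
{
	if (!cdev->fw_ready)
		return -EINVAL;

	*state = cdev->fw_state;
	return 0;
}

/* Debug setup: a negative state means "unknown" and is not recorded. */
static void mock_debug_cdev_add(struct mock_cdev *cdev, int state)
{
	cdev->debug_initialized = 1;
	cdev->debug_has_state = 0;

	if (state < 0)
		return;

	cdev->debug_has_state = 1;
	cdev->debug_state = (unsigned long)state;
}

/* Registration no longer fails when the initial state cannot be read. */
static int mock_cdev_register(struct mock_cdev *cdev)
{
	unsigned long current_state = 0;
	int ret;

	ret = mock_get_cur_state(cdev, &current_state);

	/* Pass a negative value instead of propagating the error. */
	mock_debug_cdev_add(cdev, ret ? -1 : (int)current_state);

	return 0;
}
```

In this sketch, registering a device whose fw_ready is 0 still succeeds
and leaves debug_has_state at 0, while a device with a valid state has
that state recorded.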

Fixes: 31a0fa0019b0 ("thermal/debugfs: Pass cooling device state to thermal_debug_cdev_add()")
Closes: https://lore.kernel.org/linux-acpi/20240530153727.843378-1-laura.nao@xxxxxxxxxxxxx
Reported-by: Laura Nao <laura.nao@xxxxxxxxxxxxx>
Tested-by: Laura Nao <laura.nao@xxxxxxxxxxxxx>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>

As it is a driver issue, it should be fixed in the driver, not in the
core code. The resulting code logic in the core is trying to deal with
bad driver behavior, which does not really seem appropriate.

Besides, I don't quite agree with dismissing it as a driver issue. If
a driver cannot determine the cooling device state, it should not be
required to make it up.

Because .get_cur_state() is specifically designed to be able to return
an error, the core should be prepared to deal with errors returned by
it and propagating the error is not always the best choice, like in
this particular case.

The core code has been cleaned up from the high friction it had with the
legacy ACPI code. It would be nice to continue in this direction.

This isn't really ACPI specific. Any driver can return an error from
.get_cur_state() if it has a good enough reason.

We are talking about registration time, right? If the driver is
registering too soon, e.g. before the firmware is ready, shouldn't the
fix be to move the cooling device registration to a point where the
driver is sure the firmware has completed its initialization?


Essentially, you are saying that .get_cur_state() should not return an
error even if it gets an utterly invalid value from the platform
firmware.

What value should it return then?

--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs