Re: [PATCH v1] thermal: core: Do not fail cdev registration because of invalid initial state

From: Rafael J. Wysocki
Date: Thu Jun 06 2024 - 10:39:57 EST


On Thu, Jun 6, 2024 at 3:42 PM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
>
> On Thu, Jun 6, 2024 at 3:07 PM Daniel Lezcano <daniel.lezcano@xxxxxxxxxx> wrote:
> >
> > On 05/06/2024 21:17, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> > >
> > > It is reported that commit 31a0fa0019b0 ("thermal/debugfs: Pass cooling
> > > device state to thermal_debug_cdev_add()") causes the ACPI fan driver
> > > to fail probing on some systems which turns out to be due to the _FST
> > > control method returning an invalid value until _FSL is first evaluated
> > > for the given fan. If this happens, the .get_cur_state() cooling device
> > > callback returns an error and __thermal_cooling_device_register() fails
> > > as it uses that callback after commit 31a0fa0019b0.
> > >
> > > Arguably, _FST should not return an invalid value even if it is
> > > evaluated before _FSL, so this may be regarded as a platform firmware
> > > issue, but at the same time it is not a good enough reason for failing
> > > the cooling device registration where the initial cooling device state
> > > is only needed to initialize a thermal debug facility.
> > >
> > > Accordingly, modify __thermal_cooling_device_register() to pass a
> > > negative state value to thermal_debug_cdev_add() instead of failing
> > > if the initial .get_cur_state() callback invocation fails and adjust
> > > the thermal debug code to ignore negative cooling device state values.
> > >
> > > Fixes: 31a0fa0019b0 ("thermal/debugfs: Pass cooling device state to thermal_debug_cdev_add()")
> > > Closes: https://lore.kernel.org/linux-acpi/20240530153727.843378-1-laura.nao@xxxxxxxxxxxxx
> > > Reported-by: Laura Nao <laura.nao@xxxxxxxxxxxxx>
> > > Tested-by: Laura Nao <laura.nao@xxxxxxxxxxxxx>
> > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> >
> > As it is a driver issue, it should be fixed in the driver, not in the
> > core code. The resulting code logic in the core is trying to deal with
> > bad driver behavior, which does not really seem appropriate.

Besides, I don't quite agree with dismissing it as a driver issue. If
a driver cannot determine the cooling device state, it should not be
required to make it up.

Because .get_cur_state() is specifically designed to be able to return
an error, the core should be prepared to deal with errors returned by
it, and propagating the error is not always the best choice, as in
this particular case.
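
To illustrate the point, here is a minimal userspace sketch of the idea
(not the actual patch). All names in it are hypothetical stand-ins and it
assumes, as the changelog describes, that a negative state is treated as
"unknown" by the debug code instead of failing registration:

```c
#include <errno.h>

/* Hypothetical stand-in for a cooling device; not the kernel's struct. */
struct cdev_sketch {
	/* Returns 0 and fills *state on success, or a negative errno. */
	int (*get_cur_state)(struct cdev_sketch *cdev, unsigned long *state);
	/* State recorded by the debug code, -1 if it is unknown. */
	int debug_state;
};

/* Hypothetical debug hook: a negative state means "unknown", ignore it. */
void debug_cdev_add_sketch(struct cdev_sketch *cdev, int state)
{
	cdev->debug_state = state < 0 ? -1 : state;
}

/*
 * Hypothetical registration path: an error from .get_cur_state() is not
 * propagated, because the initial state only feeds the debug facility.
 */
int register_cdev_sketch(struct cdev_sketch *cdev)
{
	unsigned long state;
	int ret;

	ret = cdev->get_cur_state(cdev, &state);
	debug_cdev_add_sketch(cdev, ret ? -1 : (int)state);

	return 0;
}

/* Example callbacks: one that works and one that cannot tell the state. */
int get_state_ok(struct cdev_sketch *cdev, unsigned long *state)
{
	(void)cdev;
	*state = 3;
	return 0;
}

int get_state_broken(struct cdev_sketch *cdev, unsigned long *state)
{
	(void)cdev;
	(void)state;
	return -EIO;	/* arbitrary error code chosen for the sketch */
}
```

With this shape, a device using get_state_broken() still registers and
the debug side simply has no initial state for it, whereas a device using
get_state_ok() gets its initial state recorded as before.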

> > The core code has been cleaned up from the high friction it had with the
> > legacy ACPI code. It would be nice to continue in this direction.

This isn't really ACPI specific. Any driver can return an error from
.get_cur_state() if it has a good enough reason.

> Essentially, you are saying that .get_cur_state() should not return an
> error even if it gets an utterly invalid value from the platform
> firmware.
>
> What value should it return then?