Re: [REGRESSION] probe with driver acpi-fan failed with error -22

From: Rafael J. Wysocki
Date: Mon Jun 03 2024 - 14:20:31 EST


Hi,

On Fri, May 31, 2024 at 1:35 PM Laura Nao <laura.nao@xxxxxxxxxxxxx> wrote:
>
> Hello,
>
> On 5/30/24 17:37, Laura Nao wrote:
> > Hello,
> >
> > We have identified a regression in the acpi-fan driver probe between
> > v6.9-rc7 and v6.10-rc1 on some Intel Chromebooks in the Collabora LAVA
> > lab.
> >
> > For the Acer Chromebook Spin 514 (CP514-2H), the following error is
> > reported in the logs:
> >
> > [ 0.651202] acpi-fan INTC1044:00: probe with driver acpi-fan failed with error -22
> >
> > Similar errors are reported on other devices with fans compatible with
> > the same driver.
> >
> > On Acer Chromebox CXI4, ASUS Chromebook Flip C436FA and
> > HP Chromebook x360 14 G1:
> >
> > [ 0.488001] acpi-fan INT3404:00: probe with driver acpi-fan failed with error -22
> >
> > On ASUS Chromebook Vero 514 CBV514-1H:
> >
> > [ 1.168905] acpi-fan INTC1048:00: probe with driver acpi-fan failed with error -22
> >
> > The issue is still present on next-20240529.
> >
> > I'm sending this report to track the regression while a fix is
> > identified. I'll investigate the issue/run a bisection and report back
> > with the results.
> >
> > This regression was discovered during some preliminary tests with the
> > ACPI probe kselftest [1] in KernelCI. The config used was the upstream
> > x86_64 defconfig with a fragment applied on top [2].
> >
> > Best,
> >
> > Laura
> >
> > [1] https://lore.kernel.org/all/20240308144933.337107-1-laura.nao@xxxxxxxxxxxxx/
> > [2] https://pastebin.com/raw/0tFM0Zyg
> >
> > #regzbot introduced: v6.9-rc7..v6.10-rc1
>
> The issue started happening after:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/thermal/thermal_core.c?h=v6.10-rc1&id=31a0fa0019b022024cc082ae292951a596b06f8c
>
> Before this commit, get_cur_state() was not called by
> __thermal_cooling_device_register, so the error was not triggered.
>
> After enabling debugging for the acpi-fan driver, I noticed these errors
> in the logs:
>
> [ 0.682224] acpi INTC1044:00: Invalid control value returned
> [ 0.682635] acpi INTC1044:00: Invalid control value returned
>
> The value stored in fst.control is 255, which is indeed not a valid
> value.
>
> I suspect this might be a firmware issue that is now manifesting due to
> the addition of the extra get_cur_state() call.
>
> I'll dig a bit more and report back.

It looks like _FST returns all ones if it is evaluated before _FSL on
the affected platforms.

It shouldn't do that, but then it is not particularly useful to fail
cdev registration for this reason.

The attached patch should work around this issue, please try it and report back.
---
drivers/thermal/thermal_core.c | 11 +++++++----
drivers/thermal/thermal_debugfs.c | 7 ++++++-
2 files changed, 13 insertions(+), 5 deletions(-)

Index: linux-pm/drivers/thermal/thermal_core.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.c
+++ linux-pm/drivers/thermal/thermal_core.c
@@ -964,7 +964,8 @@ __thermal_cooling_device_register(struct
{
struct thermal_cooling_device *cdev;
struct thermal_zone_device *pos = NULL;
- unsigned long current_state;
+ unsigned long val;
+ int current_state;
int id, ret;

if (!ops || !ops->get_max_state || !ops->get_cur_state ||
@@ -1002,9 +1003,11 @@ __thermal_cooling_device_register(struct
if (ret)
goto out_cdev_type;

- ret = cdev->ops->get_cur_state(cdev, &current_state);
- if (ret)
- goto out_cdev_type;
+ ret = cdev->ops->get_cur_state(cdev, &val);
+ if (val >= 0 && val <= INT_MAX)
+ current_state = val;
+ else
+ current_state = -1;

thermal_cooling_device_setup_sysfs(cdev);

Index: linux-pm/drivers/thermal/thermal_debugfs.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_debugfs.c
+++ linux-pm/drivers/thermal/thermal_debugfs.c
@@ -421,6 +421,8 @@ void thermal_debug_cdev_state_update(con
cdev_dbg = &thermal_dbg->cdev_dbg;

old_state = cdev_dbg->current_state;
+ if (old_state < 0)
+ goto unlock;

/*
* Get the old state information in the durations list. If
@@ -463,6 +465,7 @@ void thermal_debug_cdev_state_update(con

cdev_dbg->total++;

+unlock:
mutex_unlock(&thermal_dbg->lock);
}

@@ -499,7 +502,9 @@ void thermal_debug_cdev_add(struct therm
* duration will be printed by cdev_dt_seq_show() as expected if it
* runs before the first state transition.
*/
- thermal_debugfs_cdev_record_get(thermal_dbg, cdev_dbg->durations, state);
+ if (state >= 0)
+ thermal_debugfs_cdev_record_get(thermal_dbg, cdev_dbg->durations,
+ state);

debugfs_create_file("trans_table", 0400, thermal_dbg->d_top,
thermal_dbg, &tt_fops);