Re: [PATCH v1 07/17] thermal/hwmon: Use the thermal API instead tampering the internals

From: Guenter Roeck
Date: Mon Feb 20 2023 - 12:12:17 EST


On Mon, Feb 20, 2023 at 04:39:48PM +0100, Daniel Lezcano wrote:
> On 20/02/2023 15:11, Guenter Roeck wrote:
> > On Mon, Feb 20, 2023 at 02:34:08PM +0100, Daniel Lezcano wrote:
> > > Hi Guenter,
> > >
> > > my script should have Cc'ed you but it didn't, so just a heads up this patch
> > > ;)
> > >
> > > On 19/02/2023 15:36, Daniel Lezcano wrote:
> > > > In this function, there is a guarantee the thermal zone is registered.
> > > >
> > > > The sysfs hwmon unregistering will be blocked until we exit the
> > > > function. The thermal zone is unregistered after the sysfs hwmon is
> > > > unregistered.
> > > >
> > > > When we are in this function, the thermal zone is registered.
> > > >
> > > > We can call the thermal_zone_get_crit_temp() function safely and let
> > > > the function use the lock which is private the thermal core code.
> > > >
> >
> > Hmm, if you say so. That very same call used to cause a crash in
> > Chromebooks, which is why I had added the locking.
>
> Mmh, I see. I guess we can assume thermal_hwmon is part of the core code and
> remove this change.
>

Yes. Anyway, the sequence of events was roughly as follows.

- thermal zone is device is registered
- hwmon device is registered
- userspace is triggered and starts reading device attributes
- while userspace has a hwmon attribute open, thermal device is unregistered
- hwmon device is unregistered (sysfs attribute is still open)
- hwmon device attribute function is called
- Since thermal device ops have been released after the thermal device
was unregistered, trying to call an ops callback fails.

That doesn't normally happen, but the Intel wireless driver has the habit
of registering a thermal zone early in its probe function, only to unregister
it immediately afterwards if the probe function fails. If some userspace
activity is triggered by the hwmon device registration, the thermal and
hwmon device removal may be timed such that the hwmon devive is removed
while one (or more) of its attribute files are still open. Normally that
doesn't matter, but it is fatal here since the ops callbacks are not owned
by the hwmon device but by the thermal device.

Essentially every ops callback has this problem.
thermal_zone_get_temp() had it as well, also associated with
a hwmon sysfs attribute read operation. See commit 1c6b30060777
("thermal/core: Ensure that thermal device is registered in
thermal_zone_get_temp").

If you don't want non-thermal code to access ->ops directly, the thermal
code would have to provide protected accessor functions, similar to
thermal_zone_get_temp().

Thanks,
Guenter

>
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
>
> Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
>