Re: [PATCH] thermal/core: Correctly free tz->tzp in thermal zone registration error path

From: Rafael J. Wysocki
Date: Tue Dec 19 2023 - 10:28:30 EST


On Tue, Dec 19, 2023 at 9:27 AM Chen-Yu Tsai <wenst@xxxxxxxxxxxx> wrote:
>
> After commit 3d439b1a2ad3 ("thermal/core: Alloc-copy-free the thermal
> zone parameters structure"), the core now copies the thermal zone
> parameters structure, and frees it if an error happens during thermal
> zone device registration, or upon unregistration of the device.
>
> In the error path, if device_register() was called, put_device() drops
> the last reference to `tz` and the code then sets `tz` to NULL before
> kfree(tz->tzp) runs, causing a NULL pointer dereference crash.
>
> In my case, the error path was entered from the sbs power supply driver,
> which, through the power supply core, registers a thermal zone *without
> trip points* for the battery temperature sensor. This, combined with
> setting the default thermal governor to "power allocator", which
> *requires* trip_max, causes the thermal zone registration to error out.
>
> The error path should handle two cases: one where device_register()
> has not been called yet and the kobj is not reference counted, and one
> where it has. The original commit covers the first case but fails for
> the second. Fix this by adding kfree(tz->tzp) before put_device() to
> cover the second case, and by checking that `tz` is still non-NULL
> before the kfree(tz->tzp) at the free_tzp label, so that the second
> case no longer crashes.
>
> Fixes: 3d439b1a2ad3 ("thermal/core: Alloc-copy-free the thermal zone parameters structure")
> Signed-off-by: Chen-Yu Tsai <wenst@xxxxxxxxxxxx>
> ---
> This includes the minimal changes to fix the crash. I suppose some other
> things in the thermal core could be reworked:
> - Don't use "power allocator" for thermal zones without trip points
> - Move some of the thermal zone cleanup code into the release function
>
> drivers/thermal/thermal_core.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 2415dc50c31d..e47826d82062 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -1392,12 +1392,16 @@ thermal_zone_device_register_with_trips(const char *type, struct thermal_trip *t
>  unregister:
>  	device_del(&tz->device);
>  release_device:
> +	/* Free tz->tzp before tz goes away. */
> +	kfree(tz->tzp);
>  	put_device(&tz->device);
>  	tz = NULL;
>  remove_id:
>  	ida_free(&thermal_tz_ida, id);
>  free_tzp:
> -	kfree(tz->tzp);
> +	/* If we arrived here before device_register() was called. */
> +	if (tz)
> +		kfree(tz->tzp);
>  free_tz:
>  	kfree(tz);
>  	return ERR_PTR(result);
> --

Can you please test linux-next from today? The issue addressed by
your patch should be fixed there.