Re: [PATCH v2] thermal: core: Address thermal zone removal races with resume

From: Mauricio Faria de Oliveira

Date: Fri Mar 27 2026 - 10:55:46 EST


On 2026-03-27 06:49, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
>
> Since thermal_zone_pm_complete() and thermal_zone_device_resume()
> re-initialize the poll_queue delayed work for the given thermal zone,
> the cancel_delayed_work_sync() in thermal_zone_device_unregister()
> may miss some already running work items and the thermal zone may
> be freed prematurely [1].
>
> There are two failing scenarios that both start with
> running thermal_pm_notify_complete() right before invoking
> thermal_zone_device_unregister() for one of the thermal zones.
>
> In the first scenario, there is a work item already running for
> the given thermal zone when thermal_pm_notify_complete() calls
> thermal_zone_pm_complete() for that thermal zone and it continues to
> run when thermal_zone_device_unregister() starts. Since the poll_queue
> delayed work has been re-initialized by thermal_pm_notify_complete(), the
> running work item will be missed by the cancel_delayed_work_sync() in
> thermal_zone_device_unregister() and if it continues to run past the
> freeing of the thermal zone object, a use-after-free will occur.
>
> In the second scenario, thermal_zone_device_resume() queued up by
> thermal_pm_notify_complete() runs right after the thermal_zone_exit()
> called by thermal_zone_device_unregister() has returned. The poll_queue
> delayed work is re-initialized by it before cancel_delayed_work_sync() is
> called by thermal_zone_device_unregister(), so it may continue to run
> after the freeing of the thermal zone object, which also leads to a
> use-after-free.
>
> Address the first failing scenario by ensuring that no thermal work
> items will be running when thermal_pm_notify_complete() is called.
> For this purpose, first move the cancel_delayed_work() call from
> thermal_zone_pm_complete() to thermal_zone_pm_prepare() to prevent
> new work from entering the workqueue going forward. Next, switch
> over to using a dedicated workqueue for thermal events and update
> the code in thermal_pm_notify() to flush that workqueue after
> thermal_pm_notify_prepare() has returned which will take care of
> all leftover thermal work already on the workqueue (that leftover
> work would do nothing useful anyway because all of the thermal zones
> have been flagged as suspended).
>
> The second failing scenario is addressed by adding a tz->state check
> to thermal_zone_device_resume() to prevent it from re-initializing
> the poll_queue delayed work if the thermal zone is going away.
>
> Note that the above changes will also facilitate relocating the suspend
> and resume of thermal zones closer to the suspend and resume of devices,
> respectively.
>
> Fixes: 5a5efdaffda5 ("thermal: core: Resume thermal zones asynchronously")
> Reported-by: Mauricio Faria de Oliveira <mfo@xxxxxxxxxx>
> Closes: https://lore.kernel.org/linux-pm/20260324-thermal-core-uaf-init_delayed_work-v1-1-6611ae76a8a1@xxxxxxxxxx/ [1]
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> ---

I spent some hours reviewing and thinking of corner cases for this
patch,
and found the code to correctly handle all cases I thought of.

I also tested it with the synthetic reproducers for the 2 cases reported
in [1], and it prevented the errors.

Please feel free to add:

Reviewed-by: Mauricio Faria de Oliveira <mfo@xxxxxxxxxx>
Tested-by: Mauricio Faria de Oliveira <mfo@xxxxxxxxxx>

And please carry these from [1], if that's OK with you:

Reported-by: syzbot+3b3852c6031d0f30dfaf@xxxxxxxxxxxxxxxxxxxxxxxxx
Closes: https://syzbot.org/bug?extid=3b3852c6031d0f30dfaf

Thanks,
Mauricio

>
> v1 -> v2: Return -ENOMEM from thermal_init() when the workqueue allocation
> fails (Sashiko)
>
> Lukasz, Daniel, I'm quite confident about this patch, but I would appreciate
> your feedback.
>
> ---
> drivers/thermal/thermal_core.c | 31 ++++++++++++++++++++++++++-----
> 1 file changed, 26 insertions(+), 5 deletions(-)
>
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -42,6 +42,8 @@ static struct thermal_governor *def_gove
>
> static bool thermal_pm_suspended;
>
> +static struct workqueue_struct *thermal_wq __ro_after_init;
> +
> /*
> * Governor section: set of functions to handle thermal governors
> *
> @@ -314,7 +316,7 @@ static void thermal_zone_device_set_poll
> if (delay > HZ)
> delay = round_jiffies_relative(delay);
>
> - mod_delayed_work(system_freezable_power_efficient_wq, &tz->poll_queue, delay);
> + mod_delayed_work(thermal_wq, &tz->poll_queue, delay);
> }
>
> static void thermal_zone_recheck(struct thermal_zone_device *tz, int error)
> @@ -1795,6 +1797,10 @@ static void thermal_zone_device_resume(s
>
> guard(thermal_zone)(tz);
>
> + /* If the thermal zone is going away, there's nothing to do. */
> + if (tz->state & TZ_STATE_FLAG_EXIT)
> + return;
> +
> tz->state &= ~(TZ_STATE_FLAG_SUSPENDED | TZ_STATE_FLAG_RESUMING);
>
> thermal_debug_tz_resume(tz);
> @@ -1825,6 +1831,9 @@ static void thermal_zone_pm_prepare(stru
> }
>
> tz->state |= TZ_STATE_FLAG_SUSPENDED;
> +
> + /* Prevent new work from getting to the workqueue subsequently. */
> + cancel_delayed_work(&tz->poll_queue);
> }
>
> static void thermal_pm_notify_prepare(void)
> @@ -1843,8 +1852,6 @@ static void thermal_zone_pm_complete(str
> {
> guard(thermal_zone)(tz);
>
> - cancel_delayed_work(&tz->poll_queue);
> -
> reinit_completion(&tz->resume);
> tz->state |= TZ_STATE_FLAG_RESUMING;
>
> @@ -1854,7 +1861,7 @@ static void thermal_zone_pm_complete(str
> */
> INIT_DELAYED_WORK(&tz->poll_queue, thermal_zone_device_resume);
> /* Queue up the work without a delay. */
> - mod_delayed_work(system_freezable_power_efficient_wq, &tz->poll_queue, 0);
> + mod_delayed_work(thermal_wq, &tz->poll_queue, 0);
> }
>
> static void thermal_pm_notify_complete(void)
> @@ -1877,6 +1884,11 @@ static int thermal_pm_notify(struct noti
> case PM_RESTORE_PREPARE:
> case PM_SUSPEND_PREPARE:
> thermal_pm_notify_prepare();
> + /*
> + * Allow any leftover thermal work items already on the
> + * worqueue to complete so they don't get in the way later.
> + */
> + flush_workqueue(thermal_wq);
> break;
> case PM_POST_HIBERNATION:
> case PM_POST_RESTORE:
> @@ -1909,9 +1921,16 @@ static int __init thermal_init(void)
> if (result)
> goto error;
>
> + thermal_wq = alloc_workqueue("thermal_events",
> + WQ_FREEZABLE | WQ_POWER_EFFICIENT | WQ_PERCPU, 0);
> + if (!thermal_wq) {
> + result = -ENOMEM;
> + goto unregister_netlink;
> + }
> +
> result = thermal_register_governors();
> if (result)
> - goto unregister_netlink;
> + goto destroy_workqueue;
>
> thermal_class = kzalloc_obj(*thermal_class);
> if (!thermal_class) {
> @@ -1938,6 +1957,8 @@ static int __init thermal_init(void)
>
> unregister_governors:
> thermal_unregister_governors();
> +destroy_workqueue:
> + destroy_workqueue(thermal_wq);
> unregister_netlink:
> thermal_netlink_exit();
> error:

--
Mauricio