Re: [PATCH] thermal: core: fix use-after-free due to init/cancel delayed_work race

From: Rafael J. Wysocki

Date: Wed Mar 25 2026 - 12:34:55 EST


On Wed, Mar 25, 2026 at 4:13 PM Mauricio Faria de Oliveira
<mfo@xxxxxxxxxx> wrote:
>
> On 2026-03-25 11:28, Mauricio Faria de Oliveira wrote:
> > On 2026-03-25 11:17, Mauricio Faria de Oliveira wrote:
> >> Thanks for looking into this.
> >>
> >> On 2026-03-25 09:47, Rafael J. Wysocki wrote:
> >>> I can see the one between thermal_zone_device_unregister() and
> >>> thermal_zone_device_resume(), but that can be addressed by adding a
> >>> TZ_STATE_FLAG_EXIT check to the latter AFAICS.
> >>
> >
> > Please disregard this paragraph; I incorrectly read/wrote _resume()
> > as thermal_zone_pm_complete() discussed above. The rest should be
> > right. I'll review this and get back shortly.
> >
> >> In the example described above and detailed below, apparently that
> >> is not sufficient, unless I'm missing something. See, if _resume()
> >> is reached with thermal_list_lock held, thermal_zone_device_exit()
> >> is waiting for thermal_list_lock before setting TZ_STATE_FLAG_EXIT,
> >> thus a check for it in _resume() would still find it clear.
>
> Ok, similarly:
>
> Say, thermal_pm_notify() -> thermal_pm_notify_complete() ->
> thermal_zone_pm_complete() run before thermal_zone_device_unregister()
> is called; thermal_zone_device_resume() starts, and by then
> thermal_zone_device_unregister() has been called.
>
> If thermal_zone_device_resume() wins the race over thermal_zone_exit()
> for guard(thermal_zone)(tz) (tz->lock), it sees TZ_STATE_FLAG_EXIT clear;
> note its callees (e.g., thermal_zone_device_init()) run with tz->lock
> held, so they see it clear as well.
>
> So, thermal_zone_device_init() calls INIT_DELAYED_WORK(), everything
> returns, tz->lock is released and the thermal_zone_device_unregister()
> -> thermal_zone_exit() path can continue to run.
>
> Only now thermal_zone_exit() sets TZ_STATE_FLAG_EXIT (too late) and
> returns. cancel_delayed_work_sync() does not wait for
> thermal_zone_device_resume() due to INIT_DELAYED_WORK() in
> thermal_zone_device_init(); and then kfree(tz) runs.
>
> Then, thermal_zone_device_resume() accesses tz and hits use-after-free.
>
> Hope this clarifies. Please let me know your thoughts. Thanks!

Thanks for the analysis, it sounds accurate.

I'd say that thermal_zone_device_unregister() needs to flush the
workqueue before calling cancel_delayed_work_sync() to get rid of the
stuff that may be running out of it that hasn't seen the changes made
by thermal_zone_exit().

This should take care of all of the existing races because if anything
is running out of the workqueue when thermal_zone_device_unregister()
runs, it will be waited for after calling thermal_zone_exit() and any
leftover stuff will be caught by cancel_delayed_work_sync().
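
To illustrate the ordering argument, here is a minimal userspace sketch,
not kernel code and not a proposed patch. It is a deterministic,
single-threaded toy model of the sequence described above. All toy_*
names are hypothetical, and the model assumes (as in the analysis) that
re-initializing the work item makes the cancel step lose track of a
handler that is still running, while a flush waits for it regardless.

```c
#include <stdbool.h>

/* Toy stand-in for a delayed work item; NOT the kernel's struct delayed_work. */
struct toy_work {
	bool running;	/* a handler is currently executing */
	bool tracked;	/* the item still "knows" about that handler */
};

/* Toy stand-in for a thermal zone (hypothetical, for illustration only). */
struct toy_zone {
	struct toy_work poll;
	bool exit_flag;	/* models TZ_STATE_FLAG_EXIT */
	bool freed;	/* models kfree(tz) having run */
	bool uaf;	/* set if the handler touches the zone after free */
};

/* The resume handler starts, sees the exit flag clear, re-inits the work. */
static void toy_resume_begin(struct toy_zone *z)
{
	z->poll.running = true;
	z->poll.tracked = true;
	if (!z->exit_flag)
		z->poll.tracked = false;	/* models INIT_DELAYED_WORK() */
}

/* The resume handler touches the zone one last time and finishes. */
static void toy_resume_end(struct toy_zone *z)
{
	if (z->freed)
		z->uaf = true;
	z->poll.running = false;
}

/* Models a workqueue flush: waits for whatever is running. */
static void toy_flush(struct toy_zone *z)
{
	if (z->poll.running)
		toy_resume_end(z);
}

/* Models the cancel step: waits only for a handler the item still tracks. */
static void toy_cancel_sync(struct toy_zone *z)
{
	if (z->poll.running && z->poll.tracked)
		toy_resume_end(z);
}

/* Replays the race; returns true if the model hits a use-after-free. */
bool toy_unregister_hits_uaf(bool flush_first)
{
	struct toy_zone z = { .exit_flag = false };

	toy_resume_begin(&z);		/* resume wins the race for the lock */
	z.exit_flag = true;		/* models thermal_zone_exit(), too late */
	if (flush_first)
		toy_flush(&z);		/* the extra step suggested above */
	toy_cancel_sync(&z);
	z.freed = true;			/* models kfree(tz) */
	if (z.poll.running)
		toy_resume_end(&z);	/* late handler now touches freed zone */
	return z.uaf;
}
```

In this model, toy_unregister_hits_uaf(false) reports the use-after-free
and toy_unregister_hits_uaf(true) does not, because the flush waits for
the handler that the cancel step can no longer see.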

Of course, it's better to switch over to using a dedicated workqueue
in the thermal core for that.