RE: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()
From: Luck, Tony
Date: Wed May 06 2026 - 19:15:15 EST
> >> Unrelated to this question, but it may be worth mentioning in the fix that this work focuses
> >> on fixing resctrl so that the worker itself does not access freed memory. To complement this it may
> >> be worthwhile to highlight that it is safe for the work_struct itself to be deleted while the
> >> work is running (but blocked on cpus_read_lock()) based on the following comment from
> >> kernel/workqueue.c:process_one_work():
> >> "It is permissible to free the struct work_struct from inside the function that is called
> >> from it ..."
> >
> > Scope increased from just the use-after-free when the domain was deleted. The case
> > for taking the current worker CPU offline doesn't involve a use-after-free. It just results
> > in running the worker on the wrong CPU for one iteration.
> >
> > Deleting the work_struct inside the called function is different from some agent deleting
> > the work_struct while the worker is running.
>
> Right. I interpret this to mean that judging the safety of work_struct removal should consider not
> only the workqueue API itself but also external agents that may access the work_struct after its
> removal. The current fix addresses access to removed work_struct from within worker itself while I
> interpret the workqueue API to guarantee that there will be no access to work_struct during or
> after worker execution. The fix under development thus makes it possible to safely remove the
> domain even if a worker belonging to it is executing and blocked on cpus_read_lock(). Do you
> see any remaining issues here?

OK. I'll add something to the commit message.

I asked my original AI about this fix. It claimed to find problems relating to the kernel using the
work_struct after return from the function. Pasting in the comment you gave me from process_one_work()
about it being OK to free the work_struct made it reconsider and retract that claim.

Another AI (using a copy of the sashiko rules) has found an issue with our reliance on is_percpu_thread().
The problem is the ordering of hotplug callbacks.

resctrl_arch_offline_cpu() runs early because it is in the CPUHP_AP_ONLINE_DYN class. The AI claims
that cpus_write_lock() is released after running this, but before running workqueue_offline_cpu() in the
CPUHP_AP_WORKQUEUE_ONLINE class.

So our worker may obtain cpus_read_lock() while it has not yet lost its is_percpu_thread() status.
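
To make the ordering part concrete, here is a userspace toy model, not kernel code. The enum values,
struct and function names are all made up for illustration; the only thing it takes from the kernel is
that CPUHP_AP_WORKQUEUE_ONLINE sorts before CPUHP_AP_ONLINE_DYN and that offline walks the states from
highest to lowest. It says nothing about when cpus_write_lock() is dropped, which is the part of the
AI's claim that still needs checking.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins: only the relative order mirrors enum cpuhp_state. */
enum model_state {
	MODEL_WORKQUEUE_ONLINE = 1,	/* stands in for CPUHP_AP_WORKQUEUE_ONLINE */
	MODEL_ONLINE_DYN = 2,		/* stands in for CPUHP_AP_ONLINE_DYN */
};

struct model_step {
	enum model_state state;	/* state whose teardown callback just ran */
	bool worker_bound;	/* modelled is_percpu_thread() result afterwards */
};

/*
 * Walk the states from highest to lowest, as CPU offline does, and record
 * whether the modelled worker is still per-CPU bound after each step.
 * Returns the number of steps written to out[].
 */
static size_t model_offline(struct model_step *out, size_t max)
{
	bool bound = true;
	size_t n = 0;

	for (int s = MODEL_ONLINE_DYN; s >= MODEL_WORKQUEUE_ONLINE && n < max; s--) {
		if (s == MODEL_WORKQUEUE_ONLINE)
			bound = false;	/* modelled workqueue_offline_cpu() unbind */
		out[n].state = (enum model_state)s;
		out[n].worker_bound = bound;
		n++;
	}
	return n;
}
```

In this model the first recorded step is the resctrl-class teardown with the worker still bound, which
is the window in which an is_percpu_thread() check could still pass.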
-Tony