Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()

From: Luck, Tony

Date: Mon May 04 2026 - 18:52:31 EST


On Mon, May 04, 2026 at 08:11:48AM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 5/1/26 2:36 PM, Tony Luck wrote:
> > Sashiko noticed[1] a use-after-free in the resctrl worker thread code.
> >
> > resctrl_offline_mon_domain() acquires rdtgroup_mutex and calls
> > cancel_delayed_work() (non-synchronous) on the per-domain mbm_over and
> > cqm_limbo delayed_work items, then calls domain_destroy_l3_mon_state()
> > which frees d->rmid_busy_llc and d->mbm_states[]. After it returns, the
> > caller (e.g. domain_remove_cpu_mon() in arch/x86 or the mpam equivalent)
> > deletes the domain from its list and frees the domain itself.
> >
> > cancel_delayed_work() does not wait for a handler that is already
> > running. mbm_handle_overflow() and cqm_handle_limbo() each acquire
> > rdtgroup_mutex before touching the domain, so a handler that started
> > just before resctrl_offline_mon_domain() runs will block on the mutex.
> > When resctrl_offline_mon_domain() drops the mutex, the handler wakes
> > up with a stale 'd' obtained via container_of() and dereferences memory
> > that has just been freed.
> >
> > Drain the handlers with cancel_delayed_work_sync() so no handler can be
> > running or pending against the domain when its state is freed:
> >
> > - Add an 'offlining' flag to struct rdt_l3_mon_domain. Under
> > rdtgroup_mutex, resctrl_offline_mon_domain() sets it before
> > dropping the mutex; the handlers test it after acquiring the
> > mutex and exit without rescheduling. This guarantees that
> > cancel_delayed_work_sync() does not race with the handler
> > re-arming itself.
> >
> > - Drop cpus_read_lock() from mbm_handle_overflow() and
> > cqm_handle_limbo(). resctrl_offline_mon_domain() can be invoked
> > from a CPU hotplug callback that holds the hotplug write lock;
> > a handler blocked on cpus_read_lock() in that window would
> > deadlock cancel_delayed_work_sync(). The data the handlers
> > examine is protected by rdtgroup_mutex, and
> > schedule_delayed_work_on() copes with a target CPU that is going
> > offline by migrating the work, so the cpus_read_lock() was not
> > required for correctness.
> >
> > - Restructure resctrl_offline_mon_domain() to: set ->offlining and
> > remove the mondata directories under rdtgroup_mutex; drop the
> > mutex; cancel_delayed_work_sync() both handlers; reacquire the
> > mutex to do the final force __check_limbo() and free the
> > per-domain monitor state. The cancel must run with the mutex
> > released because the handlers acquire it. Cancel both handlers
> > unconditionally on the L3 path (subject to the feature being
> > enabled) rather than gating cqm_limbo on has_busy_rmid(): a
> > handler may already be executing __check_limbo() with no busy
> > RMIDs left, and that invocation must be drained before its 'd'
> > is freed.
> >
> > Fixes: 24247aeeabe9 ("x86/intel_rdt/cqm: Improve limbo list processing")
> > Assisted-by: Copilot:claude-opus-4.7
> > Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx>
> > Link: https://sashiko.dev/#/patchset/20260429184858.36423-1-tony.luck%40intel.com [1]
> > ---
> >  include/linux/resctrl.h |  1 +
> >  fs/resctrl/monitor.c    | 18 ++++++++++--------
> >  fs/resctrl/rdtgroup.c   | 38 ++++++++++++++++++++++++++++++++++----
> >  3 files changed, 45 insertions(+), 12 deletions(-)
> >
> > diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> > index 006e57fd7ca5..73f2638b96ad 100644
> > --- a/include/linux/resctrl.h
> > +++ b/include/linux/resctrl.h
> > @@ -203,6 +203,7 @@ struct rdt_l3_mon_domain {
> >  	int mbm_work_cpu;
> >  	int cqm_work_cpu;
> >  	struct mbm_cntr_cfg *cntr_cfg;
> > +	bool offlining;
> >  };
> >
> > /**
> > diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> > index 9fd901c78dc6..e68eec83306e 100644
> > --- a/fs/resctrl/monitor.c
> > +++ b/fs/resctrl/monitor.c
> > @@ -794,11 +794,14 @@ void cqm_handle_limbo(struct work_struct *work)
> >  	unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
> >  	struct rdt_l3_mon_domain *d;
> >
> > -	cpus_read_lock();
> >  	mutex_lock(&rdtgroup_mutex);
> >
> >  	d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);
>
> Since the work always runs on a CPU belonging to the domain, could it be simpler
> to obtain the domain here with get_mon_domain_from_cpu() on the CPU running the
> work, instead of via container_of() on the work embedded in the domain struct?

Is this true? When a CPU is taken offline, Linux picks another CPU to run
any unexpired queued work, and there is no guarantee that the new CPU is in
the same domain. I think a robust solution is going to need a check that the
delayed work handlers are running on a CPU in the right domain. It looks
like the existing code doesn't handle this well.
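
Something along these lines at the top of the handlers, maybe? A completely
untested sketch, and I'm guessing that the domain's CPU mask lives at
hdr.cpu_mask in struct rdt_l3_mon_domain:

	d = container_of(work, struct rdt_l3_mon_domain, cqm_limbo.work);

	/*
	 * Rough sketch: if hotplug migrated this work to a CPU outside the
	 * domain it belongs to, don't read another domain's counters from
	 * here; skip this pass.
	 */
	if (!cpumask_test_cpu(raw_smp_processor_id(), &d->hdr.cpu_mask))
		goto out_unlock;

Whether the handler should then requeue itself on a CPU that is still in
the domain is a separate question.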

>
> This seems to match more closely the pattern used in rdtgroup_mondata_show(), which
> stores the domain ID in its state instead of a pointer to the domain and then uses
> resctrl_find_domain() to look the domain up.
>
> >
> > +	/* If this domain is being deleted, this work no longer needs to run. */
> > +	if (d->offlining)
> > +		goto out_unlock;
> > +

Claude seemed quite confident about removal of cpus_read_lock() in these
functions. Sashiko is confident that this has opened up several new race
conditions:

https://sashiko.dev/#/patchset/20260501213611.25600-1-tony.luck%40intel.com

Maybe we need to add a reference count to the rdt_l3_mon_domain
structure and delay freeing it until the last user is gone?
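
Roughly like this, perhaps? Untested sketch just to show the shape:
l3_mon_domain_release() is a made-up name, it assumes a struct kref is
added to struct rdt_l3_mon_domain and kref_init()ed at domain creation,
and that the arch code hands ownership of the final free to the release
function:

	static void l3_mon_domain_release(struct kref *ref)
	{
		struct rdt_l3_mon_domain *d;

		d = container_of(ref, struct rdt_l3_mon_domain, ref);

		/* Free the per-domain monitor state and the domain itself. */
		domain_destroy_l3_mon_state(d);
		kfree(d);
	}

	/* Whoever queues (or re-arms) mbm_over/cqm_limbo takes a reference: */
	kref_get(&d->ref);
	schedule_delayed_work_on(cpu, &d->cqm_limbo, delay);

	/* The handler drops it when it finishes without re-arming: */
	kref_put(&d->ref, l3_mon_domain_release);

	/* The offline path drops its reference instead of freeing directly: */
	kref_put(&d->ref, l3_mon_domain_release);

Then a handler that is still running when the domain goes offline keeps the
structure alive until it drops its reference, and the last kref_put() does
the actual freeing.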

>
> Reinette

-Tony