Re: [PATCH] fs/resctrl: Fix use-after-free in resctrl_offline_mon_domain()

From: Luck, Tony

Date: Tue May 05 2026 - 19:07:57 EST

On Tue, May 05, 2026 at 02:26:11PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 5/5/26 9:45 AM, Luck, Tony wrote:
> > On Mon, May 04, 2026 at 09:39:40PM -0700, Reinette Chatre wrote:
>
> ...
> > Here's the scenario:
> >
> > There is just one CPU left in a domain. The mbm_over worker is woken on
> > that CPU just as user requests to take that CPU offline (executing on
> > another CPU).
> >
> > There is a race. mbm_handle_overflow() has begun execution, but the
> > offline process has taken cpus_write_lock() so it blocks and sleeps
> > waiting for cpus_read_lock().
> >
> > The offline process calls:
> >
> > resctrl_arch_offline_cpu()
> > -> resctrl_offline_cpu
> > -> cancel_delayed_work(&d->mbm_over); /* not the _sync version */
>
> Here I expect d->mbm_work_cpu to be set to this one CPU that is left in the domain.
>
> > -> mbm_setup_overflow_handler(d, 0, cpu);
> > Finds there are other CPUs in the domain
>
> I do not think that this will find other CPUs in the domain here since
> the scenario starts with there being only one CPU left in the domain and it is
> currently running the overflow handler . From what I can tell, in this scenario,
> mbm_setup_overflow_handler() will return without scheduling the overflow handler
> on any CPU.
>
> After mbm_setup_overflow_handler() returns I expect d->mbm_work_cpu to be set to
> nr_cpus_ids.
>
> This detail does not impact the race you are highlighting though.
>
> > -> domain_remove_cpu()
> > -> domain_remove_cpu_mon()
> > clears bit for this CPU from hdr->cpu_mask and finds mask is empty
> > -> resctrl_offline_mon_domain()
> > -> cancel_delayed_work(&d->mbm_over); /* Again! Still not _sync version */
> > -> domain_destroy_l3_mon_state()
> > -> kfree(d->mbm_states[idx]); /* file system state */
> > -> removes domain from L3 resource domain list
> > -> l3_mon_domain_free()
> > kfree(hw_dom->arch_mbm_states[idx]) /* architecture state */
> > kfree(hw_dom)
> >
> > Offline process continues. One of the things it does is checks for
> > orphans on the run queue of the now deceased CPU and adopts them.
> >
> > Finally the offline process completes and does cpus_write_unlock()
> >
> > Now mbm_handle_overflow() can continue. But it is on the wrong CPU
> > and has a pointer to a work_struct that was freed by the offline
> > process.
>
> Thank you for the explanation, I did not consider this scenario. From what I understand
> folks encountering this scenario should also encounter a splat from the MBM overflow
> handler from all the places where it runs smp_processor_id(). Well, actually, if these
> people are running with CONFIG_DEBUG_PREEMPT enabled because of the
> debug_smp_processor_id()->check_preemption_disabled()->is_percpu_thread() check.
>
> ...
>
> >> I still think that using get_mon_domain_from_cpu() in the workers could work here.
> >> Here is the idea more specifically for the MBM overflow handler:
> >>
> >> diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
> >> index 88f1fa0b9d8d..7186d6d02d6e 100644
> >> --- a/fs/resctrl/monitor.c
> >> +++ b/fs/resctrl/monitor.c
> >> @@ -856,7 +856,9 @@ void mbm_handle_overflow(struct work_struct *work)
> >> goto out_unlock;
> >>
> >> r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
> >> - d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work);
> >> + d = get_mon_domain_from_cpu(smp_processor_id(), r);
> >> + if (!d)
> >> + goto out_unlock;
> >
> > This will get a domain, but it will be for the wrong CPU.
>
> If it was migrated yes. From what I understand when this happens it stops being
> a "per-CPU thread" so how about something like:
>
> mbm_handle_overflow()
> {
>
> ...
> cpus_read_lock();
> mutex_lock(&rdtgroup_mutex);
>
> /*
> * Overflow handler migrated during race while CPU went offline and
> * no longer running on intended CPU.
> */
> if (!is_percpu_thread())
> goto unlock;

This is neat. It works well in the case with the above race on the last
CPU in a domain going offline. But I think there are problems if there
are other CPUs available, and user takes d->mbm_work_cpu or d->cqm_work_cpu
offline.

Here the corner case semantics of repeated calls to schedule_delayed_work_on()
are important. In this case resctrl_offline_cpu() will call:

cancel_delayed_work(&d->mbm_over); // Not sync, so running WQ keeps going
mbm_setup_overflow_handler(d, 0, cpu);
There are other CPUs available in the domain. Picks one:
dom->mbm_work_cpu = cpu;
schedule_delayed_work_on(cpu, &dom->mbm_over, delay);

This sees that work is already running and returns
%false without doing anything.

When offline of the CPU completes, the worker runs, finds it is no
longer a "is_percpu_thread()" and returns without scheduling future
execution. So checking for overflow on this domain is disabled.

> ...
> out_unlock:
> mutex_unlock(&rdtgroup_mutex);
> cpus_read_unlock();
> }
>
>
> I am not clear on whether there are more than one race here now. From the flow
> you describe it seems that when mbm_handle_overflow() runs on its intended CPU then
> it can assume that its associated domain has not been removed and thus that it is
> running with a valid work_struct? More specifically, if is_percpu_thread() is true
> then "d = container_of(work, struct rdt_l3_mon_domain, mbm_over.work)" will work?
> I am trying to match this race involving the CPU hotplug lock with the race described
> in original changelog that involves rdtgroup_mutex here ...
>
> >
> > Maybe it doesn't hurt for it to perform an extra set of mbm_update()
> > calls on that CPU?
>
> The extra mbm_update() calls seem ok. An extra call to limbo handler should be ok also.
> One possible issue is the impact on software controller that assumes it is being
> called once per second. Looks like an extra call can be avoided though?
>
> >
> > It will then do:
> >
> > d->mbm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask,
> > RESCTRL_PICK_ANY_CPU);
> > schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay);
> >
> > This "wrong" domain already has a worker ... Will this just reset the
> > timeout to the new "delay" value? Possibly also to a different CPU?
>
> queue_delayed_work_on()'s comments mention "Return: %false if @work was
> already on a queue" which I interpret as existing (correct) worker not being
> impacted. I am not familiar with these corner cases though.

Yes. Existing worker is not impacted. Which causes the problem described above.

> Reinette

-Tony