Re: [PATCH v3 9/9] fs/resctrl: Fix UAF from worker threads when domains are removed

From: Luck, Tony

Date: Tue May 26 2026 - 14:33:28 EST

On Tue, May 26, 2026 at 10:53:59AM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 5/26/26 8:32 AM, Luck, Tony wrote:
>
> Instead of deleting the patch without any comments, could you please provide some
> insight to what problems it has and why your solution is better?
>
> > On Fri, May 22, 2026 at 12:15:13PM -0700, Reinette Chatre wrote:
> >> - Adding a reference count to the domain structure to avoid the worker
> >> needing to take CPU hotplug lock. This ended up being very complicated
> >> with the architecture needing new APIs to manage the reference count
> >> which cannot cleanly integrate into MPAM since it uses a single
> >> architecture domain structure to contain both the control and monitoring
> >> domain structures. Managing the references across mount, unmount,
> >> online, offline, as well as worker self exit resulted in several
> >> asymmetrical and complicated paths that were error prone. Locking also
> >> proved to be complicated since architecture would need to initiate
> >> domain free that will need to call back into resctrl that will take
> >> rdtgroup_mutex which means that references need to be taken/released
> >> without locking.
> >
> > I'd been working on a reference count approach too. The MPAM combined
> > domain for control and monitoring doesn't seem insurmountable. Mostly
>
> While technically possible I do not think it is a clean solution to have
> the lifetime of the control domain be controlled with a reference in the
> monitoring domain.
>
> > because it seems unlikely that the problem with worker threads would
> > ever apply to control domains. Maybe I missed something, but just adding
> > an architecture *release() function that can be used by file system code
> > to drop reference counts on the domain when worker threads exit seems
> > enough.
>
> Did you consider the locking implications that I mention in the description
> you quoted? More below ...

I didn't run into any lockdep splats during testing. But maybe didn't
have enough code coverage to poke into corner cases.
>
> >
> > My patch below.
>
> heh
>
> >
> > -Tony
> >
> >
> > From 611fd8ad816abd37ef9a65b39175ce05907a1d41 Mon Sep 17 00:00:00 2001
> > From: Tony Luck <tony.luck@xxxxxxxxx>
> > Date: Thu, 21 May 2026 15:14:27 -0700
> > Subject: [PATCH] fs,mpam,x86/resctrl: Track reference count for L3 monitor
> > domains
> >
> > There are race conditions[1] when the last CPU of a domain is taken offline
> > and a worker thread may access the domain structure after it is freed.
> >
> > Add a rdt_l3_mon_domain::kref to track users of the domain. Don't try
> > to cancel worker threads when CPUs are taken offline. Just set the
> > target CPU for the thread to nr_cpu_ids to indicate the worker needs
> > to take action next time it runs.
>
> One test I have found to be useful when digging into this is to offline all CPUs
> of a domain starting with lowest number. Since overflow worker runs on lowest number
> and then is moved to next CPU when it goes offline this stresses this new mechanics.
> Have you tried something similar or could you try this test with this solution?

Yes. My test case ran through all CPUs in the domain in order. That
showed an interesting artifact that when CPU 36 goes offline, the worker
next runs on CPU37, which is the place it needs to be, but it isn't
bound to CPU37. But since d->mbm_work_cpu (or d->cqm_work_cpu) is set
to nr_cpus_id the worker calls schedule_delayed_work_on() to get itself
properly bound to CPU37.
>
> ...
>
> > @@ -680,18 +692,16 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
> >
> > switch (r->rid) {
> > case RDT_RESOURCE_L3: {
> > - struct rdt_hw_l3_mon_domain *hw_dom;
> > struct rdt_l3_mon_domain *d;
> >
> > if (!domain_header_is_valid(hdr, RESCTRL_MON_DOMAIN, RDT_RESOURCE_L3))
> > return;
> >
> > d = container_of(hdr, struct rdt_l3_mon_domain, hdr);
> > - hw_dom = resctrl_to_arch_mon_dom(d);
> > resctrl_offline_mon_domain(r, hdr);
> > list_del_rcu(&hdr->list);
> > synchronize_rcu();
> > - l3_mon_domain_free(hw_dom);
> > + kref_put(&d->kref, resctrl_arch_l3_mon_domain_release);
> > break;
> > }
>
> To me the idea behind a "domain reference count" is to provide guarantee to any holder of
> a reference that the domain *and* its data remains accessible while it holds the reference.
> There is the domain structure itself and then the architecture specific, for example,
> rdt_hw_l3_mon_domain::arch_mbm_states, and the fs state, for example rdt_l3_mon_domain::mbm_states.

Agreed. I'm not guaranteeing that the whole of the domain structure
(with all the substructures that it points to) are valid. This reference
only promises that the mbm_over, cqm_limbo, mbm_work_cpu, and cqm_work_cpu
fields are still valid. That's enough to solve this issue, but if we
were to adopt this patch would need some comments so that future readers
were not led astray.

> Above snippet moves the freeing of the *architecture* state to be called on kref_put()
> while *always* freeing the fs state (resctrl_offline_mon_domain()->domain_destroy_l3_mon_state()).
>
> A worker may thus have a reference to the domain but when it runs it runs without
> fs state which is just a new use-after-free.

For this specific case, the worker sees "nr_cpus_id" and avoids use of
other fields that may no longer be valid.

> As I mentioned in the description the release managed by architecture implies that
> reference needs to be dropped without rdtgroup_mutex held since the architecture
> should also call resctrl_offline_mon_domain() as part of release. Locking needs
> to be reworked and needs to adhere to kref rules on kref_get()/kref_put() without
> locking.

That's all handled in the existing (unchanged) offline path. The worker
threads only need to call kref_put() to release their hold on the domain
structure that has been mostly freed (and likely has no hold from the
file system by this point).

The release functions in both X86 and MPAM don't need locks, they are
just calling kfree()
>
> Reinette

I see that sashiko found no issues in parts 1-5 of your series (Yay!)
Grumbled about some issues in part 6. And then gave up before getting to
your latest solution to the domain offline problem.

Can you repost just part 9 (or maybe parts 7-9) on top of latest -rc to
see if sashiko is happy at last?

-Tony