Re: [PATCH v3 1/9] fs/resctrl: Fix MPAM Partid parsing errors by preserving CDP state during umount

From: Ben Horgan

Date: Fri Mar 20 2026 - 13:13:33 EST

Hi Zeng,

On 3/17/26 13:21, Zeng Heng wrote:
> This patch fixes a pre-existing issue in the resctrl filesystem teardown
> sequence where premature clearing of cdp_enabled could lead to MPAM Partid
> parsing errors.
>
> The closid to partid conversion logic inherently depends on the global
> cdp_enabled state. However, rdt_disable_ctx() clears this flag early in
> the umount path, while free_rmid() operations still reference it afterwards.
> This creates a window where partid parsing operates with inconsistent CDP
> state, potentially causing monitor reads to use the wrong partid mapping.
>
> Additionally, rmid_entry structures left in limbo between mount sessions may
> trigger partid out-of-range errors, leading to MPAM fault
> interrupts and subsequent MPAM disablement.
>
> Reorder rdt_kill_sb() to delay rdt_disable_ctx() until after
> rmdir_all_sub() and resctrl_fs_teardown() complete. This ensures
> all rmid-related operations finish with correct CDP state.
>
> Introduce rdt_flush_limbo() to flush and cancel limbo work before the
> filesystem teardown completes. An alternative approach would be to cancel

The code looks correct but it does introduce a subtle change of behaviour which
may or may not be acceptable. A busy rmid may now be allocated after remount.
Clean rmids were never guaranteed, e.g. when a domain goes offline, but this
weakens the guarantee.
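
To make the behaviour change concrete, here is a rough userspace toy model (not kernel code; every name, the table layout and the threshold are made up for illustration). It assumes that a forced limbo check releases entries regardless of their remaining occupancy, and shows which rmid the next allocation hands out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUM_RMIDS 4
#define TOY_THRESHOLD 100u	/* hypothetical occupancy threshold */

/* Hypothetical per-rmid state; not the kernel's struct rmid_entry. */
struct toy_limbo_rmid {
	unsigned int occupancy;	/* stale cache occupancy still tagged with this rmid */
	bool in_limbo;
	bool is_free;
};

/* Release limbo entries; with force set, release them even if still dirty. */
static void toy_check_limbo(struct toy_limbo_rmid *tbl, size_t n, bool force)
{
	assert(tbl != NULL);

	for (size_t i = 0; i < n; i++) {
		if (!tbl[i].in_limbo)
			continue;
		if (force || tbl[i].occupancy < TOY_THRESHOLD) {
			tbl[i].in_limbo = false;
			tbl[i].is_free = true;
		}
	}
}

/* Hand out the first free rmid, or -1 if none is free. */
static int toy_alloc_rmid(struct toy_limbo_rmid *tbl, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].is_free) {
			tbl[i].is_free = false;
			return (int)i;
		}
	}
	return -1;
}

/* Flush limbo (forced or not), then allocate; returns the rmid handed out. */
int toy_alloc_after_flush(bool force)
{
	struct toy_limbo_rmid tbl[TOY_NUM_RMIDS] = {
		{ .occupancy = 500, .in_limbo = true },	/* still dirty */
		{ .occupancy = 0,   .in_limbo = true },	/* already clean */
		/* remaining entries: in use, neither free nor in limbo */
	};

	toy_check_limbo(tbl, TOY_NUM_RMIDS, force);
	return toy_alloc_rmid(tbl, TOY_NUM_RMIDS);
}
```

In this model toy_alloc_after_flush(false) hands out rmid 1, the clean one, while toy_alloc_after_flush(true) hands out rmid 0, which still carries stale occupancy.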

> limbo work on umount and restart it on remount with a remade bitmap.
> However, this would require substantial changes in the resctrl layer to
> handle CDP state transitions across mount sessions, which is beyond the
> scope of the reqpartid feature work this patchset focuses on. The current

Another option to consider is whether limbo could be replaced by checking whether
an rmid is busy at allocation.
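
Very roughly, and only as a userspace sketch under my own assumptions (the names, the flat table and the single threshold are hypothetical, and toy_read_occupancy() stands in for whatever the real monitor read would be):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BUSY_THRESHOLD 100u	/* hypothetical occupancy threshold */

/* Hypothetical allocator entry; not the kernel's struct rmid_entry. */
struct toy_alloc_entry {
	unsigned int occupancy;	/* what a monitor read would currently report */
	bool in_use;
};

/* Stand-in for a hardware occupancy read. */
static unsigned int toy_read_occupancy(const struct toy_alloc_entry *e)
{
	return e->occupancy;
}

/*
 * Return the first rmid that is neither in use nor still busy,
 * or -1 if every candidate is in use or above the threshold.
 */
int toy_alloc_clean_rmid(struct toy_alloc_entry *tbl, size_t n)
{
	assert(tbl != NULL);

	for (size_t i = 0; i < n; i++) {
		if (tbl[i].in_use)
			continue;
		if (toy_read_occupancy(&tbl[i]) >= TOY_BUSY_THRESHOLD)
			continue;	/* still busy, skip it for now */
		tbl[i].in_use = true;
		return (int)i;
	}
	return -1;
}
```

The cost moves from a background worker to the allocation path, and a real occupancy read would be per domain and may need to run on a CPU in that domain, so this may or may not be a good trade.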

Do your changes here to resctrl_arch_rmid_idx_encode() have an impact on how
limbo works?
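
For reference, this is the kind of dependency I have in mind, as a toy model only. The mapping below (closid unchanged without CDP, closid * 2 for data and closid * 2 + 1 for code with CDP) is an assumption for illustration, as are all the names:

```c
#include <assert.h>
#include <stdbool.h>

enum toy_conf_type { TOY_CDP_NONE, TOY_CDP_CODE, TOY_CDP_DATA };

/* Hypothetical stand-in for the global CDP flag. */
static bool toy_cdp_enabled;

void toy_set_cdp(bool on)
{
	toy_cdp_enabled = on;
}

/* Assumed mapping: the result depends on the global flag at call time. */
unsigned int toy_closid_to_partid(unsigned int closid, enum toy_conf_type type)
{
	/* Toy bound so that closid * 2 + 1 cannot overflow. */
	assert(closid < (1u << 16));

	if (!toy_cdp_enabled)
		return closid;
	return closid * 2 + (type == TOY_CDP_CODE ? 1 : 0);
}
```

With the flag set, closid 3 maps to partid 6 (data) or 7 (code) in this model; once the flag is cleared the same closid maps to partid 3, so anything recorded under one state and converted under the other names a different partid.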

Thanks,

Ben

> fix addresses the immediate correctness issue with minimal churn.
>
> Signed-off-by: Zeng Heng <zengheng4@xxxxxxxxxx>
> ---
> fs/resctrl/rdtgroup.c | 24 ++++++++++++++++++++++--
> 1 file changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index 5da305bd36c9..bc0735eef92a 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -3165,6 +3165,25 @@ static void resctrl_fs_teardown(void)
>  	rdtgroup_destroy_root();
>  }
>
> +static void rdt_flush_limbo(void)
> +{
> +	struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
> +	struct rdt_l3_mon_domain *d;
> +
> +	if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
> +		return;
> +
> +	if (!resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID))
> +		return;
> +
> +	list_for_each_entry(d, &r->mon_domains, hdr.list) {
> +		if (has_busy_rmid(d)) {
> +			__check_limbo(d, true);
> +			cancel_delayed_work(&d->cqm_limbo);
> +		}
> +	}
> +}
> +
>  static void rdt_kill_sb(struct super_block *sb)
>  {
>  	struct rdt_resource *r;
> @@ -3172,13 +3191,14 @@ static void rdt_kill_sb(struct super_block *sb)
>  	cpus_read_lock();
>  	mutex_lock(&rdtgroup_mutex);
>
> -	rdt_disable_ctx();
> -
>  	/* Put everything back to default values. */
>  	for_each_alloc_capable_rdt_resource(r)
>  		resctrl_arch_reset_all_ctrls(r);
>
>  	resctrl_fs_teardown();
> +	rdt_flush_limbo();
> +	rdt_disable_ctx();
> +
>  	if (resctrl_arch_alloc_capable())
>  		resctrl_arch_disable_alloc();
>  	if (resctrl_arch_mon_capable())