Re: [PATCH v8 next 01/10] fs/resctrl: Fix MPAM Partid parsing errors by preserving CDP state during umount
From: Zeng Heng
Date: Wed May 20 2026 - 08:29:02 EST
Hi James,
On 2026/5/15 1:06, James Morse wrote:
Hi Zeng,
I think this should be a separate patch as it's fixing a problem, not adding a feature. It's
not actually relevant to the rest of the series.
The intention behind this fix is that reqPARTID would end up the same as
the original PARTID, because the conversion between RMID and reqPARTID
relies on the `cdp_enabled` variable. Hence, I attempted to also
resolve this existing problem with the patch.
On 13/04/2026 09:53, Zeng Heng wrote:
This patch fixes a pre-existing issue in the resctrl filesystem teardown
sequence where premature clearing of cdp_enabled could lead to MPAM Partid
parsing errors.
resctrl changes need to go via tip, which has a bunch of rules about commit messages,
see Documentation/process/maintainer-tip.rst
You end up with a structure describing the current state, e.g:
| When resctrl is umounted it disables CDP,
what the problem is, e.g:
| CLOSID remain in the limbo list, and the mbm monitors continue to be read
| after umount. MPAM changes the meaning of CLOSID when CDP is enabled/disabled,
| resulting in out of bounds accesses.
Then, what you do about it, here you are:
| Throwing away the limbo list on umount.
(I don't suggest you take this wording - it's just an example)
"this patch" is a phrase to avoid, acronyms like CLOSID need capitalising, etc.
Thanks for the details, I'll rework the commit to follow these
guidelines.
The closid to partid conversion logic inherently depends on the global
cdp_enabled state. However, rdt_disable_ctx() clears this flag early in
the umount path, while free_rmid() operations still reference it afterwards.
This creates a window where partid parsing operates with inconsistent CDP
state, potentially causing monitor reads with the wrong partid mapping.
Additionally, rmid_entry remaining in limbo between mount sessions may
trigger potential partid out-of-range errors, leading to MPAM fault
interrupts and subsequent MPAM disablement.
Can you give more details on this? I assume it's going from CDP-disabled to
enabled, which means MPAM doubles the CLOSID from the stale limbo list, making it
out of range.
Got it, I will explain that.
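As a rough illustration of the out-of-range case described above, here is a
userspace sketch (not the kernel code). The PARTID count, the doubling rule
and every sketch_* name are assumptions made up for the example; the real
conversion lives in the MPAM resctrl glue and may differ in detail.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: number of PARTIDs implemented by the hardware. */
#define SKETCH_HW_NUM_PARTID 8u

/*
 * Assumed mapping: without CDP a CLOSID maps 1:1 to a PARTID; with CDP
 * each CLOSID consumes a pair of PARTIDs (data = 2n, code = 2n + 1).
 */
static uint32_t sketch_closid_to_partid(uint32_t closid, bool cdp_enabled,
					bool is_code)
{
	if (!cdp_enabled)
		return closid;
	return closid * 2 + (is_code ? 1 : 0);
}

/* True if the PARTID can be programmed into the hardware. */
static bool sketch_partid_in_range(uint32_t partid)
{
	return partid < SKETCH_HW_NUM_PARTID;
}

/*
 * A CLOSID that entered the limbo list under one CDP setting is
 * re-interpreted under whatever CDP setting is current when the limbo
 * entry is next examined.
 */
static bool sketch_stale_entry_is_safe(uint32_t stale_closid,
				       bool cdp_enabled_now)
{
	uint32_t partid = sketch_closid_to_partid(stale_closid,
						  cdp_enabled_now, false);

	return sketch_partid_in_range(partid);
}
```

With these assumed numbers, CLOSID 6 is fine while CDP is disabled (PARTID 6),
but if the same limbo entry survives into a CDP-enabled mount it is read as
PARTID 12, which is beyond the 8 PARTIDs of this hypothetical system.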
Reorder rdt_kill_sb() to delay rdt_disable_ctx() until after
rmdir_all_sub() and resctrl_fs_teardown() complete. This ensures
all rmid-related operations finish with correct CDP state.
Introduce rdt_flush_limbo() to flush and cancel limbo work before the
filesystem teardown completes.
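To make the ordering issue concrete, here is a small userspace model of the
two teardown orders. It is only a sketch: the sketch_* functions are
hypothetical stand-ins for rdt_disable_ctx(), free_rmid() and rdt_kill_sb(),
and the doubling rule is an assumed simplification of the real
CLOSID-to-PARTID conversion.

```c
#include <stdbool.h>
#include <stdint.h>

static bool sketch_cdp_enabled;		/* stands in for the global CDP state */
static uint32_t sketch_last_partid;	/* PARTID used by the latest monitor access */

/* Assumed conversion: CDP doubles the CLOSID to get the (data) PARTID. */
static uint32_t sketch_to_partid(uint32_t closid)
{
	return sketch_cdp_enabled ? closid * 2 : closid;
}

/* Stand-in for the free_rmid() path, which resolves a PARTID when it runs. */
static void sketch_free_rmid(uint32_t closid)
{
	sketch_last_partid = sketch_to_partid(closid);
}

/* Stand-in for rdt_disable_ctx(): clears the CDP state. */
static void sketch_disable_ctx(void)
{
	sketch_cdp_enabled = false;
}

/* Old order: CDP state is cleared before the groups are torn down. */
static uint32_t sketch_kill_sb_old(uint32_t closid)
{
	sketch_cdp_enabled = true;	/* filesystem was mounted with CDP */
	sketch_disable_ctx();
	sketch_free_rmid(closid);	/* resolves the PARTID with CDP already off */
	return sketch_last_partid;
}

/* Proposed order: tear down the groups first, then clear the CDP state. */
static uint32_t sketch_kill_sb_new(uint32_t closid)
{
	sketch_cdp_enabled = true;	/* filesystem was mounted with CDP */
	sketch_free_rmid(closid);	/* resolves the PARTID with CDP still on */
	sketch_disable_ctx();
	return sketch_last_partid;
}
```

In this model a group with CLOSID 3 was using PARTID 6 while mounted. The old
order resolves PARTID 3 during teardown (the wrong monitor), while the
reordered version resolves PARTID 6.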
So, discard the state in the hope we don't need it again.
What happens if the filesystem is mounted again quickly afterwards?
Surely we get noisy bandwidth results for ~minutes afterwards?
An alternative approach would be to cancel limbo work on umount
Sounds like a move in the right direction - having bits of resctrl still
taking CPU time when it's not in use is surprising.
I'd love to eventually remove the limbo worker and have the RMID alloc code
search the limbo list for a clean RMID when a control/monitor group is created.
By deferring the work as late as possible, we do less work overall.
and restart it on remount with a rebuilt bitmap.
However, this would require substantial changes in the resctrl layer to
handle CDP state transitions across mount sessions,
This would be necessary if the limbo timer was stopped on umount too.
It also covers cases where you kexec and re-mount resctrl.
I think this is a good idea. I agree it's more work.
which is beyond the
scope of the reqpartid feature work this patchset focuses on.
Was it a mistake to include it in this series then?
The current
fix addresses the immediate correctness issue with minimal churn.
I'm not a fan of papering over problems in resctrl. Could we do it properly
by rebuilding the limbo list at mount time as you suggested above?
I discussed this with Ben earlier, and the approach of rebuilding the bitmap
was actually his proposal:
https://lore.kernel.org/all/b95077d7-c036-4a8f-8e42-8f1dc0288075@xxxxxxx/
Best regards,
Zeng Heng