Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling

From: Ben Horgan

Date: Thu Feb 12 2026 - 08:55:43 EST


Hi Reinette, Tony, Babu,

On Wed, Feb 11, 2026 at 02:22:55PM -0800, Reinette Chatre wrote:
> Hi Ben,
>
> On 2/11/26 8:40 AM, Ben Horgan wrote:
> > On Tue, Feb 10, 2026 at 10:04:48AM -0800, Reinette Chatre wrote:
> >> On 2/10/26 8:17 AM, Reinette Chatre wrote:
> >>> On 1/28/26 9:44 AM, Moger, Babu wrote:
> >>>>
> >>>>
> >>>> On 1/28/2026 11:41 AM, Moger, Babu wrote:
> >>>>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
> >>>>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
> >>>>>> Babu,
> >>>>>>
> >>>>>> I've read a bit more of the code now and I think I understand more.
> >>>>>>
> >>>>>> Some useful additions to your explanation.
> >>>>>>
> >>>>>> 1) Only one CTRL group can be marked as PLZA
> >>>>>
> >>>>> Yes. Correct.
> >>>
> >>> Why limit it to one CTRL_MON group and why not support it for MON groups?
> >>>
> >>> Limiting it to a single CTRL group seems restrictive in a few ways:
> >>> 1) It requires that the "PLZA" group has a dedicated CLOSID. This reduces the
> >>> number of use cases that can be supported. Consider, for example, an existing
> >>> "high priority" resource group and a "low priority" resource group. The user may
> >>> just want to let the tasks in the "low priority" resource group run as "high priority"
> >>> when in CPL0. This of course may depend on what resources are allocated, for example
> >>> cache may need more care, but if, for example, user is only interested in memory
> >>> bandwidth allocation this seems a reasonable use case?
> >>> 2) Similar to what Tony [1] mentioned this does not enable what the hardware is
> >>> capable of in terms of number of different control groups/CLOSID that can be
> >>> assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one CLOSID?
> >>> 3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC similar to
> >>> MSR_IA32_PQR_ASSOC. With this, it should be possible for user space to, for
> >>> example, create a resource group that contains tasks of interest and create
> >>> a monitor group within it that monitors all tasks' bandwidth usage when in CPL0.
> >>> This will give user space better insight into system behavior and from what I can
> >>> tell is supported by the feature but not enabled?
> >>>
> >>>>>
> >>>>>> 2) It can't be the root/default group
> >>>>>
> >>>>> This is something I added to keep the default group in a un-disturbed,
> >>>
> >>> Why was this needed?
> >>>
> >>>>>
> >>>>>> 3) It can't have sub monitor groups
> >>>
> >>> Why not?
> >>>
> >>>>>> 4) It can't be pseudo-locked
> >>>>>
> >>>>> Yes.
> >>>>>
> >>>>>>
> >>>>>> Would a potential use case involve putting *all* tasks into the PLZA group? That
> >>>>>> would avoid any additional context switch overhead as the PLZA MSR would never
> >>>>>> need to change.
> >>>>>
> >>>>> Yes. That can be one use case.
> >>>>>
> >>>>>>
> >>>>>> If that is the case, maybe for the PLZA group we should allow user to
> >>>>>> do:
> >>>>>>
> >>>>>> # echo '*' > tasks
> >>>
> >>> Dedicating a resource group to "PLZA" seems restrictive while also adding many
> >>> complications since this designation makes resource group behave differently and
> >>> thus the files need to get extra "treatments" to handle this "PLZA" designation.

As I commented on another thread, I'm wary of this reuse of existing file types
as they can confuse existing user-space tools.

> >>>
> >>> I am wondering if it will not be simpler to introduce just one new file, for example
> >>> "tasks_cpl0" in both CTRL_MON and MON groups. When user space writes a task ID to the
> >>> file it "enables" PLZA for this task and that group's CLOSID and RMID is the associated
> >>> task's "PLZA" CLOSID and RMID. This gives user space the flexibility to use the same
> >>> resource group to manage user space and kernel space allocations while also supporting
> >>> various monitoring use cases. This still supports the "dedicate a resource group to PLZA"
> >>> use case where user space can create a new resource group with certain allocations but the
> >>> "tasks" file will be empty and "tasks_cpl0" contains the tasks needing to run with
> >>> the resource group's allocations when in CPL0.
> >
> > If there is a "tasks_cpl0" then I'd expect a "cpus_cpl0" too.
>
> That is reasonable, yes.

I think the "tasks_cpl0" approach suffers from one of the same faults as the
"kernel_group" approach. If you want a task to run with closid-A rmid-Y in
userspace but with closid-B and the same rmid-Y in kernel space, then a
monitor group can't exist in resctrl for both combinations.
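
To illustrate, here is a rough toy model, not kernel code. All names are made
up, and it assumes an RMID is owned by exactly one monitor group, which in turn
sits under exactly one CTRL_MON group:

```python
from dataclasses import dataclass, field


@dataclass
class CtrlMonGroup:
    """Toy stand-in for a CTRL_MON group: one CLOSID plus its MON groups' RMIDs."""
    name: str
    closid: int
    mon_rmids: set = field(default_factory=set)


def add_mon_group(groups, parent, rmid):
    """Create a MON group under `parent`; assumes an RMID has a single owner."""
    for g in groups:
        if rmid in g.mon_rmids:
            raise ValueError(f"rmid {rmid} already owned by {g.name}")
    parent.mon_rmids.add(rmid)


group_a = CtrlMonGroup("A", closid=1)  # userspace allocation
group_b = CtrlMonGroup("B", closid=2)  # kernel allocation
groups = [group_a, group_b]
RMID_Y = 7  # arbitrary illustrative value

add_mon_group(groups, group_a, RMID_Y)      # (closid-A, rmid-Y) is representable
try:
    add_mon_group(groups, group_b, RMID_Y)  # (closid-B, rmid-Y) is not
    conflict = False
except ValueError:
    conflict = True
```

Under that assumption the second call fails, which is the case I mean.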

>
> >> It looks like MPAM has a few more capabilities here and the Arm levels are numbered differently
> >> with EL0 meaning user space. We should thus aim to keep things as generic as possible. For example,
> >> instead of CPL0 using something like "kernel" or ... ?
> >
> > Yes, PLZA does open up more possibilities for MPAM usage. I've talked to James
> > internally and here are a few thoughts.
> >
> > If the use case is just an option to run all tasks with the same closid/rmid
> > (partid/pmg) configuration when they are running in the kernel then I'd favour a
> > mount option. The resctrl filesystem interface doesn't need to change and
>
> I view mount options as an interface of last resort. Why would a mount option be needed
> in this case? The existence of the file used to configure the feature seems sufficient?

If we are taking away a closid from the user then the number of CTRL_MON groups
that can be created changes. It seems reasonable for user-space to expect
num_closid to be a fixed value.

>
> Also ...
>
> I do not think resctrl should unnecessarily place constraints on what the hardware
> features are capable of. As I understand, both PLZA and MPAM support use cases where
> tasks may use different CLOSID/RMID (PARTID/PMG) when running in the kernel. Limiting
> this to only one CLOSID/PARTID seems like an unmotivated constraint to me at the moment.
> This may be because I am not familiar with all the requirements here so please do
> help with insight on how the hardware feature is intended to be used as it relates
> to its design.
>
> We have to be very careful when constraining a feature this much. If resctrl does something
> like this it essentially restricts what users could do forever.

Indeed, we don't want to unnecessarily restrict ourselves here. I was hoping a
fixed kernel CLOSID/RMID configuration option might give us all we need for the
use cases we know we have and be minimally intrusive enough not to preclude a
more featureful PLZA later when new use cases come about.

One complication with a fixed kernel CLOSID/RMID option is that on x86 you
may want to be able to monitor a task's resource usage whether it is in
the kernel or in userspace, and so only fix the CLOSID. However, for MPAM this
wouldn't work as PMG (~RMID) is scoped to PARTID (~CLOSID).
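
A minimal sketch of the difference, as a model only. The function and the
numeric IDs are hypothetical. It assumes x86 monitoring data is identified by
the RMID alone, while MPAM monitoring data is identified by the (PARTID, PMG)
pair:

```python
def monitor_key(arch, ctrl_id, mon_id):
    """Return the identity under which monitoring data is attributed."""
    if arch == "x86":
        return (mon_id,)          # RMID is independent of CLOSID
    if arch == "mpam":
        return (ctrl_id, mon_id)  # PMG only has meaning within a PARTID
    raise ValueError(f"unknown arch: {arch}")


USER_CTRL, KERNEL_CTRL, MON = 1, 2, 5  # arbitrary illustrative IDs

# Same monitoring ID, different control ID for user vs kernel.
x86_shared = monitor_key("x86", USER_CTRL, MON) == monitor_key("x86", KERNEL_CTRL, MON)
mpam_shared = monitor_key("mpam", USER_CTRL, MON) == monitor_key("mpam", KERNEL_CTRL, MON)
```

In this model x86 attributes user and kernel usage to the same monitoring
identity, whereas on MPAM changing the PARTID changes the identity even though
the PMG value is unchanged.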

>
> > userspace software doesn't need to change. This could either take away a
> > closid/rmid from userspace and dedicate it to the kernel or perhaps have a
> > policy to have the default group as the kernel group. If you use the default
>
> Similar to above I do not see PLZA or MPAM preventing sharing of CLOSID/RMID (PARTID/PMG)
> between user space and kernel. I do not see a motivation for resctrl to place such
> constraint.
>
> > configuration, at least for MPAM, the kernel may not be running at the highest
> > priority as a minimum bandwidth can be used to give a priority boost. (Once we
> > have a resctrl schema for this.)
> >
> > It could be useful to have something a bit more featureful though. Is there a
> > need for the two mappings, task->cpl0 config and task->cpl1 to be independent or
> > would a task->(cpl0 config, cpl1 config) be sufficient? It seems awkward that
> > it's not a single write to move a task. If a single mapping is sufficient, then
>
> Moving a task in x86 is currently two writes by writing the CLOSID and RMID separately.
> I think the MPAM approach is better and there may be opportunity to do this in a similar
> way and both architectures use the same field(s) in the task_struct.

I was referring to the userspace file write, but unifying on the same fields in
task_struct could be good. The single write is necessary for MPAM as PMG is
scoped to PARTID, and I don't think x86 behaviour changes if it moves to the same
approach.

>
> > a single new file, kernel_group, per CTRL_MON group (maybe MON groups) as
> > suggested above but rather than a task that file could hold a path to the
> > CTRL_MON/MON group that provides the kernel configuration for tasks running in
> > that group. So that this can be transparent to existing software an empty string
>
> Something like this would force all tasks of a group to run with the same CLOSID/RMID
> (PARTID/PMG) when in kernel space. This seems to restrict what the hardware supports
> and may reduce the possible use case of this feature.
>
> For example,
> - There may be a scenario where there is a set of tasks with a particular allocation
> when running in user space but when in kernel these tasks benefit from different
> allocations. Consider for example below arrangement where tasks 1, 2, and 3 run in
> user space with allocations from resource_groupA. While these tasks are ok with this
> allocation when in user space they have different requirements when it comes to
> kernel space. There may be a resource_groupB that allocates a lot of resources ("high
> priority") that task 1 should use for kernel work and a resource_groupC that allocates
> fewer resources that tasks 2 and 3 should use for kernel work ("medium priority").
>
> resource_groupA:
>   schemata: <average allocations that work for tasks 1, 2, and 3 when in user space>
>   tasks when in user space: 1, 2, 3
>
> resource_groupB:
>   schemata: <high priority allocations>
>   tasks when in kernel space: 1
>
> resource_groupC:
>   schemata: <medium priority allocations>
>   tasks when in kernel space: 2, 3

I'm not sure if this would happen in the real world or not.

>
> If user space is forced to have the same tasks have the same user space and kernel
> allocations then that will force user space to create additional resource groups that
> will use up CLOSID/PARTID that is a scarce resource.

This may be undesirable even if CLOSID/PARTID were unlimited as controls which set
a per-CLOSID/PARTID maximum don't have the same effect if the tasks are spread across
more than one CLOSID/PARTID.
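
As a back-of-the-envelope illustration: the helper below is hypothetical and
assumes each group's maximum is enforced independently, which is a
simplification of how real controls behave:

```python
def combined_ceiling(per_group_max_percent, num_groups):
    """Upper bound on the combined usage of tasks spread over `num_groups`
    groups, each capped at `per_group_max_percent`, assuming independent caps."""
    return min(100, per_group_max_percent * num_groups)


one_group = combined_ceiling(40, 1)   # all tasks share a single 40% cap
two_groups = combined_ceiling(40, 2)  # same tasks split over two 40% caps
```

With the tasks in one group they are collectively held to the single cap. Once
they are split over two groups with the same setting, the collective ceiling in
this model doubles.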

>
> - There may be a scenario where the user is attempting to understand system behavior by
> monitoring individual or subsets of tasks' bandwidth usage when in kernel space.

This seems useful to me.

>
> - From what I can tell PLZA also supports *different* allocations when in user vs
> kernel space while using the *same* monitoring group for both. This does not seem
> transferable to MPAM and would take more effort to support in resctrl but it is
> a use case that the hardware enables.

Ah yes, I think this ends the 'kernel_group' idea then. I was too focused on
MPAM and forgot to consider the case where PMG and PARTID are independent.

>
> When enabling a feature I would of course prefer not to add unnecessary complexity. Even so,
> resctrl is expected to expose hardware capabilities to user space. There seems to be some
> opinions on how user space will now and forever interact with these features that
> are not clear to me so I would appreciate more insight in why these constraints are
> appropriate.

Yes, care definitely needs to be taken here in order to not back ourselves into
a corner.

>
> Reinette
>
> > can mean use the current group's when in the kernel (as well as for
> > userspace). A slash, /, could be used to refer to the default group. This would
> > give something like the below under /sys/fs/resctrl.
> >
> > .
> > ├── cpus
> > ├── tasks
> > ├── ctrl1
> > │   ├── cpus
> > │   ├── kernel_group -> mon_groups/mon1
> > │   └── tasks
> > ├── kernel_group -> ctrl1
> > └── mon_groups
> >     └── mon1
> >         ├── cpus
> >         ├── kernel_group -> ctrl1
> >         └── tasks
> >
> >>
> >> I have not read anything about the RISC-V side of this yet.
> >>
> >> Reinette
> >>
> >>>
> >>> Reinette
> >>>
> >>> [1] https://lore.kernel.org/lkml/aXpgragcLS2L8ROe@agluck-desk3/
> >>
> >
> > Thanks,
> >
> > Ben
>

Thanks,

Ben