Re: [RFC PATCH 13/19] x86/resctrl: Add PLZA state tracking and context switch handling

From: Reinette Chatre

Date: Thu Feb 12 2026 - 13:38:13 EST


Hi Ben,

On 2/12/26 5:55 AM, Ben Horgan wrote:
> On Wed, Feb 11, 2026 at 02:22:55PM -0800, Reinette Chatre wrote:
>> On 2/11/26 8:40 AM, Ben Horgan wrote:
>>> On Tue, Feb 10, 2026 at 10:04:48AM -0800, Reinette Chatre wrote:
>>>> On 2/10/26 8:17 AM, Reinette Chatre wrote:
>>>>> On 1/28/26 9:44 AM, Moger, Babu wrote:
>>>>>>
>>>>>>
>>>>>> On 1/28/2026 11:41 AM, Moger, Babu wrote:
>>>>>>>> On Wed, Jan 28, 2026 at 10:01:39AM -0600, Moger, Babu wrote:
>>>>>>>>> On 1/27/2026 4:30 PM, Luck, Tony wrote:
>>>>>>>> Babu,
>>>>>>>>
>>>>>>>> I've read a bit more of the code now and I think I understand more.
>>>>>>>>
>>>>>>>> Some useful additions to your explanation.
>>>>>>>>
>>>>>>>> 1) Only one CTRL group can be marked as PLZA
>>>>>>>
>>>>>>> Yes. Correct.
>>>>>
>>>>> Why limit it to one CTRL_MON group and why not support it for MON groups?
>>>>>
>>>>> Limiting it to a single CTRL group seems restrictive in a few ways:
>>>>> 1) It requires that the "PLZA" group has a dedicated CLOSID. This reduces the
>>>>> number of use cases that can be supported. Consider, for example, an existing
>>>>> "high priority" resource group and a "low priority" resource group. The user may
>>>>> just want to let the tasks in the "low priority" resource group run as "high priority"
>>>>> when in CPL0. This of course may depend on what resources are allocated, for example
>>>>> cache may need more care, but if, for example, user is only interested in memory
>>>>> bandwidth allocation this seems a reasonable use case?
>>>>> 2) Similar to what Tony [1] mentioned this does not enable what the hardware is
>>>>> capable of in terms of number of different control groups/CLOSID that can be
>>>>> assigned to MSR_IA32_PQR_PLZA_ASSOC. Why limit PLZA to one CLOSID?
>>>>> 3) The feature seems to support RMID in MSR_IA32_PQR_PLZA_ASSOC similar to
>>>>> MSR_IA32_PQR_ASSOC. With this, it should be possible for user space to, for
>>>>> example, create a resource group that contains tasks of interest and create
>>>>> a monitor group within it that monitors all tasks' bandwidth usage when in CPL0.
>>>>> This will give user space better insight into system behavior and from what I can
>>>>> tell is supported by the feature but not enabled?
>>>>>
>>>>>>>
>>>>>>>> 2) It can't be the root/default group
>>>>>>>
>>>>>>> This is something I added to keep the default group in a un-disturbed,
>>>>>
>>>>> Why was this needed?
>>>>>
>>>>>>>
>>>>>>>> 3) It can't have sub monitor groups
>>>>>
>>>>> Why not?
>>>>>
>>>>>>>> 4) It can't be pseudo-locked
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>>
>>>>>>>> Would a potential use case involve putting *all* tasks into the PLZA group? That
>>>>>>>> would avoid any additional context switch overhead as the PLZA MSR would never
>>>>>>>> need to change.
>>>>>>>
>>>>>>> Yes. That can be one use case.
>>>>>>>
>>>>>>>>
>>>>>>>> If that is the case, maybe for the PLZA group we should allow user to
>>>>>>>> do:
>>>>>>>>
>>>>>>>> # echo '*' > tasks
>>>>>
>>>>> Dedicating a resource group to "PLZA" seems restrictive while also adding many
>>>>> complications since this designation makes resource group behave differently and
>>>>> thus the files need to get extra "treatments" to handle this "PLZA" designation.
>
> As I commented on another thread, I'm wary of this reuse of existing file types
> as they can confuse existing user-space tools.

I agree. Changing how user space interacts with existing files is a change that would
require a mount option and this can be avoided by using new files instead.

>>>>> I am wondering if it will not be simpler to introduce just one new file, for example
>>>>> "tasks_cpl0" in both CTRL_MON and MON groups. When user space writes a task ID to the
>>>>> file it "enables" PLZA for this task and that group's CLOSID and RMID is the associated
>>>>> task's "PLZA" CLOSID and RMID. This gives user space the flexibility to use the same
>>>>> resource group to manage user space and kernel space allocations while also supporting
>>>>> various monitoring use cases. This still supports the "dedicate a resource group to PLZA"
>>>>> use case where user space can create a new resource group with certain allocations but the
>>>>> "tasks" file will be empty and "tasks_cpl0" contains the tasks needing to run with
>>>>> the resource group's allocations when in CPL0.
>>>
>>> If there is a "tasks_cpl0" then I'd expect a "cpus_cpl0" too.
>>
>> That is reasonable, yes.
>
> I think the "tasks_cpl0" approach suffers from one of the same faults as the
> "kernel_groups" approach. If you want to run a task with userspace configuration
> closid-A rmid-Y but to run in kernel space in closid-B but the same rmid-Y then
> there can't exist monitor_group in resctrl for both.

This assumes that "tasks" and "tasks_cpl0"/"tasks_kernel" have the same rules for
task assignment. When a user assigns a task to the "tasks" file of a MON group it
is required that the task is a member of the parent CTRL_MON group and if so, that
task's CLOSID and RMID are both updated. Theoretically there could be different rules
for task assignment to the "tasks_cpl0"/"tasks_kernel" file that does not place such
restriction and only updates CLOSID when moving to a CTRL_MON group and only updates
RMID when moving to a MON group.

You are correct that resctrl cannot have monitor groups to track such configuration
and there may indeed be some consequences that I have not considered.

I understand this is not something that MPAM can support and I also do not know if this
is even a valid use case. If doing something like this user space will need to take care
since the monitoring data will be presented with the allocations used when tasks are in
user space but also contain the monitoring data for allocations used when tasks are in
kernel space that are tracked in another control group hierarchy (to which I expect the
task's kernel space monitoring can move when the MON group is deleted).


>>>> It looks like MPAM has a few more capabilities here and the Arm levels are numbered differently
>>>> with EL0 meaning user space. We should thus aim to keep things as generic as possible. For example,
>>>> instead of CPL0 using something like "kernel" or ... ?
>>>
>>> Yes, PLZA does open up more possibilities for MPAM usage. I've talked to James
>>> internally and here are a few thoughts.
>>>
>>> If the user case is just that an option run all tasks with the same closid/rmid
>>> (partid/pmg) configuration when they are running in the kernel then I'd favour a
>>> mount option. The resctrl filesytem interface doesn't need to change and
>>
>> I view mount options as an interface of last resort. Why would a mount option be needed
>> in this case? The existence of the file used to configure the feature seems sufficient?
>
> If we are taking away a closid from the user then the number of CTRL_MON groups
> that can be created changes. It seems reasonable for user-space to expect
> num_closid to be a fixed value.

I do you see why we need to take away a CLOSID from the user. Consider a user space that
runs with just two resource groups, for example, "high priority" and "low priority", it seems
reasonable to make it possible to let the "low priority" tasks run with "high priority"
allocations when in kernel space without needing to dedicate a new CLOSID? More reasonable
when only considering memory bandwidth allocation though.

>
>>
>> Also ...
>>
>> I do not think resctrl should unnecessarily place constraints on what the hardware
>> features are capable of. As I understand, both PLZA and MPAM supports use case where
>> tasks may use different CLOSID/RMID (PARTID/PMG) when running in the kernel. Limiting
>> this to only one CLOSID/PARTID seems like an unmotivated constraint to me at the moment.
>> This may be because I am not familiar with all the requirements here so please do
>> help with insight on how the hardware feature is intended to be used as it relates
>> to its design.
>>
>> We have to be very careful when constraining a feature this much If resctrl does something
>> like this it essentially restricts what users could do forever.
>
> Indeed, we don't want to unnecessarily restrict ourselves here. I was hoping a
> fixed kernel CLOSID/RMID configuration option might just give all we need for
> usecases we know we have and be minimally intrusive enough to not preclude a
> more featureful PLZA later when new usecases come about.

Having ability to grow features would be ideal. I do not see how a fixed kernel CLOSID/RMID
configuration leaves room to build on top though. Could you please elaborate?

I wonder if the benefit of the fixed CLOSID/RMID is perhaps mostly in the cost of
context switching which I do not think is a concern for MPAM but it may be for PLZA?

One option to support fixed kernel CLOSID/RMID at the beginning and leave room to build
may be to create the kernel_group or "tasks_kernel" interface as a baseline but in first
implementation only allow user space to write the same group to all "kernel_group" files or
to only allow to write to one of the "tasks_kernel" files in the resctrl fs hierarchy. At
that time the associated CLOSID/RMID would become the "fixed configuration" and attempts to
write to others can return "ENOSPC"?

>From what I can tell this still does not require to take away a CLOSID/RMID from user space
though. Dedicating a CLOSID/RMID to kernel work can still be done but be in control of user
that can, for example leave the "tasks" and "cpus" files empty.

> One complication is that for fixed kernel CLOSID/RMID option is that for x86 you
> may want to be able to monitor a tasks resource usage whether or not it is in
> the kernel or userspace and so only have a fixed CLOSID. However, for MPAM this
> wouldn't work as PMG (~RMID) is scoped to PARTID (~CLOSID).
>
>>
>>> userspace software doesn't need to change. This could either take away a
>>> closid/rmid from userspace and dedicate it to the kernel or perhaps have a
>>> policy to have the default group as the kernel group. If you use the default
>>
>> Similar to above I do not see PLZA or MPAM preventing sharing of CLOSID/RMID (PARTID/PMG)
>> between user space and kernel. I do not see a motivation for resctrl to place such
>> constraint.
>>
>>> configuration, at least for MPAM, the kernel may not be running at the highest
>>> priority as a minimum bandwidth can be used to give a priority boost. (Once we
>>> have a resctrl schema for this.)
>>>
>>> It could be useful to have something a bit more featureful though. Is there a
>>> need for the two mappings, task->cpl0 config and task->cpl1 to be independent or
>>> would as task->(cp0 config, cp1 config) be sufficient? It seems awkward that
>>> it's not a single write to move a task. If a single mapping is sufficient, then
>>
>> Moving a task in x86 is currently two writes by writing the CLOSID and RMID separately.
>> I think the MPAM approach is better and there may be opportunity to do this in a similar
>> way and both architectures use the same field(s) in the task_struct.
>
> I was referring to the userspace file write but unifying on a the same fields in
> task_struct could be good. The single write is necessary for MPAM as PMG is
> scoped to PARTID and I don't think x86 behaviour changes if it moves to the same
> approach.
>

ah - I misunderstood. You are suggesting to have one file that user writes to
to set both user space and kernel space CLOSID/RMID? This sounds like what the
existing "tasks" file does but only supports the same CLOSID/RMID for both user
space and kernel space. To support the new hardware features where the CLOSID/RMID
can be different we cannot just change "tasks" interface and would need to keep it
backward compatible. So far I assumed that it would be ok for the "tasks" file
to essentially get new meaning as the CLOSID/RMID for just user space work, which
seems to require a second file for kernel space as a consequence? So far I have
not seen an option that does not change meaning of the "tasks" file.

>>> as single new file, kernel_group,per CTRL_MON group (maybe MON groups) as
>>> suggested above but rather than a task that file could hold a path to the
>>> CTRL_MON/MON group that provides the kernel configuraion for tasks running in
>>> that group. So that this can be transparent to existing software an empty string
>>
>> Something like this would force all tasks of a group to run with the same CLOSID/RMID
>> (PARTID/PMG) when in kernel space. This seems to restrict what the hardware supports
>> and may reduce the possible use case of this feature.
>>
>> For example,
>> - There may be a scenario where there is a set of tasks with a particular allocation
>> when running in user space but when in kernel these tasks benefit from different
>> allocations. Consider for example below arrangement where tasks 1, 2, and 3 run in
>> user space with allocations from resource_groupA. While these tasks are ok with this
>> allocation when in user space they have different requirements when it comes to
>> kernel space. There may be a resource_groupB that allocates a lot of resources ("high
>> priority") that task 1 should use for kernel work and a resource_groupC that allocates
>> fewer resources that tasks 2 and 3 should use for kernel work ("medium priority").
>>
>> resource_groupA:
>> schemata: <average allocations that work for tasks 1, 2, and 3 when in user space>
>> tasks when in user space: 1, 2, 3
>>
>> resource_groupB:
>> schemata: <high priority allocations>
>> tasks when in kernel space: 1
>>
>> resource_groupC:
>> schemata: <medium priority allocations>
>> tasks when in kernel space: 2, 3
>
> I'm not sure if this would happen in the real world or not.

Ack. I would like to echo Tony's request for feedback from resctrl users
https://lore.kernel.org/lkml/aYzcpuG0PfUaTdqt@agluck-desk3/

>
>>
>> If user space is forced to have the same tasks have the same user space and kernel
>> allocations then that will force user space to create additional resource groups that
>> will use up CLOSID/PARTID that is a scarce resource.
>
> This may be undesirable even if CLOSID/PARTID were unlimited as controls which set
> a per-CLOSID/PARTID maximum don't have the same effect if the tasks are spread across
> more than one CLOSID/PARTID.

Thank you for bringing this up. I did not consider the mechanics of the memory bandwidth
controls.

>
>>
>> - There may be a scenario where the user is attempting to understand system behavior by
>> monitoring individual or subsets of tasks' bandwidth usage when in kernel space.
>
> This seems useful to me.
>
>>
>> - From what I can tell PLZA also supports *different* allocations when in user vs
>> kernel space while using the *same* monitoring group for both. This does not seem
>> transferable to MPAM and would take more effort to support in resctrl but it is
>> a use case that the hardware enables.
>
> Ah yes, I think this ends the 'kernel_group' idea then. I was too focused on
> MPAM and forgotten to consider the case where PMG and PARTID are independent.

Of course we would want user space to have consistent experience from resctrl no matter the
architecture so these places where architectures behave different needs more care.

>> When enabling a feature I would of course prefer not to add unnecessary complexity. Even so,
>> resctrl is expected to expose hardware capabilities to user space. There seems to be some
>> opinions on how user space will now and forever interact with these features that
>> are not clear to me so I would appreciate more insight in why these constraints are
>> appropriate.
>
> Yes, care definitely needs to be taken here in order to not back ourselves into
> a corner.

I really appreciate the discussions to help create a useful interface.

Reinette

>
>>
>> Reinette
>>
>>> can mean use the current group's when in the kernel (as well as for
>>> userspace). A slash, /, could be used to refer to the default group. This would
>>> give something like the below under /sys/fs/resctrl.
>>>
>>> .
>>> ├── cpus
>>> ├── tasks
>>> ├── ctrl1
>>> │   ├── cpus
>>> │   ├── kernel_group -> mon_groups/mon1
>>> │   └── tasks
>>> ├── kernel_group -> ctrl1
>>> └── mon_groups
>>> └── mon1
>>> ├── cpus
>>> ├── kernel_group -> ctrl1
>>> └── tasks
>>>
>>>>
>>>> I have not read anything about the RISC-V side of this yet.
>>>>
>>>> Reinette
>>>>
>>>>>
>>>>> Reinette
>>>>>
>>>>> [1] https://lore.kernel.org/lkml/aXpgragcLS2L8ROe@agluck-desk3/
>>>>
>>>
>>> Thanks,
>>>
>>> Ben
>>
>
> Thanks,
>
> Ben