Re: [RFC] mpam,x86,fs/resctrl: Generic schema description Proof of Concept
From: Reinette Chatre
Date: Wed Jun 24 2026 - 18:23:19 EST
Hi Fenghua,
On 6/24/26 12:08 PM, Fenghua Yu wrote:
> Hi, Reinette, Ben, Shaopen, et al,
>
> On 5/29/26 11:06, Reinette Chatre wrote:
>
> As Shaopen and Ben mentioned earlier, we are working on two MPAM
> features that may need to change the schemata interface. The CPU-less
> feature was discussed at LPC (although the interfaces will be
> slightly different from what was presented at LPC).
I know. Here is where I tried to engage with you on needed interfaces after LPC:
https://lore.kernel.org/lkml/fb1e2686-237b-4536-acd6-15159abafcba@xxxxxxxxx/
> Hardlimit feature was not discussed yet.
It was considered. See discussion starting at
https://lore.kernel.org/lkml/1c4b6b46-16f9-4887-93f5-e0f5e7f30a6f@xxxxxxxxx/
> It's good to discuss this further in this RFC thread before the new features RFCs will be sent out.
ack.
>
> Overall, the new features can fit into this RFC well.
>
>> Hi Everybody,
>>
>> It has been a while since we discussed the resctrl changes required to support
>> hardware that has controls with fine granularity or hardware that has multiple
>> controls per resource. For reference, the most recent email discussion can
>> be found at [1] with a summary of discussions in last year's plumbers slides [2].
>>
>> I created a PoC that I believe supports what folks have agreed to so far. I
>> hope this can help us to restart the discussion with the goal that resctrl gains
>> support for upcoming hardware that requires these features.
>>
>> Request regarding this PoC
>> ==========================
>>
>> Please consider this PoC as a "direction check" on the schema description and multiple
>> control discussions held thus far.
>>
>> Could folks working on enabling new hardware requiring this capability please consider
>> if this is something you can build on and how it should be improved to support these
>> upcoming capabilities?
>>
>> Opens
>> =====
>>
>> While the PoC aims to support what folks agreed on, some opens remain:
>> - I attempted to make some MPAM supporting changes but these are all just compile
>> tested. While MPAM should benefit from the new control properties I did not
>> initialize them on MPAM and did not attempt a refactor to separate out
>> the architecture specific control properties (more on what this means later).
>> I did attempt some MPAM refactoring that duplicates the MPAM domain to the
>> control domain and monitoring domain lists in support of there being multiple
>> controls each with its own list of control domains but it is definitely not good
>> design.
>> - No support for emulated controls (yet). The PoC is quite large already
>> but I think it can be used as a base for emulated controls for which the software
>> controller could be a potential first customer. In this PoC mounting with
>> software controller will still display the original controller's properties.
>> - One open that needs to be addressed as part of support for emulated controls is
>> how best to display emulation relationship via resctrl hierarchy.
>> - No support for "read-modify-write" usage of schemata file. This is where we
>> discussed (without agreement) on possibly introducing the "#" prefix to schemata
>> file entries. This PoC does not support this prefix and the current assumption/expectation
>> is that when user space changes a configuration only the new control values are
>> written to schemata file. I thus do not have a plan to support this so please
>> share opinions in this regard if you have some.
>> - Controls are independent for now. This means that, for example, if a resource
>> supports a "MIN" and "MAX" control then this implementation would allow user to
>> set the "maximum" control values to be less than the "minimum" control values.
>> - PoC supports the "bitmap" control but does not (yet) expose properties of a bitmap
>> control to the new info/<resource>/resource_schemata directory.
>>
>> Accessing PoC
>> =============
>>
>> Please consider the PoC as a rough draft. It has only been compile tested for Arm
>> and known to be incomplete in Arm support. To help with experimenting I only
>> fully adapted the Intel MBA resource to demo two dummy additional MBA controls.
>> All architectures should immediately benefit from the new schema descriptions
>> and new info/MB/resource_schemata hierarchy.
>>
>> I considered the patches themselves too many for email. Instead, the PoC can be found at:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/reinette/linux.git branch resctrl/controls_rfc_v1
>>
>> The work is based on v7.1-rc2 and also includes the following series (two of which have
>> since been queued):
>>
>> "selftests/resctrl: Fixes and improvements focused on Intel platforms"
>> https://lore.kernel.org/lkml/cover.1775266384.git.reinette.chatre@xxxxxxxxx/
>>
>> "x86,fs/resctrl: Improve resctrl quality and consistency"
>> https://lore.kernel.org/lkml/cover.1777419024.git.reinette.chatre@xxxxxxxxx/
>>
>> "x86,fs/resctrl: Pave the way for MPAM counter assignment"
>> https://lore.kernel.org/lkml/20260506082855.3694761-1-ben.horgan@xxxxxxx/
>>
>>
>> Primary resctrl fs data structure changes
>> =========================================
>>
>> Introduces a control represented by struct resctrl_ctrl that looks as below. To make
>> the changes easier to follow I kept some of the original names to help communicate
>> where familiar data structures land.
>>
>> What to notice about a control is that it has some common properties required
>> from all controls (scope, type, etc.) and then depending on the type of control
>> (RESCTRL_CTRL_BITMAP or RESCTRL_CTRL_SCALAR) there are type specific properties.
>>
>> /**
>> * struct resctrl_ctrl - A resource control
>> * @entry: List entry of rdt_resource::controls
>> * @scope: Scope of the resource that this control allocates
>> * @domains: RCU list of all control domains
>> * @type: The control type that determines the properties of the control,
>> * format string for displaying control values to user space, and
>> * parser of control values provided by user space.
>> * @name: Name of the control. Appended to final resource name
>> * (rdt_resource_final::name) to create final schema entry.
>> * Specifically, "rdt_resource_final::name"_"resctrl_ctrl::name".
>> * For example, with resource name "MB" and control name "MAX" the
>> * schema entry will be "MB_MAX".
>> * @cache: Cache allocation control properties.
>> * @membw: Bandwidth control properties.
>> */
>> struct resctrl_ctrl {
>> struct list_head entry;
>> enum resctrl_scope scope;
>> struct list_head domains;
>> enum resctrl_ctrl_type type;
>> enum resctrl_ctrl_name name;
>> union {
>> struct resctrl_cache cache;
>> struct resctrl_membw membw;
>> };
>> };
>>
>> Two members summarize how this new structure fits into the rest of resctrl:
>> a) resctrl_ctrl::entry
>> Since a resource can support multiple controls there is a new list
>> in struct rdt_resource named "controls" that contains the list of all
>> controls supported by the resource.
>> b) resctrl_ctrl::domains
>> Instead of the list of control domains belonging to a resource they
>> now belong to the control itself. By doing so resctrl can support resource
>> controls at different scope for the same resource. This is intended to
>> support some upcoming MPAM and RISC-V usages.
>>
>> Example architectural data structure changes
>> ============================================
>>
>> An architecture can use the new control by following a similar pattern to
>> resource and domain use by architectures. Consider the following for x86
>> where a new architecture specific struct resctrl_hw_ctrl includes
>> struct resctrl_ctrl and any architecture private data needed to support
>> the control:
>>
>> /*
>> * struct resctrl_hw_ctrl - Arch private properties of a resource control
>> * @r_ctrl: Control properties exposed to resctrl file system
>> * @msr_base: Base MSR address where control values should be programmed
>> * @msr_update: Function pointer to update control values
>> */
>> struct resctrl_hw_ctrl {
>> struct resctrl_ctrl r_ctrl;
>> unsigned int msr_base;
>> void (*msr_update)(struct msr_param *m);
>> };
>>
>> Structure of patch series
>> =========================
>>
>> As a PoC the series is not perfectly structured but to help navigate this work
>> on a high level the changes can be categorized as follows:
>>
>> Patch 1 to 11:
>> With a vision of what a "control" is, remove unused/unnecessary
>> members, make clear what is a *resource* property vs a *control*
>> property, do some renaming to help with the PoC.
>>
>> Patch 12:
>> Introduce struct resctrl_ctrl and re-arrange existing struct rdt_resource
>> members to form part of new rdt_resource::ctrl
>>
>> Patch 13 to 44:
>> A lot of wrangling to introduce struct resctrl_ctrl to all code that needs
>> to work with a control and/or domain without assuming that the control is
>> the one and only control embedded in the resource it belongs to. Essentially,
>> a lot of changes passing the control around in addition to the resource/domain.
>>
>> Patch 45:
>> Switch the single struct resctrl_ctrl member of struct rdt_resource to be
>> a list of struct resctrl_ctrl.
>>
>> Patch 47 to 49:
>> Introduce new info/<resource>/resource_schemata hierarchy to first only
>> consist of properties already known to resctrl fs.
>>
>> Patch 50 to 52:
>> Introduce the new control properties per [1], initialize them for x86,
>> and expose them via info/<resource>/resource_schemata
>>
>> Patch 53:
>> Let the new struct resctrl_hw_ctrl contain architecture's control properties.
>>
>> Patch 54:
>> Teach resctrl fs about "MIN" and "MAX" controls.
>>
>> Patch 55:
>> Sample of "MIN" and "MAX" memory bandwidth controls for x86.
>>
>> Example interactions
>> ====================
>>
>> This series can be used on an x86 system where it will show two new dummy controls
>> where it is possible to interact with the new controls.
>> For example:
>>
>> # cat schemata
>> MB_MAX:0=100;1=100
>> MB_MIN:0=100;1=100
>> MB:0=100;1=100
>
> Some platforms may support CPU-less nodes, which are identified by numa node id. Examples:
> 1. A CXL type 2 memory node, which provides CXL memory without a CPU or L3 on the node
> 2. A GPU memory node that can be accessed by all CPUs but has no local CPU or L3 bound to it.
> etc.
>
> MPAM can allocate and monitor mem bandwidth on these memory nodes.
> Since there is no CPU or L3 on the node, a cache id cannot be used in the "MB:" line. Instead, numa ids are used to identify MB allocation and monitoring.
>
> For example, the MB allocation on CPU-less platforms could be:
> MB:0=100;1=100;2=100;10=100;18=100;26=100
>
> Where: domain ids 0, 1, 2, etc. are the numa node ids shown in the /sys/devices/system/node directory or by numactl.
> 0: socket 0, node 0, CPUs, memory
> 1: socket 1, node 1, CPUs, memory
> 2: GPU 0, node 2, no CPU, memory only
> 10: GPU 1, node 10, no CPU, memory only
> 18: GPU 2, node 18, no CPU, memory only
> 26: GPU 3, node 26, no CPU, memory only
>
> The arch specific driver (e.g. MPAM) detects CPU-less nodes. If there is any CPU-less node, it uses the numa id in "MB:". Otherwise, it falls back to the legacy cache id.
We always have to consider backward compatibility and to do so we cannot just retroactively
change what domain ID represents when user space interacts with the "MB" control.
The legacy "MB" control is already defined and its domain ID represents an L3 cache ID. To
support these new devices resctrl would need to expose a new control.
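To illustrate the point, here is a rough user space sketch (not part of the PoC) of how a tool could interpret a domain ID from the scope of the control it belongs to, instead of assuming one meaning per resource. The function name and the "NODE" scope string are assumptions made for this example; "L3" is what the PoC reports in the scope file today.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: describe what a schemata domain ID refers to,
 * based on the scope string of the control it belongs to.
 * "L3" matches the PoC's scope file; "NODE" is an assumed future value.
 * Returns 0 on success, -1 if the scope is unknown or buf is too small.
 */
int describe_domain(const char *scope, unsigned int id,
		    char *buf, size_t len)
{
	int n;

	if (!strcmp(scope, "L3"))
		n = snprintf(buf, len, "L3 cache id %u", id);
	else if (!strcmp(scope, "NODE"))
		n = snprintf(buf, len, "NUMA node %u", id);
	else
		return -1;

	/* snprintf() reports truncation by returning >= len */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With something like this the legacy "MB" control keeps reporting L3 cache IDs while a new control can carry a different scope without changing what existing users see.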
>
> There is another MPAM feature called MBW Max hardlimit which sets
> "MB:" allocation as hardlimit (i.e. MBW throttling percentage must
> be satisfied) per domain. This adds a new "MB_HLIM:" line to schemata.
> It is mapped 1:1 to "MB:" to control the hardlimit of MB throttling
> percentage on each domain. By default hardlimit is off (0) and can
> be turned on to set MBW Max hardlimit on a domain.
ack. This sounds like a new control associated with the MB resource.
This is a boolean control as Dave highlighted in previous discussion so
resctrl would need to know its properties.
See https://lore.kernel.org/lkml/aO0Oazuxt54hQFbx@xxxxxxxxxxxxxxx/
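As a rough sketch of what "knowing its properties" could mean for a boolean control: the parser for such a control would accept only 0 or 1 per domain. The snippet below is plain user space C and not PoC code; the function name is made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical parser for one value of a boolean control such as the
 * proposed MB_HLIM: only the exact strings "0" and "1" are accepted.
 * Returns 0 and stores the result in *val, or -EINVAL on bad input.
 */
int parse_bool_ctrl_val(const char *s, bool *val)
{
	/* Rejects NULL, empty strings, other digits, and trailing characters */
	if (!s || (s[0] != '0' && s[0] != '1') || s[1] != '\0')
		return -EINVAL;

	*val = (s[0] == '1');
	return 0;
}
```

A scalar control would instead need range and resolution checks, which is why the type of the control has to be known to resctrl.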
> For example:
> MB_HLIM: 0=0;1=0;2=1;10=0;18=0;26=0
> MB:0=100;1=100;2=80;10=100;18=100;26=100
>
> On GPU memory numa node 2: cannot use more than 80% of total max mbw (even if there is still idle mem bandwidth on this node).
>
> MBW allocations on all other domains are soft limited, meaning MBW can be used more than specified if mem is idle.
>
ack.
>> L3:0=fff;1=fff
>> # echo 'MB_MIN:0=50' > schemata
>> # cat schemata
>> MB_MAX:0=100;1=100
>> MB_MIN:0=50;1=100
>> MB:0=100;1=100
>> L3:0=fff;1=fff
>>
>> Writing to the dummy control will call a dummy callback that just prints to the
>> kernel log:
>> "resctrl: Updata temporary MIN control on domain 0 with user value 50"
>>
>>
>> Example output of info/MB/:
>> /sys/fs/resctrl/info/MB/thread_throttle_mode:max
>> /sys/fs/resctrl/info/MB/num_closids:15
>> /sys/fs/resctrl/info/MB/delay_linear:1
>> /sys/fs/resctrl/info/MB/min_bandwidth:10
>
> Add two new MB info RO files:
> 1. /sys/fs/resctrl/info/MB/domain_id
> It shows "numa" for using numa id in "MB:" or "cache" for using legacy cache id.
This proposal introduces a *global* property to the MB *resource*? It does not seem as though
this takes into account *anything* about how resctrl can support new hardware that has been
discussed before, during, or after LPC. You have not participated in these discussions and
now make an orthogonal proposal that does not take into account *any* of the requirements
that we have been struggling with for months.
Why should this proposal be taken seriously? In your absence folks have been trying to
accommodate how these upcoming products can be supported and the "scope" file associated with
a control is intended to communicate to user space how the domain ID should be interpreted.
Why are you proposing something entirely different here without even acknowledging current
approach and explaining why it does not work for you?
>
> 2. /sys/fs/resctrl/info/MB/max_lim
> It shows number 0-3 for MPAM MBW max limit behaviors: 0 for supporting both softlimit and hardlimit, etc.
Again this adds another *global* property to the MB resource but then above you
describe the new "MB_HLIM" schemata file entry that implies that it is a new control
for the MB resource. Having it be a new control for the MB resource matches earlier
discussions. To support this I thus expect it to be exposed as a new control with
potentially a new type if any of the existing planned types do not suffice.
>> /sys/fs/resctrl/info/MB/resource_schemata/MB/resolution:100
>
> Is it more concise to s/resource_schemata/schemata/? "resource_" seems redundant in the context "info/MB".
We could do this, yes.
>
>> /sys/fs/resctrl/info/MB/resource_schemata/MB/tolerance:5
>> /sys/fs/resctrl/info/MB/resource_schemata/MB/type:scalar
>> /sys/fs/resctrl/info/MB/resource_schemata/MB/min:10
>> /sys/fs/resctrl/info/MB/resource_schemata/MB/scale:1
>> /sys/fs/resctrl/info/MB/resource_schemata/MB/scope:L3
>> /sys/fs/resctrl/info/MB/resource_schemata/MB/unit:all
>> /sys/fs/resctrl/info/MB/resource_schemata/MB/max:100
>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/resolution:100
>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/tolerance:5
>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/type:scalar
>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/min:10
>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/scale:1
>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/scope:L3
>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/unit:all
>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/max:100
>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/resolution:100
>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/tolerance:5
>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/type:scalar
>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/min:10
>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/scale:1
>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/scope:L3
>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/unit:all
>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/max:100
>> /sys/fs/resctrl/info/MB/bandwidth_gran:10
>
> For MBW monitoring, extend mon_data/ directory to monitor CPU-less memory node. For example,
Here is where I attempted to discuss with you how to support monitoring on these systems:
https://lore.kernel.org/lkml/fb1e2686-237b-4536-acd6-15159abafcba@xxxxxxxxx/
Here again you respond with something completely different without acknowledging
the previous discussion or noting why that does not work for you.
>
> On legacy platforms (i.e. L3 and memory are described in the same MPAM ACPI MSC, which doesn't support CPU-less nodes):
> mon_data/mbm_L3_01/llc_occupancy
> mon_data/mbm_L3_01/mbm_total_bytes
> mon_data/mbm_L3_02/llc_occupancy
> mon_data/mbm_L3_02/mbm_total_bytes
>
> On platforms with L3 and memory in separate MPAM ACPI MSCs
> but there is no CPU-less node:
> mon_data/mbm_L3_01/llc_occupancy <- cache id 1
> mon_data/mbm_L3_02/llc_occupancy <- cache id 2
> mon_data/mbm_MB_00/mbm_total_bytes <- numa node 0 (socket 0)
> mon_data/mbm_MB_01/mbm_total_bytes <- numa node 1 (socket 1)
Here too I do not find it appropriate for resctrl to retroactively
change its interface. "MB" is a resource and the above switches the
resctrl interface to imply the monitoring data of a resource can be
found in the mon directory that matches the resource name. This is
not what resctrl does today. Doing something like above will result in
resctrl having a confusing interface where "sometimes" memory bandwidth
data can be found in the L3 directory and "sometimes" memory bandwidth
data can be found in the MB directory.
As I described to you in December resctrl already exposes the monitoring
data based on the *scope*. As you also point out above, today the memory
bandwidth monitoring data at L3 scope can be found in the L3 directory.
"L3" should thus not be interpreted as the resource L3 but the scope L3
since it contains MBM data today. When viewing it as such resctrl could
internally be more explicit and separate monitoring scope from monitoring
resource and present the monitoring data based on scope to remain intuitively
backward compatible while obtaining support for these memory nodes.
>
> On platforms with L3 and memory in separate MPAM ACPI MSCs
> and there are CPU-less nodes:
> mon_data/mbm_L3_01/llc_occupancy <- cache id 1
> mon_data/mbm_L3_02/llc_occupancy <- cache id 2
> mon_data/mbm_MB_00/mbm_total_bytes <- numa node 0 (socket 0)
> mon_data/mbm_MB_01/mbm_total_bytes <- numa node 1 (socket 1)
> mon_data/mbm_MB_02/mbm_total_bytes <- numa node 2 (GPU 0 mem)
> mon_data/mbm_MB_10/mbm_total_bytes <- numa node 10 (GPU 1 mem)
> mon_data/mbm_MB_18/mbm_total_bytes <- numa node 18 (GPU 2 mem)
> mon_data/mbm_MB_26/mbm_total_bytes <- numa node 26 (GPU 3 mem)
To be backward compatible I find it more intuitive if instead
this data is exposed as below:
mon_data/mbm_L3_01/llc_occupancy <- cache id 1
mon_data/mbm_L3_02/llc_occupancy <- cache id 2
mon_data/mbm_NODE_00/mbm_total_bytes <- numa node 0 (socket 0)
mon_data/mbm_NODE_01/mbm_total_bytes <- numa node 1 (socket 1)
mon_data/mbm_NODE_02/mbm_total_bytes <- numa node 2 (GPU 0 mem)
mon_data/mbm_NODE_10/mbm_total_bytes <- numa node 10 (GPU 1 mem)
mon_data/mbm_NODE_18/mbm_total_bytes <- numa node 18 (GPU 2 mem)
mon_data/mbm_NODE_26/mbm_total_bytes <- numa node 26 (GPU 3 mem)
When "mbm_total_bytes" moves from mon_data/mbm_L3_x to
mon_data/mbm_NODE_x it clearly indicates that it is memory bandwidth
monitoring data moving from "L3" scope to "NODE" scope.
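A minimal sketch of that naming rule, assuming the directory name is derived only from the monitoring scope and the domain ID as in the listing above. The helper name is hypothetical and this is user space C for illustration, not resctrl code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the monitoring directory name for a domain
 * from its scope and domain ID, following the naming used in the
 * examples above (for instance "mbm_NODE_02").
 * Returns 0 on success, -1 if buf is too small.
 */
int mon_dir_name(const char *scope, unsigned int id, char *buf, size_t len)
{
	int n = snprintf(buf, len, "mbm_%s_%02u", scope, id);

	/* snprintf() reports truncation by returning >= len */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resource never appears in the name, so the same event file can live under a different scope directory on different platforms without the existing L3 layout changing.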
Reinette