Re: [RFC] mpam,x86,fs/resctrl: Generic schema description Proof of Concept
From: Reinette Chatre
Date: Thu Jun 25 2026 - 11:44:13 EST
Hi Fenghua,
On 6/24/26 6:26 PM, Fenghua Yu wrote:
> Hi, Reinette,
>
> On 6/24/26 15:22, Reinette Chatre wrote:
>> Hi Fenghua,
>>
>> On 6/24/26 12:08 PM, Fenghua Yu wrote:
>>> Hi, Reinette, Ben, Shaopen, et al,
>>>
>>> On 5/29/26 11:06, Reinette Chatre wrote:
>>>
>>> As Shaopen and Ben mentioned earlier, we are working on two MPAM
>>> features that may need to change schemata interface. The CPU-less
>>> feature was discussed on LPC (although the interfaces will be
>>> slightly different from the LPC).
>>
>> I know. Here is where I tried to engage with you on needed interfaces after LPC:
>> https://lore.kernel.org/lkml/fb1e2686-237b-4536-acd6-15159abafcba@xxxxxxxxx/
>
> MPAM ACPI defines an MSC (Memory System Component) in one of two ways (not both) on one platform:
> 1. L3 and memory together on each processor MSC
> 2. L3 in processor MSC and memory control/monitoring in different memory MSCs.
>
> On type 1 platform, schemata is legacy:
> MB:1=100;2=100 <-- cache id 1 and 2 as domain id
>
> On type 2 platform, I will not reuse "MB:" name. Instead, define new resource name "MBN:" for numa node and schemata is:
> MBN:0=100;1=100;2=100;10=100;18=100;26=100 <-- numa id 0, 1, 2, 10, 18,
> 26 as domain id
> On type 2 platform, there won't be "MB:" line. Numa 0 and 1
> are for mbm allocation on socket 0 and 1. 2,10, 18 and 26 are for GPU
> memory nodes allocation.
(For clarity I will refer to what you call "MBN" as "MB_NODE" to make it
explicit that it is memory bandwidth allocation at node scope.)
I am trying to consider how this can be accomplished while also considering all the other
new hardware features that resctrl needs to support. Consider, for example, AMD's "Global
MBA" (https://lore.kernel.org/lkml/cover.1776980182.git.babu.moger@xxxxxxx/) that throttles
memory bandwidth at L3 scope but the user configures allocations at NODE scope. At this time
the plan is to support this with a second control associated with the MB resource that can
allocate memory bandwidth at node scope. See
https://lore.kernel.org/lkml/430ffb48-29f4-44d9-9164-9f8b743b2739@xxxxxxx/
If resctrl creates a new resource for node scoped memory bandwidth allocations to support these
"type 2" systems then that will result in an inconsistent interface between architectures that
we should avoid.
Have you been listening in on the discussions surrounding emulated controls? Considering that,
would it be possible to support the "MB" control on a "type 2" system but have it be backed by
(emulated by) the underlying "MB_NODE" control?
resctrl could expose both controls on these "type 2" systems but make it clear that "MB"
is emulated by "MB_NODE". For example:
info/
└── MB/
    └── resource_schemata/
        └── MB/
            └── MB_NODE/
The user will see both controls in the schemata file, and a change made to the "MB" control
will show in the "MB_NODE" control and vice versa. The user could also disable the "MB" control,
which signals familiarity with the interface, at which point resctrl can drop the
"MB" control from the schemata file on these "type 2" systems.
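As a rough user-space illustration of the mirroring described above (not kernel code), the
sketch below keeps one set of per-domain values and accepts writes under either control name.
The class name MirroredControls and the simplification that both controls share the same
domain IDs are assumptions made only for this example.

```python
class MirroredControls:
    """Hypothetical model: an emulated control backed by a native control."""

    def __init__(self, native, emulated, domains, default=100):
        self.native = native        # e.g. "MB_NODE"
        self.emulated = emulated    # e.g. "MB", backed by the native control
        # Single backing store, so both names always report the same values.
        self.values = {dom: default for dom in domains}

    def write(self, line):
        """Apply a schemata-style line such as 'MB:0=50;1=80'."""
        name, _, settings = line.partition(":")
        if name not in (self.native, self.emulated):
            raise ValueError(f"unknown control {name!r}")
        for item in settings.split(";"):
            dom, _, val = item.partition("=")
            dom = int(dom)
            if dom not in self.values:
                raise KeyError(f"unknown domain {dom}")
            self.values[dom] = int(val)

    def read(self):
        """Return one schemata-style line per control name."""
        body = ";".join(f"{d}={v}" for d, v in sorted(self.values.items()))
        return [f"{self.emulated}:{body}", f"{self.native}:{body}"]


ctl = MirroredControls(native="MB_NODE", emulated="MB", domains=[0, 1])
ctl.write("MB:0=50")            # write through the emulated control
print("\n".join(ctl.read()))    # both lines now show domain 0 at 50
```

A real implementation would also need to translate between the domain IDs of the two
controls, which this sketch deliberately leaves out.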
Having the MB resource available with an MB control will keep resctrl backward compatible
if there are any tools that expect that. If backward compatibility is not a concern then
resctrl could initialize with the emulated control disabled by default. See discussion at
https://lore.kernel.org/lkml/5e575bc2-e67f-4696-9332-33c54023c057@xxxxxxxxx/
that describes a new resctrl capability in support of RISC-V and RDT.
With this resctrl could initialize with:
info/
└── MB/
    └── resource_schemata/
        ├── MB/
        │   ├── MB_NODE/
        │   │   └── status:enabled
        │   └── status:disabled
        └── mode:legacy [native]
With the above a "type 2" system will boot with its schemata file just containing the "MB_NODE"
control while info/MB describes the memory bandwidth resource.
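To illustrate the effect of the per-control status on the schemata file, here is a small
hypothetical sketch. The function name schemata_lines and the tuple layout are assumptions
for this example; only the control names and the "enabled"/"disabled" status values come
from the tree above.

```python
def schemata_lines(controls):
    """Render schemata lines for enabled controls only.

    controls: list of (name, status, {domain_id: value}) tuples.
    """
    lines = []
    for name, status, values in controls:
        if status != "enabled":
            continue  # a disabled control does not appear in schemata
        body = ";".join(f"{dom}={val}" for dom, val in sorted(values.items()))
        lines.append(f"{name}:{body}")
    return lines


# Assumed boot state of a "type 2" system: emulated "MB" disabled,
# native "MB_NODE" enabled.
boot_controls = [
    ("MB", "disabled", {0: 100, 1: 100}),
    ("MB_NODE", "enabled", {0: 100, 1: 100, 2: 100}),
]
rendered = schemata_lines(boot_controls)
print("\n".join(rendered))
```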
If AMD is ok with naming their "Global MBA" allocations "MB_NODE" then a user working with
AMD and these MPAM systems could find the controls in the same hierarchy and need not
use external knowledge to determine how to interact with resctrl fs.
> BTW, Slow MBA (SMBA) is different from MBA Numa (MBN). SMBA still
> relies on L3 and the domain id in SMBA is still cache id. MBN
> depends on each memory controller with numa id as domain id for both
> CPU and CPU-less memory nodes.
ack. Similar to AMD's "Global MBA" there is also a new feature of "Global SMBA" that
needs to be supported by resctrl.
> On type 1 platform, there is only MB:
>
> info
> └── MB
>     └── resource_schemata
>         ├── MB
>         │   ├── max
>         │   ├── min
>         │   ├── resolution
>         │   ├── scale
>         │   ├── scope <== contains "L3"
>         │   ├── tolerance
>         │   ├── type
>         │   └── unit
>
> On type 2 platform, there is only MBN:
> info
> └── MBN
>     └── resource_schemata
>         ├── MBN
>         │   ├── max
>         │   ├── min
>         │   ├── resolution
>         │   ├── scale
>         │   ├── scope <== contains "NUMA"
>         │   ├── tolerance
>         │   ├── type
>         │   └── unit
>
> This is different from the "scope" hierarchy discussed in the link. "MB" and "MBN" won't exist on the same platform.
>
> I find it's hard (and not useful) to split "MB" for memory with CPU
> and "MBN" for CPU-less memory node. It's easier to have either "MB"
> for legacy memory with CPU or "MBN" for CPU-less memory.
Please widen your considerations to include how resctrl can maintain backward compatibility
and how enabling of these platforms can fit well with the "similar but not identical" hardware
features from other architectures that also need to be supported by resctrl.
> Any thoughts? Does this update make sense?
>
>>
>>> Hardlimit feature was not discussed yet.
>>
>> It was considered. See discussion starting at
>> https://lore.kernel.org/lkml/1c4b6b46-16f9-4887-93f5-e0f5e7f30a6f@xxxxxxxxx/
>>
>>> It's good to discuss this further in this RFC thread before the new features RFCs will be sent out.
>>
>> ack.
>>
>>>
>>> Overall, the new features can fit into this RFC well.
>>>
>>>> Hi Everybody,
>>>>
>
> [SNIP]
>
>>>>
>>>> This series can be used on an x86 system where it will show two new dummy controls
>>>> where it is possible to interact with the new controls.
>>>> For example:
>>>>
>>>> # cat schemata
>>>> MB_MAX:0=100;1=100
>>>> MB_MIN:0=100;1=100
>>>> MB:0=100;1=100
>>>
>>> Some platforms may support CPU-less node which is represented by numa node id, examples:
>>> 1. CXL type 2 memory node which provides CXL memory without CPU and L3 on the node
>>> 2. GPU memory node that can be accessed by all CPUs but doesn't have a local CPU and L3 bound to.
>>> etc.
>>>
>>> MPAM can allocate and monitor mem bandwidth on these memory node.
>>> Since no CPU and L3 on the node, cache id cannot be used in "MB:" line. Instead, numa ids are used to identify MB allocation and monitoring.
>>>
>>> For example, the MB allocation on CPU-less platforms could be:
>>> MB:0=100;1=100;2=100;10=100;18=100;26=100
>>>
>>> Where: domain id 0, 1, 2, etc are numa node id shown in /sys/devices/system/node directory or by numactl.
>>> 0: socket 0, node 0, CPUs, memory
>>> 1: socket 1, node 1, CPUs, memory
>>> 2: GPU 0, node 2, no CPU, memory only
>>> 10: GPU 1, node 10, no CPU, memory only
>>> 18: GPU 3, node 18, no CPU, memory only
>>> 26: GPU 4, node 26, no CPU, memory only
>>>
>>> Arch specific driver (e.g. MPAM) detects CPU-less node. If there is any CPU-less node, use numa id in "MB:". Otherwise, fallback to legacy cache id.
>>
>> We always have to consider backward compatibility and to do so we cannot just retroactively
>> change what domain ID represents when user space interacts with the "MB" control.
>>
>> The legacy "MB" control is already defined and its domain ID represents an L3 cache ID. To
>> support these new devices resctrl would need to expose a new control.
>>
>
> Agree. Add new control "MBN" for Memory Bandwidth Allocation on numa node. See above.
>
>>>
>>
>>> There is another MPAM feature called MBW Max hardlimit which sets
>>> "MB:" allocation as hardlimit (i.e. MBW throttling percentage must
>>> be satisfied) per domain. Adding a new "MB_HLIM:" line in schemata.
>>> It's 1:1 mapped to "MB:" to control hardlimit of MB throttling
>>> percentage on each domain. By default hardlimit is off (0) and can
>>> be turned on to set MBW Max hardlimit on a domain.
>>
>> ack. This sounds like a new control associated with the MB resource.
>> This is a boolean control as Dave highlighted in previous discussion so
>> resctrl would need to know its properties.
>> See https://lore.kernel.org/lkml/aO0Oazuxt54hQFbx@xxxxxxxxxxxxxxx/
>>
>
> Right. ("MB_HLIM" name may be adjusted accordingly when "MB_MAX" is available.)
>
>>> For example:
>>> MB_HLIM: 0=0;1=0;2=1;10=0;18=0;26=0
>>> MB:0=100;1=100;2=80;10=100;18=100;26=100
>>>
>>> On GPU memory numa node 2: cannot use more than 80% of total max mbw (even if there is still idle mem bandwidth on this node).
>>>
>>> MBW allocations on all other domains are soft limited, meaning MBW can be used more than specified if mem is idle.
>>>
>>
>> ack.
>>
>>>> L3:0=fff;1=fff
>>>> # echo 'MB_MIN:0=50' > schemata
>>>> # cat schemata
>>>> MB_MAX:0=100;1=100
>>>> MB_MIN:0=50;1=100
>>>> MB:0=100;1=100
>>>> L3:0=fff;1=fff
>>>>
>>>> Writing to the dummy control will call a dummy callback that just prints to the
>>>> kernel log:
>>>> "resctrl: Updata temporary MIN control on domain 0 with user value 50"
>>>>
>>>>
>>>> Example output of info/MB/:
>>>> /sys/fs/resctrl/info/MB/thread_throttle_mode:max
>>>> /sys/fs/resctrl/info/MB/num_closids:15
>>>> /sys/fs/resctrl/info/MB/delay_linear:1
>>>> /sys/fs/resctrl/info/MB/min_bandwidth:10
>>>
>>> Add two new MB info RO files:
>>> 1. /sys/fs/resctrl/info/MB/domain_id
>>> It shows "numa" for using numa id in "MB:" or "cache" for using legacy cache id.
>>
>> This proposal introduces a *global* property to the MB *resource*? It does not seem as though
>> this takes into account *anything* about how resctrl can support new hardware that has been
>> discussed before, during, or after LPC. You have not participated in these discussions and
>> now make an orthogonal proposal that does not take into account *any* of the requirements
>> that we have been struggling with for months.
>>
>> Why should this proposal be taken seriously? In your absence folks have been trying to
>> accommodate how these upcoming products can be supported and the "scope" file associated with
>> a control is intended to communicate to user space how the domain ID should be interpreted.
>>
>> Why are you proposing something entirely different here without even acknowledging current
>> approach and explaining why it does not work for you?
>>
>
> So can I change this part to adding the following files in info directory?
>
> 1. For numa memory bw allocation (MBN):
> /sys/fs/resctrl/info/MBN/resource_schemata/MBN/
> /sys/fs/resctrl/info/MBN/resource_schemata/MBN/resolution:100
> /sys/fs/resctrl/info/MBN/resource_schemata/MBN/tolerance:5
> /sys/fs/resctrl/info/MBN/resource_schemata/MBN/type:scalar
> /sys/fs/resctrl/info/MBN/resource_schemata/MBN/min:10
> /sys/fs/resctrl/info/MBN/resource_schemata/MBN/scale:1
> /sys/fs/resctrl/info/MBN/resource_schemata/MBN/scope:NUMA
> /sys/fs/resctrl/info/MBN/resource_schemata/MBN/unit:all
> /sys/fs/resctrl/info/MBN/resource_schemata/MBN/max:100
This is not just about adding files to the info directory. The files, directories, their relationships,
and content have meaning. All I see from these proposals is an attempt to slap some new files into
resctrl without any consideration for presenting a consistent interface to users and without consideration of
other architectures that need to be supported by resctrl.
resctrl needs to provide a generic and consistent interface to user space irrespective of the
underlying architecture. Architectures cannot just slap some new files for their convenience.
>
>>> 2. /sys/fs/resctrl/info/MB/max_lim
>>> It shows number 0-3 for MPAM MBW max limit behaviors: 0 for supporting both softlimit and hardlimit, etc.
>>
>> Again this adds another *global* property to the MB resource but then above you
>> describe the new "MB_HLIM" schemata file entry that implies that it is a new control
>> for the MB resource. Having it be a new control for the MB resource matches earlier
>> discussions. To support this I thus expect it to be exposed as a new control with
>> potentially a new type if any of the existing planned types do not suffice.
>>
>
> How about adding these MB_HLIM dir and files in info?
>
> /sys/fs/resctrl/info/MB_HLIM/resource_schemata/MB_HLIM/type: boolean
> /sys/fs/resctrl/info/MB_HLIM/resource_schemata/MB_HLIM/max_lim: 0
This presents "MB_HLIM" as a *resource* to user space. It is not a resource
but a *control* of a resource, no? I thus expect it to instead look something like
below that makes it clear that MB_HARDMAX is a control of the MB resource.
info
└── MB
    └── resource_schemata
        ├── MB
        └── MB_HARDMAX
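To make the resource/control distinction concrete, the sketch below shows how user space
could discover which controls belong to which resource purely from the directory hierarchy.
It builds a throwaway mock of the layout above in a temporary directory; the helper names
controls_by_resource and build_mock_info are hypothetical, and a real tool would point at
/sys/fs/resctrl/info instead.

```python
import os
import tempfile


def controls_by_resource(info_dir):
    """Map each resource under info_dir to the controls it exposes."""
    result = {}
    for resource in sorted(os.listdir(info_dir)):
        schemata_dir = os.path.join(info_dir, resource, "resource_schemata")
        if not os.path.isdir(schemata_dir):
            continue  # entry without a schema description
        result[resource] = sorted(
            entry for entry in os.listdir(schemata_dir)
            if os.path.isdir(os.path.join(schemata_dir, entry))
        )
    return result


def build_mock_info(root):
    """Create info/MB/resource_schemata/{MB,MB_HARDMAX} for illustration."""
    for control in ("MB", "MB_HARDMAX"):
        os.makedirs(os.path.join(root, "MB", "resource_schemata", control))


with tempfile.TemporaryDirectory() as root:
    build_mock_info(root)
    layout = controls_by_resource(root)

print(layout)
```

With this layout a tool never has to guess whether a schemata line names a resource or a
control: the parent directory answers that.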
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB/resolution:100
>>>
>>> Is it more concise to s/resource_schemata/schemata/? "resource_" seems redundant in the context "info/MB".
>>
>> We could do this, yes.
>>
>>>
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB/tolerance:5
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB/type:scalar
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB/min:10
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB/scale:1
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB/scope:L3
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB/unit:all
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB/max:100
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/resolution:100
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/tolerance:5
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/type:scalar
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/min:10
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/scale:1
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/scope:L3
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/unit:all
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/max:100
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/resolution:100
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/tolerance:5
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/type:scalar
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/min:10
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/scale:1
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/scope:L3
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/unit:all
>>>> /sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/max:100
>>>> /sys/fs/resctrl/info/MB/bandwidth_gran:10
>>>
>>> For MBW monitoring, extend mon_data/ directory to monitor CPU-less memory node. For example,
>>
>> Here is where I attempted to discuss with you how to support monitoring on these systems:
>> https://lore.kernel.org/lkml/fb1e2686-237b-4536-acd6-15159abafcba@xxxxxxxxx/
>>
>> Here again you respond with something completely different without acknowledging
>> the previous discussion or noting why that does not work for you.
>>
>>>
>>> On legacy platforms (i.e. L3 and memory are described in same MPAM ACPI MSC which doesn't support CPU-less nodes):
>>> mon_data/mbm_L3_01/llc_occupancy
>>> mon_data/mbm_L3_01/mbm_total_bytes
>>> mon_data/mbm_L3_02/llc_occupancy
>>> mon_data/mbm_L3_02/mbm_total_bytes
>>>
>>> On platforms with L3 and memory in separate MPAM ACPI MSCs
>>> but there is no CPU-less node:
>>> mon_data/mbm_L3_01/llc_occupancy <- cache id 1
>>> mon_data/mbm_L3_02/llc_occupancy <- cache id 2
>>> mon_data/mbm_MB_00/mbm_total_bytes <- numa node 0 (socket 0)
>>> mon_data/mbm_MB_01/mbm_total_bytes <- numa node 1 (socket 1)
>>
>> Here too I do not find it appropriate for resctrl to retroactively
>> change its interface. "MB" is a resource and the above switches the
>> resctrl interface to imply the monitoring data of a resource can be
>> found in the mon directory that matches the resource name. This is
>> not what resctrl does today. Doing something like above will result in
>> resctrl having a confusing interface where "sometimes" memory bandwidth
>> data can be found in the L3 directory and "sometimes" memory bandwidth
>> data can be found in the MB directory.
>>
>> As I described to you in December resctrl already exposes the monitoring
>> data based on the *scope*. As you also point out above, today the memory
>> bandwidth monitoring data at L3 scope can be found in the L3 directory.
>> "L3" should thus not be interpreted as the resource L3 but the scope L3
>> since it contains MBM data today. When viewing it as such resctrl could
>> internally be more explicit and separate monitoring scope from monitoring
>> resource and present the monitoring data based on scope to remain intuitively
>> backward compatible while obtaining support for these memory nodes.
>>
>>>
>>> On platforms with L3 and memory in separate MPAM ACPI MSCs
>>> and there are CPU-less nodes:
>>> mon_data/mbm_L3_01/llc_occupancy <- cache id 1
>>> mon_data/mbm_L3_02/llc_occupancy <- cache id 2
>>> mon_data/mbm_MB_00/mbm_total_bytes <- numa node 0 (socket 0)
>>> mon_data/mbm_MB_01/mbm_total_bytes <- numa node 1 (socket 1)
>>> mon_data/mbm_MB_02/mbm_total_bytes <- numa node 2 (GPU 0 mem)
>>> mon_data/mbm_MB_10/mbm_total_bytes <- numa node 10 (GPU 1 mem)
>>> mon_data/mbm_MB_18/mbm_total_bytes <- numa node 18 (GPU 2 mem)
>>> mon_data/mbm_MB_26/mbm_total_bytes <- numa node 26 (GPU 3 mem)
>> To be backward compatible I find it more intuitive if instead
>> this data is exposed as below:
>>
>> mon_data/mbm_L3_01/llc_occupancy <- cache id 1
>> mon_data/mbm_L3_02/llc_occupancy <- cache id 2
>> mon_data/mbm_NODE_00/mbm_total_bytes <- numa node 0 (socket 0)
>> mon_data/mbm_NODE_01/mbm_total_bytes <- numa node 1 (socket 1)
>> mon_data/mbm_NODE_02/mbm_total_bytes <- numa node 2 (GPU 0 mem)
>> mon_data/mbm_NODE_10/mbm_total_bytes <- numa node 10 (GPU 1 mem)
>> mon_data/mbm_NODE_18/mbm_total_bytes <- numa node 18 (GPU 2 mem)
>> mon_data/mbm_NODE_26/mbm_total_bytes <- numa node 26 (GPU 3 mem)
>>
>> When "mbm_total_bytes" move from mon_data/mbm_L3_x to
>> mon_data/mbm_NODE_x it clearly indicates that it is memory bandwidth
>> monitoring data moving from "L3" scope to "NODE" scope.
>
> The scope in mon_data shown in https://lore.kernel.org/lkml/fb1e2686-237b-4536-acd6-15159abafcba@xxxxxxxxx/:
>
> mon_data
> ├── mon_L3_00 <== monitoring data at scope L3
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_L3_01 <== monitoring data at scope L3
> │   ├── llc_occupancy
> │   ├── mbm_local_bytes
> │   └── mbm_total_bytes
> ├── mon_NODE_00 <== monitoring data at scope NODE
> │   └── mbm_total_bytes
> └── mon_NODE_01 <== monitoring data at scope NODE
>     └── mbm_total_bytes
>
> On some ARM platforms, for example, socket 0 (CPUs+L3) and socket 1
> (CPUs+L3):
> llc_occupancy is monitored through processor/L3.
> total_bytes are monitored through memory controller.
>
> Then the above "scope" is confused:
> mon_L3_00 and mon_L3_01 only has llc_occupancy. The scope "L3" is fine here.
> But mon_NODE_00 and 01 are confused because numa node 00 and 01 have
> both L3 and memory. The name "mon_NODE_00" seems monitor both
While the node may have L3 and memory the directory only represents what is
monitored at the particular scope.
> llc_occupancy and total_bytes but it only monitors total_bytes. And on
> CPU-less platforms, a CPU-less node does only have total_bytes which
> seems match "scope NODE".
> With "scope NODE", can user tell if it has llc_occupancy or total_bytes or both?
resctrl has a "mon_features" file associated with the monitoring scope that
informs user space which events can be expected in a resource group's monitoring
data directories.
>
> If change "mon_MB_01" to "mon_MM_01", the scope is "MM" now which
> means monitoring memory (total_bytes) in numa node 1 with "scope MM"
> (not L3)?
This sounds redundant to me since the event names already have the resource embedded.
"llc_occupancy" implies L3 occupancy
"mbm_local_bytes" ... the "mbm" implies this is memory bandwidth monitoring data.
User space can infer the scope at which the monitoring data is collected from the
directory the event file is in.
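As a minimal sketch of that inference, assuming the "mon_<SCOPE>_<domain id>" directory
naming used in the tree quoted above, user space could split a monitoring path into scope,
domain ID, and event as follows. The helper name parse_mon_path and the regular expression
are illustrative assumptions, not an existing interface.

```python
import re

# Assumed naming: mon_<SCOPE>_<domain id>, e.g. mon_L3_01 or mon_NODE_02.
MON_DIR = re.compile(r"^mon_(?P<scope>[A-Za-z0-9]+)_(?P<domain>\d+)$")


def parse_mon_path(path):
    """Return (scope, domain_id, event) for a path like
    'mon_data/mon_NODE_02/mbm_total_bytes'."""
    parts = path.strip("/").split("/")
    if len(parts) < 2:
        raise ValueError(f"path too short: {path!r}")
    dirname, event = parts[-2], parts[-1]
    match = MON_DIR.match(dirname)
    if match is None:
        raise ValueError(f"unexpected monitoring directory: {dirname!r}")
    return match.group("scope"), int(match.group("domain")), event


for p in ("mon_data/mon_L3_01/llc_occupancy",
          "mon_data/mon_NODE_02/mbm_total_bytes"):
    scope, domain, event = parse_mon_path(p)
    print(f"{event}: scope={scope} domain={domain}")
```

The event name carries the resource ("llc", "mbm") and the directory carries the scope, so
no additional naming is needed to tell them apart.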
Reinette