Re: [RFC] mpam,x86,fs/resctrl: Generic schema description Proof of Concept
From: Fenghua Yu
Date: Wed Jun 24 2026 - 21:27:38 EST
Hi, Reinette,
On 6/24/26 15:22, Reinette Chatre wrote:
Hi Fenghua,
On 6/24/26 12:08 PM, Fenghua Yu wrote:
Hi, Reinette, Ben, Shaopen, et al,
On 5/29/26 11:06, Reinette Chatre wrote:
As Shaopen and Ben mentioned earlier, we are working on two MPAM
features that may need to change schemata interface. The CPU-less
feature was discussed on LPC (although the interfaces will be
slightly different from the LPC).
I know. Here is where I tried to engage with you on needed interfaces after LPC:
https://lore.kernel.org/lkml/fb1e2686-237b-4536-acd6-15159abafcba@xxxxxxxxx/
In MPAM ACPI, an MSC (Memory System Component) is described in one of two ways (not both) on a given platform:
1. L3 and memory together on each processor MSC
2. L3 in processor MSC and memory control/monitoring in different memory MSCs.
On type 1 platform, schemata is legacy:
MB:1=100;2=100 <-- cache id 1 and 2 as domain id
On type 2 platform, I will not reuse "MB:" name. Instead, define new resource name "MBN:" for numa node and schemata is:
MBN:0=100;1=100;2=100;10=100;18=100;26=100 <-- numa id 0, 1, 2, 10, 18, 26 as domain id
On type 2 platform, there won't be an "MB:" line. Numa nodes 0 and 1
are for memory bandwidth allocation on sockets 0 and 1. Nodes 2, 10, 18 and 26
are for allocation on GPU memory nodes.
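To make the proposed line format concrete, here is a minimal sketch of how user space might split such a line into a resource name and per-domain values. The function name parse_schemata_line is hypothetical and not part of resctrl; values are kept as strings because other resources (e.g. "L3:") use hex bitmasks.

```python
def parse_schemata_line(line):
    """Split 'NAME:id=val;id=val' into (name, {domain_id: value_string})."""
    name, _, body = line.partition(":")
    domains = {}
    for entry in body.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing ';'
        dom, _, val = entry.partition("=")
        domains[int(dom)] = val.strip()
    return name.strip(), domains


# Hypothetical type 2 platform line taken from the example above.
name, domains = parse_schemata_line("MBN:0=100;1=100;2=100;10=100;18=100;26=100")
print(name, sorted(domains))
```

With "MBN:" the integer keys would be numa node ids, while with legacy "MB:" they would be L3 cache ids. The parser itself cannot tell the difference, which is why the scope information matters.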
BTW, Slow MBA (SMBA) is different from MBA Numa (MBN). SMBA still relies on L3 and the domain id in SMBA is still cache id. MBN depends on each memory controller, with numa id as domain id for both CPU and CPU-less memory nodes.
On type 1 platform, there is only MB:
info
└── MB
└── resource_schemata
├── MB
│ ├── max
│ ├── min
│ ├── resolution
│ ├── scale
│ ├── scope <== contains "L3"
│ ├── tolerance
│ ├── type
│ └── unit
On type 2 platform, there is only MBN:
info
└── MBN
└── resource_schemata
├── MBN
│ ├── max
│ ├── min
│ ├── resolution
│ ├── scale
│ ├── scope <== contains "NUMA"
│ ├── tolerance
│ ├── type
│ └── unit
This is different from the "scope" hierarchy discussed in the link. "MB" and "MBN" won't exist on the same platform.
I find it's hard (and not useful) to split "MB" for memory with CPU and "MBN" for CPU-less memory node. It's easier to have either "MB" for legacy memory with CPU or "MBN" for CPU-less memory.
Any thoughts? Does this update make sense?
Hardlimit feature was not discussed yet.
It was considered. See discussion starting at
https://lore.kernel.org/lkml/1c4b6b46-16f9-4887-93f5-e0f5e7f30a6f@xxxxxxxxx/
It's good to discuss this further in this RFC thread before the new features RFCs will be sent out.
ack.
Overall, the new features can fit into this RFC well.
Hi Everybody,
[SNIP]
This series can be used on an x86 system where it will show two new dummy controls
where it is possible to interact with the new controls.
For example:
# cat schemata
MB_MAX:0=100;1=100
MB_MIN:0=100;1=100
MB:0=100;1=100
Some platforms may support CPU-less nodes, which are represented by numa node id. Examples:
1. CXL type 2 memory node which provides CXL memory without CPU and L3 on the node
2. GPU memory node that can be accessed by all CPUs but doesn't have a local CPU and L3 bound to it.
etc.
MPAM can allocate and monitor memory bandwidth on these memory nodes.
Since there is no CPU or L3 on the node, cache id cannot be used in the "MB:" line. Instead, numa ids are used to identify MB allocation and monitoring.
For example, the MB allocation on CPU-less platforms could be:
MB:0=100;1=100;2=100;10=100;18=100;26=100
Where: domain ids 0, 1, 2, etc. are numa node ids shown in the /sys/devices/system/node directory or by numactl.
0: socket 0, node 0, CPUs, memory
1: socket 1, node 1, CPUs, memory
2: GPU 0, node 2, no CPU, memory only
10: GPU 1, node 10, no CPU, memory only
18: GPU 3, node 18, no CPU, memory only
26: GPU 4, node 26, no CPU, memory only
The arch-specific driver (e.g. MPAM) detects CPU-less nodes. If there is any CPU-less node, it uses numa id in "MB:". Otherwise, it falls back to legacy cache id.
We always have to consider backward compatibility and to do so we cannot just retroactively
change what domain ID represents when user space interacts with the "MB" control.
The legacy "MB" control is already defined and its domain ID represents an L3 cache ID. To
support these new devices resctrl would need to expose a new control.
Agree. Add new control "MBN" for Memory Bandwidth Allocation on numa node. See above.
There is another MPAM feature called MBW Max hardlimit which sets
"MB:" allocation as hardlimit (i.e. MBW throttling percentage must
be satisfied) per domain. Adding a new "MB_HLIM:" line in schemata.
It's 1:1 mapped to "MB:" to control hardlimit of MB throttling
percentage on each domain. By default hardlimit is off (0) and can
be turned on to set MBW Max hardlimit on a domain.
ack. This sounds like a new control associated with the MB resource.
This is a boolean control as Dave highlighted in previous discussion so
resctrl would need to know its properties.
See https://lore.kernel.org/lkml/aO0Oazuxt54hQFbx@xxxxxxxxxxxxxxx/
Right. ("MB_HLIM" name may be adjusted accordingly when "MB_MAX" is available.)
For example:
MB_HLIM: 0=0;1=0;2=1;10=0;18=0;26=0
MB:0=100;1=100;2=80;10=100;18=100;26=100
On GPU memory numa node 2: cannot use more than 80% of total max mbw (even if there is still idle mem bandwidth on this node).
MBW allocations on all other domains are soft limited, meaning more MBW than specified can be used if memory bandwidth is idle.
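As an illustration of the 1:1 mapping between "MB_HLIM:" and "MB:", here is a small sketch that pairs the two lines per domain. The helper names are hypothetical, and treating a domain missing from "MB_HLIM:" as soft limited is an assumption based on the "off (0) by default" description.

```python
def parse_line(line):
    """Split 'NAME:id=val;...' into (name, {domain_id: value_string})."""
    name, _, body = line.partition(":")
    domains = {}
    for entry in body.split(";"):
        entry = entry.strip()
        if entry:
            dom, _, val = entry.partition("=")
            domains[int(dom)] = val.strip()
    return name.strip(), domains


def limit_kinds(mb_line, hlim_line):
    """Return {domain_id: (percent, 'hard' or 'soft')} for each MB domain."""
    _, mb = parse_line(mb_line)
    _, hlim = parse_line(hlim_line)
    result = {}
    for dom, pct in mb.items():
        # Assumption: a domain absent from MB_HLIM behaves as 0 (soft limit).
        hard = hlim.get(dom, "0") == "1"
        result[dom] = (int(pct), "hard" if hard else "soft")
    return result


kinds = limit_kinds(
    "MB:0=100;1=100;2=80;10=100;18=100;26=100",
    "MB_HLIM: 0=0;1=0;2=1;10=0;18=0;26=0",
)
print(kinds[2], kinds[0])
```

Only domain 2 comes out as a hard limit at 80%; every other domain stays soft limited.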
ack.
L3:0=fff;1=fff
# echo 'MB_MIN:0=50' > schemata
# cat schemata
MB_MAX:0=100;1=100
MB_MIN:0=50;1=100
MB:0=100;1=100
L3:0=fff;1=fff
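The write above only names domain 0, and domain 1 keeps its previous value. A minimal user-space model of that partial-update behaviour follows; the state dictionary and the helper names are hypothetical and do not reflect how the kernel stores this.

```python
def parse_line(line):
    """Split 'NAME:id=val;...' into (name, {domain_id: value_string})."""
    name, _, body = line.partition(":")
    domains = {}
    for entry in body.split(";"):
        entry = entry.strip()
        if entry:
            dom, _, val = entry.partition("=")
            domains[int(dom)] = val.strip()
    return name.strip(), domains


def apply_write(state, write):
    """Update only the domains named in the write; others stay as they were."""
    name, updates = parse_line(write)
    if name not in state:
        raise KeyError(f"unknown control {name!r}")
    for dom, val in updates.items():
        if dom not in state[name]:
            raise KeyError(f"unknown domain {dom} for {name!r}")
        state[name][dom] = val


def render(state):
    """Format the state the way 'cat schemata' prints it."""
    return "\n".join(
        f"{name}:" + ";".join(f"{d}={v}" for d, v in doms.items())
        for name, doms in state.items()
    )


initial = ["MB_MAX:0=100;1=100", "MB_MIN:0=100;1=100", "MB:0=100;1=100", "L3:0=fff;1=fff"]
state = dict(parse_line(line) for line in initial)
apply_write(state, "MB_MIN:0=50")
print(render(state))
```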
Writing to the dummy control will call a dummy callback that just prints to the
kernel log:
"resctrl: Updata temporary MIN control on domain 0 with user value 50"
Example output of info/MB/:
/sys/fs/resctrl/info/MB/thread_throttle_mode:max
/sys/fs/resctrl/info/MB/num_closids:15
/sys/fs/resctrl/info/MB/delay_linear:1
/sys/fs/resctrl/info/MB/min_bandwidth:10
Add two new MB info RO files:
1. /sys/fs/resctrl/info/MB/domain_id
It shows "numa" for using numa id in "MB:" or "cache" for using legacy cache id.
This proposal introduces a *global* property to the MB *resource*? It does not seem as though
this takes into account *anything* about how resctrl can support new hardware that has been
discussed before, during, or after LPC. You have not participated in these discussions and
now make an orthogonal proposal that does not take into account *any* of the requirements
that we have been struggling with for months.
Why should this proposal be taken seriously? In your absence folks have been trying to
accommodate how these upcoming products can be supported and the "scope" file associated with
a control is intended to communicate to user space how the domain ID should be interpreted.
Why are you proposing something entirely different here without even acknowledging current
approach and explaining why it does not work for you?
So can I change this part to adding the following files in the info directory?
1. For numa memory bw allocation (MBN):
/sys/fs/resctrl/info/MBN/resource_schemata/MBN/
/sys/fs/resctrl/info/MBN/resource_schemata/MBN/resolution:100
/sys/fs/resctrl/info/MBN/resource_schemata/MBN/tolerance:5
/sys/fs/resctrl/info/MBN/resource_schemata/MBN/type:scalar
/sys/fs/resctrl/info/MBN/resource_schemata/MBN/min:10
/sys/fs/resctrl/info/MBN/resource_schemata/MBN/scale:1
/sys/fs/resctrl/info/MBN/resource_schemata/MBN/scope:NUMA
/sys/fs/resctrl/info/MBN/resource_schemata/MBN/unit:all
/sys/fs/resctrl/info/MBN/resource_schemata/MBN/max:100
2. /sys/fs/resctrl/info/MB/max_lim
It shows number 0-3 for MPAM MBW max limit behaviors: 0 for supporting both softlimit and hardlimit, etc.
Again this adds another *global* property to the MB resource but then above you
describe the new "MB_HLIM" schemata file entry that implies that it is a new control
for the MB resource. Having it be a new control for the MB resource matches earlier
discussions. To support this I thus expect it to be exposed as a new control with
potentially a new type if any of the existing planned types do not suffice.
How about adding these MB_HLIM dir and files in info?
/sys/fs/resctrl/info/MB_HLIM/resource_schemata/MB_HLIM/type: boolean
/sys/fs/resctrl/info/MB_HLIM/resource_schemata/MB_HLIM/max_lim: 0
/sys/fs/resctrl/info/MB/resource_schemata/MB/resolution:100
Is it more concise to s/resource_schemata/schemata/? "resource_" seems redundant in the context "info/MB".
We could do this, yes.
/sys/fs/resctrl/info/MB/resource_schemata/MB/tolerance:5
/sys/fs/resctrl/info/MB/resource_schemata/MB/type:scalar
/sys/fs/resctrl/info/MB/resource_schemata/MB/min:10
/sys/fs/resctrl/info/MB/resource_schemata/MB/scale:1
/sys/fs/resctrl/info/MB/resource_schemata/MB/scope:L3
/sys/fs/resctrl/info/MB/resource_schemata/MB/unit:all
/sys/fs/resctrl/info/MB/resource_schemata/MB/max:100
/sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/resolution:100
/sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/tolerance:5
/sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/type:scalar
/sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/min:10
/sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/scale:1
/sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/scope:L3
/sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/unit:all
/sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/max:100
/sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/resolution:100
/sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/tolerance:5
/sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/type:scalar
/sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/min:10
/sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/scale:1
/sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/scope:L3
/sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/unit:all
/sys/fs/resctrl/info/MB/resource_schemata/MB_MAX/max:100
/sys/fs/resctrl/info/MB/bandwidth_gran:10
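To show how user space could consume the per-control properties, here is a sketch that groups "path:value" lines like the ones listed above by (resource, control). It assumes the proposed info/<resource>/resource_schemata/<control>/<property> layout, which is still under discussion; parse_info_listing is a hypothetical name.

```python
def parse_info_listing(lines):
    """Group 'path:value' lines into {(resource, control): {property: value}}."""
    props = {}
    for line in lines:
        path, sep, value = line.strip().partition(":")
        parts = path.strip("/").split("/")
        if not sep or "resource_schemata" not in parts:
            continue  # directory-only lines and legacy files such as bandwidth_gran
        idx = parts.index("resource_schemata")
        if idx < 1 or len(parts) != idx + 3:
            continue  # not of the form <resource>/resource_schemata/<control>/<prop>
        resource, control, prop = parts[idx - 1], parts[idx + 1], parts[idx + 2]
        props.setdefault((resource, control), {})[prop] = value.strip()
    return props


listing = [
    "/sys/fs/resctrl/info/MB/resource_schemata/MB/scope:L3",
    "/sys/fs/resctrl/info/MB/resource_schemata/MB_MIN/min:10",
    "/sys/fs/resctrl/info/MBN/resource_schemata/MBN/scope:NUMA",
    "/sys/fs/resctrl/info/MB_HLIM/resource_schemata/MB_HLIM/type: boolean",
    "/sys/fs/resctrl/info/MB/bandwidth_gran:10",
]
props = parse_info_listing(listing)
print(props[("MBN", "MBN")]["scope"])
```

A tool could then look at the "scope" property of a control to decide whether its domain ids are cache ids or numa node ids.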
For MBW monitoring, extend mon_data/ directory to monitor CPU-less memory node. For example,
Here is where I attempted to discuss with you how to support monitoring on these systems:
https://lore.kernel.org/lkml/fb1e2686-237b-4536-acd6-15159abafcba@xxxxxxxxx/
Here again you respond with something completely different without acknowledging
the previous discussion or noting why that does not work for you.
On legacy platforms (i.e. L3 and memory are described in the same MPAM ACPI MSC, which doesn't support CPU-less nodes):
mon_data/mbm_L3_01/llc_occupancy
mon_data/mbm_L3_01/mbm_total_bytes
mon_data/mbm_L3_02/llc_occupancy
mon_data/mbm_L3_02/mbm_total_bytes
On platforms with L3 and memory in separate MPAM ACPI MSCs
but there is no CPU-less node:
mon_data/mbm_L3_01/llc_occupancy <- cache id 1
mon_data/mbm_L3_02/llc_occupancy <- cache id 2
mon_data/mbm_MB_00/mbm_total_bytes <- numa node 0 (socket 0)
mon_data/mbm_MB_01/mbm_total_bytes <- numa node 1 (socket 1)
Here too I do not find it appropriate for resctrl to retroactively
change its interface. "MB" is a resource and the above switches the
resctrl interface to imply the monitoring data of a resource can be
found in the mon directory that matches the resource name. This is
not what resctrl does today. Doing something like above will result in
resctrl having a confusing interface where "sometimes" memory bandwidth
data can be found in the L3 directory and "sometimes" memory bandwidth
data can be found in the MB directory.
As I described to you in December resctrl already exposes the monitoring
data based on the *scope*. As you also point out above, today the memory
bandwidth monitoring data at L3 scope can be found in the L3 directory.
"L3" should thus not be interpreted as the resource L3 but the scope L3
since it contains MBM data today. When viewing it as such resctrl could
internally be more explicit and separate monitoring scope from monitoring
resource and present the monitoring data based on scope to remain intuitively
backward compatible while obtaining support for these memory nodes.
To be backward compatible I find it more intuitive if instead
On platforms with L3 and memory in separate MPAM ACPI MSCs
and there are CPU-less nodes:
mon_data/mbm_L3_01/llc_occupancy <- cache id 1
mon_data/mbm_L3_02/llc_occupancy <- cache id 2
mon_data/mbm_MB_00/mbm_total_bytes <- numa node 0 (socket 0)
mon_data/mbm_MB_01/mbm_total_bytes <- numa node 1 (socket 1)
mon_data/mbm_MB_02/mbm_total_bytes <- numa node 2 (GPU 0 mem)
mon_data/mbm_MB_10/mbm_total_bytes <- numa node 10 (GPU 1 mem)
mon_data/mbm_MB_18/mbm_total_bytes <- numa node 18 (GPU 2 mem)
mon_data/mbm_MB_26/mbm_total_bytes <- numa node 26 (GPU 3 mem)
this data is exposed as below:
mon_data/mbm_L3_01/llc_occupancy <- cache id 1
mon_data/mbm_L3_02/llc_occupancy <- cache id 2
mon_data/mbm_NODE_00/mbm_total_bytes <- numa node 0 (socket 0)
mon_data/mbm_NODE_01/mbm_total_bytes <- numa node 1 (socket 1)
mon_data/mbm_NODE_02/mbm_total_bytes <- numa node 2 (GPU 0 mem)
mon_data/mbm_NODE_10/mbm_total_bytes <- numa node 10 (GPU 1 mem)
mon_data/mbm_NODE_18/mbm_total_bytes <- numa node 18 (GPU 2 mem)
mon_data/mbm_NODE_26/mbm_total_bytes <- numa node 26 (GPU 3 mem)
When "mbm_total_bytes" move from mon_data/mbm_L3_x to
mon_data/mbm_NODE_x it clearly indicates that it is memory bandwidth
monitoring data moving from "L3" scope to "NODE" scope.
The scope in mon_data shown in https://lore.kernel.org/lkml/fb1e2686-237b-4536-acd6-15159abafcba@xxxxxxxxx/:
mon_data
├── mon_L3_00 <== monitoring data at scope L3
│ ├── llc_occupancy
│ ├── mbm_local_bytes
│ └── mbm_total_bytes
├── mon_L3_01 <== monitoring data at scope L3
│ ├── llc_occupancy
│ ├── mbm_local_bytes
│ └── mbm_total_bytes
├── mon_NODE_00 <== monitoring data at scope NODE
│ └── mbm_total_bytes
└── mon_NODE_01 <== monitoring data at scope NODE
└── mbm_total_bytes
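A rough sketch of how a tool could read the layout above: split each directory name into scope and domain id, then list which events each scope actually provides. The mon_<SCOPE>_<id> pattern is taken from the tree above; the function names and the in-memory tree are hypothetical.

```python
import re

# Directory names in the tree above look like mon_<SCOPE>_<domain id>.
MON_DIR = re.compile(r"^mon_([A-Za-z0-9]+)_(\d+)$")


def parse_mon_dir(name):
    """Return (scope, domain_id) for a name such as 'mon_NODE_01'."""
    m = MON_DIR.match(name)
    if m is None:
        raise ValueError(f"unexpected mon_data entry: {name!r}")
    return m.group(1), int(m.group(2))


def events_by_scope(tree):
    """Map scope -> sorted list of event files seen in any domain of that scope."""
    result = {}
    for dirname, events in tree.items():
        scope, _ = parse_mon_dir(dirname)
        result.setdefault(scope, set()).update(events)
    return {scope: sorted(events) for scope, events in result.items()}


# Hypothetical contents mirroring the tree above.
tree = {
    "mon_L3_00": ["llc_occupancy", "mbm_local_bytes", "mbm_total_bytes"],
    "mon_L3_01": ["llc_occupancy", "mbm_local_bytes", "mbm_total_bytes"],
    "mon_NODE_00": ["mbm_total_bytes"],
    "mon_NODE_01": ["mbm_total_bytes"],
}
print(events_by_scope(tree))
```

In this model the set of events at a scope is discovered from the files that are present, not inferred from the scope name.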
On some ARM platforms, for example, socket 0 (CPUs+L3) and socket 1
(CPUs+L3):
llc_occupancy is monitored through processor/L3.
total_bytes is monitored through the memory controller.
Then the above "scope" is confusing:
mon_L3_00 and mon_L3_01 only have llc_occupancy. The scope "L3" is fine here.
But mon_NODE_00 and mon_NODE_01 are confusing because numa nodes 0 and 1 have both L3 and memory. The name "mon_NODE_00" suggests that it monitors both llc_occupancy and total_bytes, but it only monitors total_bytes. On CPU-less platforms, a CPU-less node only has total_bytes, which does seem to match "scope NODE".
With "scope NODE", can the user tell if it has llc_occupancy or total_bytes or both?
If "mon_MB_01" is changed to "mon_MM_01", the scope is now "MM", which means monitoring memory (total_bytes) in numa node 1 with "scope MM" (not L3)?
Thanks.
-Fenghua