Re: [RFC] fs/resctrl: Generic schema description
From: Chen, Yu C
Date: Fri Dec 26 2025 - 05:39:17 EST
Hi Reinette and all,
On 12/17/2025 6:26 AM, Reinette Chatre wrote:
Hi Babu and Fenghua,
Could you please consider how the new AMD and MPAM features [2] may benefit
from the new interfaces proposed here? More below ...
On 10/24/25 4:12 AM, Dave Martin wrote:
[snip]
One thing I was pondering is that resctrl currently uses L3 interchangeably
as a scope and a resource but if instead that is separated then it should be
easier to support interactions with resource at a different scope.
I am concerned that, for example, support for Global Memory Bandwidth Allocation
(GMBA) is planned to be done with a new resource. resctrl already has a
"memory bandwidth allocation" resource and introducing a new resource to essentially
manage the same resource, but at a different scope, sounds like a risk of fragmentation
and duplication to me.
What if the "resource control" instead gains a new property, for example, "scope" that
essentially communicates to user space what a domain ID in the schemata file means.
It is not clear to me what a "domain ID" of GMBA means so I will use the MPAM CPU-less
MBM as example that I expect will build on SMBA that supports CXL.mem. Consider, an interface
like below:
info
└── SMBA
    └── resource_schemata
        ├── SMBA
        │   ├── max
        │   ├── min
        │   ├── resolution
        │   ├── scale
        │   ├── scope <== contains "L3"
        │   ├── tolerance
        │   ├── type
        │   └── unit
        └── SMBA_NODE
            ├── max
            ├── min
            ├── resolution
            ├── scale
            ├── scope <== contains "NODE"
Would it be more user-friendly to explicitly show "node0, node1, ..."
rather than "NODE"? After all, we can already infer the "NODE" type from
the schemata name "SMBA_NODE".
            ├── tolerance
            ├── type
            └── unit
With an interface like above there is a single resource and allocating it at a different
scope is just another control. This correlates to how other parts of resctrl are managed.
For example, it can become explicit that the monitor groups' mon_data directory contains
sub-directories organized by scope. For example:
mon_data
├── mon_L3_00 <== monitoring data at scope L3
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
├── mon_L3_01 <== monitoring data at scope L3
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
├── mon_NODE_00 <== monitoring data at scope NODE
Does this mean the domain ID is "0", which corresponds to node0?
This seems to align with Fenghua's presentation at LPC,
where he mentioned that for CPU-less resctrl, the domain ID changes
from an L3 ID to a node ID.
│   └── mbm_total_bytes
└── mon_NODE_01 <== monitoring data at scope NODE
    └── mbm_total_bytes
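To illustrate how user space might consume such a layout, here is a minimal
sketch that splits the directory names into a scope and a domain ID. It
assumes the names follow the "mon_<SCOPE>_<ID>" pattern shown in the tree
above; parse_mon_dir() and group_by_scope() are hypothetical helpers, not
part of any existing tool.

```python
import re

# Assumption: directory names look like "mon_<SCOPE>_<ID>", as in the
# example tree (mon_L3_00, mon_NODE_01). Other naming schemes are rejected.
_MON_DIR = re.compile(r"^mon_(?P<scope>[A-Za-z0-9]+)_(?P<domain>\d+)$")


def parse_mon_dir(name):
    """Return (scope, domain_id) for a mon_data sub-directory name."""
    match = _MON_DIR.match(name)
    if match is None:
        raise ValueError(f"unexpected mon_data entry: {name!r}")
    return match.group("scope"), int(match.group("domain"))


def group_by_scope(names):
    """Group domain IDs by the scope encoded in each directory name."""
    grouped = {}
    for name in names:
        scope, domain = parse_mon_dir(name)
        grouped.setdefault(scope, []).append(domain)
    return grouped


if __name__ == "__main__":
    entries = ["mon_L3_00", "mon_L3_01", "mon_NODE_00", "mon_NODE_01"]
    print(group_by_scope(entries))
```

With the four entries from the example, the L3 and NODE domains end up in
separate lists, so a tool never has to guess what a bare domain ID refers to.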
Please let me take this chance to elaborate on region-aware RDT
in more detail. I am wondering if the interface could be further
extended to support this feature.
A "region" can be defined as a set of physical addresses that
belong to the same memory tier. The region ID is per socket
(i.e., unique within a single socket). Suppose we have a 2-socket
platform as follows:
S0: 1LM Direct DDR  ==> NUMA node 0
    CXL HDM (Tier2) ==> NUMA node 2
S1: 1LM Direct DDR  ==> NUMA node 1
    CXL HDM (Tier2) ==> NUMA node 3
region0 on S0 is node0, region1 on S0 is node2,
region0 on S1 is node1, region1 on S1 is node3.
Let us assume that each socket has 2 LLC domains.
For example, S0 has LLC domain0 and LLC domain1,
S1 has LLC domain2 and LLC domain3.
We propose the following schemata:
<resource name>_<region>_<control>
for example,
MB_REGION1_OPT:0=511;1=510;2=509;3=508
This means that for LLC domain0 on S0, the throttle
level for node2 (because region1 on S0 is node2)
is 511. For LLC domain2 on S1, the throttle
level for node3 (because region1 on S1 is node3)
is 509.
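The lookup described above can be sketched as follows. This is only an
illustration of the proposed semantics: the two topology tables are
hard-coded from the 2-socket example, and parse_schema_line() and resolve()
are hypothetical names. It also assumes the resource name itself contains
no underscore.

```python
# Assumed topology, copied from the 2-socket example in this mail.
LLC_DOMAIN_TO_SOCKET = {0: 0, 1: 0, 2: 1, 3: 1}
# (socket, region) -> NUMA node
REGION_TO_NODE = {(0, 0): 0, (0, 1): 2, (1, 0): 1, (1, 1): 3}


def parse_schema_line(line):
    """Split '<resource>_REGION<n>_<control>:<dom>=<val>;...' into parts."""
    name, sep, values = line.strip().partition(":")
    if not sep:
        raise ValueError(f"missing ':' in schema line: {line!r}")
    resource, region_tok, control = name.split("_", 2)
    if not region_tok.startswith("REGION"):
        raise ValueError(f"missing REGION<n> in schema name: {name!r}")
    region = int(region_tok[len("REGION"):])
    settings = {}
    for item in values.split(";"):
        domain, _, value = item.partition("=")
        settings[int(domain)] = int(value)
    return resource, region, control, settings


def resolve(line):
    """Return (llc_domain, socket, numa_node, throttle_level) tuples."""
    _, region, _, settings = parse_schema_line(line)
    rows = []
    for domain, level in sorted(settings.items()):
        socket = LLC_DOMAIN_TO_SOCKET[domain]
        rows.append((domain, socket, REGION_TO_NODE[(socket, region)], level))
    return rows


if __name__ == "__main__":
    for row in resolve("MB_REGION1_OPT:0=511;1=510;2=509;3=508"):
        print("LLC domain %d (S%d) -> node%d: throttle level %d" % row)
```

The point of the sketch is that the same region ID resolves to a different
NUMA node depending on which socket the LLC domain belongs to.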
Users could query the exact definition of REGION1
by checking the info directory.
info
└── MB
    └── resource_schemata
        ├── MB_REGION1_OPT
        │   ├── max
        │   ├── min
        │   ├── resolution
        │   ├── scale
        │   ├── scope <== "0=node2;1=node3" (node2 on S0, node3 on S1)
        │   ├── tolerance
        │   ├── type
        │   └── unit
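As a rough idea of how a tool could read the proposed "scope" file, here is
a small sketch. It assumes the file holds "<id>=node<N>" pairs separated by
";" exactly as in the example, and that the ID on the left is the socket
index (S0, S1). The format is only a proposal, and parse_scope() is a
hypothetical helper.

```python
def parse_scope(text):
    """Parse '0=node2;1=node3' into {0: 2, 1: 3} (assumed: socket -> node)."""
    mapping = {}
    for item in text.strip().split(";"):
        key, sep, node = item.partition("=")
        if not sep or not node.startswith("node"):
            raise ValueError(f"unexpected scope entry: {item!r}")
        mapping[int(key)] = int(node[len("node"):])
    return mapping


if __name__ == "__main__":
    # In practice the string would be read from the scope file of
    # MB_REGION1_OPT under the info directory shown above.
    print(parse_scope("0=node2;1=node3\n"))
```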
thanks,
Chenyu