Re: [RFC] mpam,x86,fs/resctrl: Generic schema description Proof of Concept
From: Reinette Chatre
Date: Fri Jun 05 2026 - 11:47:52 EST
Hi Ben,
On 6/5/26 7:53 AM, Ben Horgan wrote:
> On 6/4/26 18:43, Reinette Chatre wrote:
>> On 6/3/26 8:15 AM, Ben Horgan wrote:
>>> On 5/29/26 19:06, Reinette Chatre wrote:
...
>>
>>> I plumbed in support for the MB_MIN resource schema which also works under light
>>> testing. The only fs resctrl code change I needed was:
>>>
>>> --- a/include/linux/resctrl.h
>>> +++ b/include/linux/resctrl.h
>>> @@ -483,6 +483,9 @@ static inline u32 resctrl_get_default_ctrlval(struct resctrl_ctrl *ctrl)
>>>  	case RESCTRL_CTRL_BITMAP:
>>>  		return BIT_MASK(ctrl->cache.cbm_len) - 1;
>>>  	case RESCTRL_CTRL_SCALAR:
>>> +		if (ctrl->name == RESCTRL_CTRL_NAME_MIN)
>>> +			return ctrl->membw.min_bw;
>>> +
>>>  		return ctrl->membw.max_bw;
>>>  	}
>>>
>>>
>>> At least on MPAM systems, we use a default of 0 for minimum bandwidth controls
>>> as the maximum bandwidth controls only take effect if their value is higher than
>>> the minimum bandwidth value. I have specialised this on the ctrl->name which
>>> breaks your ctrl->type based classification but that's fixable by just adding a
>>> default field to membw.
>>
>> This I am not sure about. In my understanding a typical "default" value means
>> "no throttling" and, at least on Intel, this default hardware state has been
>> summarized as "min" == "max" == "optimal".
>
> Ok, this sounds odd to me but that is probably because I don't know what Intel
> systems do. On MPAM systems a MIN control is a boost rather than a throttling
> control. Although, you can always think of that as throttling the traffic with
> the other PARTIDs.
>
>>
>> Are you saying that on MPAM systems if "min" == "max" then max bandwidth controls
>> do not take effect? Could you please elaborate what happens if "min" == "max"?
>
> Table 5-4 from section 5.2.8 of the IHI0099B.b shows the interaction between the
> min and maximum controls.
>
> If used bandwidth is     The preference is   Description
> -----------------------  ------------------  --------------------------------------
> Below the minimum        High                Only high requests compete with this
>                                              request.
>
> Above the minimum,       Medium              High requests are serviced first then
> below the maximum                            this request competes with other
>                                              medium requests.
>
> Above the maximum,       Low                 Requests are not serviced if any high
> when HARDLIM is 0                            or medium requests are available.
>
> Above the maximum,       None                Requests are not serviced
> when HARDLIM is 1
>
> So if we keep the minimum and the maximum controls values always the same then
> all traffic will be given "high" preference until the target bandwidth is
> reached. For some MPAM systems it is recommended to set the minimum value as 5%
> less than the maximum value to get a reliable target bandwidth. As 5% seems
> implementation specific and some systems don't have min controls it seemed
> better to just match the MB control with a maximum bandwidth control and let the
> user have freedom to choose the minimum bandwidth control when MB_MIN support is
> added.
>
> If the minimum defaults to the maximum possible bandwidth (100%) then any
> change of the maximum won't have any effect as it's always less than the
> minimum (if that's unchanged) and so all traffic is high preference. I now see
> from your reply below that you are planning on not allowing this kind of
> configuration.
>
> If the minimum always tracks the maximum then we lose the distinction between
> medium and high preference traffic and so to reserve some high preference
> bandwidth for one control group we'd have to change the configuration in the
> other control groups so that their bandwidth preference is medium (minimum
> value at 0).
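For illustration only, the quoted table can be read as the small classification below. This is a user-space sketch, not code from the spec or the patches: the enum, the function name mbw_preference() and the use of percentages are all hypothetical, and how usage exactly equal to a limit is treated is an assumption since the table only says "below" and "above".

```c
#include <stdbool.h>

/* Preference levels named in the quoted table. */
enum mbw_pref { PREF_HIGH, PREF_MEDIUM, PREF_LOW, PREF_NONE };

/*
 * Hypothetical helper: classify a request from the bandwidth its PARTID
 * currently uses and the min/max control values (all in percent here).
 * Assumption: usage equal to a limit counts as "above" that limit.
 */
static enum mbw_pref mbw_preference(unsigned int used, unsigned int min,
				    unsigned int max, bool hardlim)
{
	if (used < min)
		return PREF_HIGH;	/* below the minimum */
	if (used < max)
		return PREF_MEDIUM;	/* above the minimum, below the maximum */
	return hardlim ? PREF_NONE : PREF_LOW;	/* above the maximum */
}
```

With min == max the medium case can never be reached, and with min at 100% every usage below 100% is classified as high regardless of max, which matches the two observations quoted above.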
I do not think we are talking about the same thing here. I am *not* saying
that minimum and maximum controls should always be the same.

The discussion is about a proposed change to resctrl_get_default_ctrlval(). resctrl
uses this function in two places:
- When creating a new resource group:
  The intention here is that when user space creates a new resource group it should
  be created with the maximum allocations possible. For MBA this means "unthrottled".
  After creating the resource group user space can adjust allocations to match
  workload requirements.
- When unmounting the resctrl fs:
  The intention here is that all controls are set to unthrottled to stop any possible
  impact on the system when user space stops using resctrl.

resctrl_get_default_ctrlval() is thus intended to support an unthrottled baseline from
which user space can make configuration changes as supported by hardware and required
by workloads.

I see that the MPAM driver internally uses resctrl_get_default_ctrlval() in a couple
of places and I am not considering this usage here. If internally MPAM has other
usages for this function where it does not mean "unthrottled" then perhaps
it would be better to create a new function that matches the usage?
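
To make the "unthrottled baseline" meaning concrete, here is a minimal user-space model of the two uses described above. It is a sketch only: struct ctrl_model, default_ctrlval() and reset_domains() are hypothetical names that mirror, but are not, the kernel definitions, and it says nothing about what a MIN control should default to.

```c
#include <stddef.h>
#include <stdint.h>

enum ctrl_type { CTRL_BITMAP, CTRL_SCALAR };

/* Hypothetical stand-in for the per-control properties. */
struct ctrl_model {
	enum ctrl_type type;
	unsigned int cbm_len;	/* bitmap controls: number of bits, assumed < 32 */
	uint32_t max_bw;	/* scalar controls: largest settable value */
};

/* Default meaning "unthrottled": all bits set, or the largest bandwidth. */
static uint32_t default_ctrlval(const struct ctrl_model *c)
{
	switch (c->type) {
	case CTRL_BITMAP:
		return (UINT32_C(1) << c->cbm_len) - 1;
	case CTRL_SCALAR:
		return c->max_bw;
	}
	return 0;
}

/*
 * Both users (new resource group, unmount) reduce to the same step:
 * write the default value to every domain of the control.
 */
static void reset_domains(const struct ctrl_model *c, uint32_t *vals, size_t n)
{
	for (size_t i = 0; i < n; i++)
		vals[i] = default_ctrlval(c);
}
```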
>>>> - No support for "read-modify-write" usage of schemata file. This is where we
>>>> discussed (without agreement) on possibly introducing the "#" prefix to schemata
>>>> file entries. This PoC does not support this prefix and the current assumption/expectation
>>>> is that when user space changes a configuration only the new control values are
>>>> written to schemata file. I thus do not have a plan to support this so please
>>>> share opinions in this regard if you have some.
>>>
>>> There is now less motivation from the MPAM side for this than when this was
>>> initially discussed. In pre-upstream versions of the MPAM patches a change in
>>> the MB resource control value would change both the mpam h/w mbw_min and mbw_max
>>> values but now (on non-broken h/w) we just change the mbw_max. (mbw_min kept at 0).
>>
>> Ah, thanks for the correction. The email I linked above indeed refers to changing
>> both min and max.
>>
>>>
>>> However, it would be useful not to be limited by percentages. In my quick
>>
>> Indeed. Not being limited by percentages while still needing to have a backward
>> compatible user interface is how we ended up with "emulated controls".
>>
>>> experimentation with your patches I used a percentage value for MB_MIN but it
>>> would be best to move away from this. For new controls I think we can mandate
>>> that user space has to discover the resolution from the info directly but how
>>> can we retrofit this? For MPAM, MB and MB_MAX would control the same things.
>>> Could we just add MB_MAX with a h/w friendly scale and then reflect changes in
>>> MB_MAX in MB and vice versa, with MB taking precedence if both are set? Old
>>> software can continue setting MB or can move to using MB_MAX and take advantage of
>>> the improved control. (I don't think we should expose the MPAM hardware value
>>> directly as it has confusion over whether all 1s is 100% or not and we'd like to
>>> have something generic and friendly to the user.)
>>
>> Sounds to me as though you are describing emulated controls. Exposing two
>> controls in the schemata file that essentially control the same thing is what the
>> emulated controls aim to solve, and the resctrl hierarchies presented in slide #6
>> of that presentation (and discussed in the email thread) are how we contemplated
>> representing the relationship among these controls to user space. So, considering
>> your example resctrl may display something like:
>>
>> info//
>> └── MB/
>>     └── resource_schemata/
>>         └── MB/
>>             └── MB_MAX/
>>
>> Above hierarchy describes the relationship to user space that if MB is changed it
>> will impact MB_MAX and vice-versa.
>>
>> The one open I am aware of surrounding emulated controls is how to present some
>> semblance of consistency to user space when considering all the possibilities
>> the different architectures (and even within architectures) may have.
>
> What other use cases do we have apart from MB and MB_MAX? I was wondering if
> this could be limited to a default control (L2, L3, MB..) with a single new
> style control (L2_*, L3_*, MB_ ...) under it.
The motivation for these emulated controls is to not break a user space that does
not understand the "info/<resource>/resource_schemata" interface. At this time
user space expects every resource (not control) to have an entry in the schemata file.
So yes, I also see this as limited to the default control.

Whether it implies that only a single (finer grained/hardware) control would be under
it is not obvious to me since we already had one scenario where a legacy control is
emulated by two hardware controls when considering the example on MPAM where the "MB"
legacy control can be emulated with MPAM's "min" and "max" controls. An additional
complication is that some of these architecture specs describe several controls but have
their implementation as "optional" which presents a challenge when trying to create a
sane and consistent hierarchy.
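
As a rough illustration of the simplest case discussed above, where the legacy MB entry and a finer grained MB_MAX entry are two views of one value, consider the sketch below. Everything in it is an assumption made for illustration: the scale MB_MAX_FULL, the function names, and clamping out-of-range writes (a real interface might reject them instead). It does not cover the case where one legacy control is backed by two hardware controls.

```c
/* Hypothetical scale: the MB_MAX value that stands for 100% bandwidth. */
#define MB_MAX_FULL 1024u

/* One backing value per domain; MB and MB_MAX are two views of it. */
struct mb_emulated {
	unsigned int mb_max;
};

/* Writing the legacy percentage updates the finer grained value. */
static void write_mb(struct mb_emulated *d, unsigned int pct)
{
	if (pct > 100)
		pct = 100;	/* assumption: clamp rather than reject */
	d->mb_max = pct * MB_MAX_FULL / 100;
}

/* Writing the finer grained value is stored as is. */
static void write_mb_max(struct mb_emulated *d, unsigned int val)
{
	if (val > MB_MAX_FULL)
		val = MB_MAX_FULL;	/* assumption: clamp rather than reject */
	d->mb_max = val;
}

/* Reading MB derives a percentage, rounded to nearest. */
static unsigned int read_mb(const struct mb_emulated *d)
{
	return (d->mb_max * 100 + MB_MAX_FULL / 2) / MB_MAX_FULL;
}
```

A value written through MB reads back unchanged, while a value written through MB_MAX is only visible through MB at percentage granularity.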
>>>> - Controls are independent for now. This means that, for example, if a resource
>>>> supports a "MIN" and "MAX" control then this implementation would allow user to
>>>> set the "maximum" control values to be less than the "minimum" control values.
>>>
>>> I think this is ok as long as adding support for new controls in resctrl doesn't
>>> change the existing behaviour. In MPAM we dodged this by introducing MB as only
>>> affecting the h/w mbw_max and not mbw_min (as mentioned above).
>>
>> I understand this to be a requirement for Intel where the spec contains "The Maximum Cap
>> should be programmed to be greater than or equal to the Minimum and Optimal caps.
>> Undesirable and undefined performance effects may result if cap programming guidelines
>> are not followed."
>>
>> I am currently thinking that resctrl should not try to be too smart here and if user
>> space wants to make dramatic changes to min and max values then it should just ensure
>> the ordering is appropriate. For example, attempting to set a new min to be larger than
>> the old max would fail and user space should first increase the old max and then set
>> a new min.
>
> Ok with me.
Thank you for considering this.
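
A minimal sketch of the ordering rule quoted above, assuming a write that would violate min <= max is simply rejected. The struct, the function names and the choice of -EINVAL are hypothetical and not taken from the PoC:

```c
#include <errno.h>

/* Hypothetical per-domain state for a resource with MIN and MAX controls. */
struct bw_limits {
	unsigned int min;
	unsigned int max;
};

/* Reject a new minimum that would exceed the current maximum. */
static int set_min(struct bw_limits *l, unsigned int new_min)
{
	if (new_min > l->max)
		return -EINVAL;
	l->min = new_min;
	return 0;
}

/* Reject a new maximum that would fall below the current minimum. */
static int set_max(struct bw_limits *l, unsigned int new_max)
{
	if (new_max < l->min)
		return -EINVAL;
	l->max = new_max;
	return 0;
}
```

Starting from min=10 and max=50, raising min to 70 fails until user space first raises max to at least 70.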
...
>>>> Primary resctrl fs data structure changes
>>>> =========================================
>>>>
>>>> Introduces a control represented by struct resctrl_ctrl that looks as below. To make
>>>> the changes easier to follow I kept some of the original names to help communicate
>>>> where familiar data structures land.
>>>>
>>>> What to notice about a control is that it has some common properties required
>>>> from all controls (scope, type, etc.) and then depending on the type of control
>>>> (RESCTRL_CTRL_BITMAP or RESCTRL_CTRL_SCALAR) there are type specific properties.
>>>>
>>>> /**
>>>> * struct resctrl_ctrl - A resource control
>>>> * @entry: List entry of rdt_resource::controls
>>>> * @scope: Scope of the resource that this control allocates
>>>> * @domains: RCU list of all control domains
>>>> * @type: The control type that determines the properties of the control,
>>>> * format string for displaying control values to user space, and
>>>> * parser of control values provided by user space.
>>>> * @name: Name of the control. Appended to final resource name
>>>> * (rdt_resource_final::name) to create final schema entry.
>>>> * Specifically, "rdt_resource_final::name"_"resctrl_ctrl::name".
>>>> * For example, with resource name "MB" and control name "MAX" the
>>>> * schema entry will be "MB_MAX".
>>>> * @cache: Cache allocation control properties.
>>>> * @membw: Bandwidth control properties.
>>>> */
>>>> struct resctrl_ctrl {
>>>> 	struct list_head		entry;
>>>> 	enum resctrl_scope		scope;
>>>> 	struct list_head		domains;
>>>> 	enum resctrl_ctrl_type		type;
>>>> 	enum resctrl_ctrl_name		name;
>>>> 	union {
>>>> 		struct resctrl_cache	cache;
>>>> 		struct resctrl_membw	membw;
>>>> 	};
>>>> };
>>>>
>>>> Two members summarize how this new structure fits into the rest of resctrl:
>>>> a) resctrl_ctrl::entry
>>>> Since a resource can support multiple controls there is a new list
>>>> in struct rdt_resource named "controls" that contains the list of all
>>>> controls supported by the resource.
>>>> b) resctrl_ctrl::domains
>>>> Instead of the list of control domains belonging to a resource they
>>>> now belong to the control itself. By doing so resctrl can support resource
>>>> controls at different scope for the same resource. This is intended to
>>>> support some upcoming MPAM and RISC-V usages.
>>>
>>> Please can you expand a bit on part b).
>>>
>>> In an MPAM system we consider 3 resctrl resources, RDT_RESOURCE_L3,
>>> RDT_RESOURCE_L2 and RDT_RESOURCE_MBA which correspond to the L3 caches, L2
>>> caches and memory bandwidth on egress from the L3 caches. The domain for each of
>>> these corresponds to the instance of the resource. That is, for RDT_RESOURCE_L2
>>> there is a resource for each L2 instance, similarly for L3, and for
>>
>> (I'm assuming above is typo and it is "there is a domain for each L2 instance"?)
>
> yes, a mistake
>
>>
>>> RDT_RESOURCE_MBA there is a domain for each L3 cache. If we were to add support
>>> for controls on a new cache level, say the L4, then I'd expect to add a new
>>> resource. For memory bandwidth, we'd like to be able to control b/w on the L2
>>> egress (e.g. in a DSU). Wouldn't this too be a separate resource or would this
>>> be a new set of controls on the same resource?
>>>
>>> New controls on the same resource
>>> MB_MIN2
>>> MB_MAX2
>>> MB_PROP2
>>> ...
>>>
>>> or
>>> MB2_MIN
>>> MB2_MAX
>>> MB2_PROP
>>
>>
>> The way I currently see it is that controlling bandwidth at a different scope would
>> be a new set of controls associated with the MB resource. There are more scenarios
>> coming this way with AMD's "Global MBA" that is memory bandwidth allocation at
>> NUMA node scope. If I understand correctly the "CPU-less Memory Node" that Nvidia
>> shared at plumbers would need this also and control memory bandwidth allocation
>> at the NUMA node scope.
>
> Yes, in general for MSC at the memory controllers it would be good to scope these
> by NUMA node whether or not they are CPU-less.
>
>> A related technology is Intel's region-aware MBA, which is
>> still at L3 scope.
>>
>> I fully agree that we need to figure out how to represent all of this to user space
>> without turning the interface into something unintelligible. In the end this is
>> required for user space to know what a domain ID represents.
>>
>> Would it help to make the scope part of the control name? The ship has sailed for
>> MB being associated with L3 scope but this could mean the "default" scope of MB
>> resource is L3 (which user space can still confirm by looking at the control's
>> "scope" file) and the others include scope in the name? Consider for example:
>> https://lore.kernel.org/lkml/fb1e2686-237b-4536-acd6-15159abafcba@xxxxxxxxx/
>
> This certainly helps with the naming.
>
> The scope does have an effect on what causes a domain to be present or not. For
> existing scopes, such as L3 scope, whether a domain is online depends on
> whether a CPU from a given set is online, and cpus_read_lock() is taken.
> However, for NUMA scope in MPAM (maybe not GMBA?), whether the domain is
> online would need to depend on whether the memory is online, and the memory
> hotplug lock would need to be taken. I am wondering if this sort of
> configuration means it's better to have the NUMA scoped memory bandwidth on a
> different resource, or we just say ok and always take the memory hotplug lock,
> get_online_mems(), where we take cpus_read_lock().
Oh, thank you for bringing this up. I have not considered how the memory hotplug lock
needs to be integrated. Taking cpus_read_lock() has permeated the entire subsystem.

My initial thought is that having unique per-resource locking sounds complicated while
always taking the memory hotplug lock sounds much simpler. I do not see many users of
get_online_mems() though.
>>> AFAIK, the DSU h/w just supports proportional bandwidth controls at the moment
>>> but we should consider what to do about the potential naming.
>>
>> ack.
>>
>>>
>>> In the MPAM driver, we collect MSC into components (based on instances) and
>>> those into classes (components of the same type). Currently, a resource is
>>> mapped to a single class. (Two resources may map to the same class.)
>>>
>>> I expect it is useful in the memory region and sub numa cases but I'd still
>>> expect the common case to be that the domains are the same within a control. Or
>>> am I missing something?
>>
>> Domains of a control should all be at the same scope. Since the schemata file
>> exposes the control with the different IDs representing the instances of the
>> resource needing to be controlled it has to be clear to user space what the
>> domain ID represents.
>
> Agreed. (I meant to say the domains within a resource are likely to be the same
> for each control within the same resource.)
This seems accurate for the resources that have implicit scope (the caches) but
memory bandwidth as a resource is looking more like it needs to support allocation
at different scopes.
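
To illustrate, a resource whose controls each carry their own scope could be modelled roughly as below. This is a user-space sketch with hypothetical names (struct ctrl_model, schema_entry(), scope_str()) and hypothetical scope strings. It only follows the "<resource>_<control>" naming from the struct resctrl_ctrl description and does not attempt to answer whether the scope should become part of the name.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical scopes a bandwidth control might allocate at. */
enum scope_model { SCOPE_L2_CACHE, SCOPE_L3_CACHE, SCOPE_NUMA_NODE };

/* Each control of a resource carries its own name and scope. */
struct ctrl_model {
	const char *name;	/* e.g. "MAX" */
	enum scope_model scope;
};

/* Build the schema entry name as "<resource>_<control>", e.g. "MB_MAX". */
static int schema_entry(char *buf, size_t len, const char *res,
			const struct ctrl_model *c)
{
	return snprintf(buf, len, "%s_%s", res, c->name);
}

/* What a per-control "scope" file could report (strings are made up). */
static const char *scope_str(enum scope_model s)
{
	switch (s) {
	case SCOPE_L2_CACHE:
		return "L2";
	case SCOPE_L3_CACHE:
		return "L3";
	case SCOPE_NUMA_NODE:
		return "NODE";
	}
	return "unknown";
}
```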
Reinette