Re: [RFC] mpam,x86,fs/resctrl: Generic schema description Proof of Concept
From: Ben Horgan
Date: Thu Jun 04 2026 - 07:25:24 EST
Hi Drew,
On 6/3/26 20:34, Drew Fustini wrote:
> On Wed, Jun 03, 2026 at 04:15:51PM +0100, Ben Horgan wrote:
>> Hi Reinette,
>>
>> On 5/29/26 19:06, Reinette Chatre wrote:
>>> Hi Everybody,
>>>
>>> It has been a while since we discussed the resctrl changes required to support
>>> hardware that has controls with fine granularity or hardware that has multiple
>>> controls per resource. For reference, the most recent email discussion can
>>> be found at [1] with a summary of discussions in last year's plumbers slides [2].
>>>
>>> I created a PoC that I believe supports what folks have agreed to so far. I
>>> hope this can help us to restart the discussion with the goal that resctrl gains
>>> support for upcoming hardware that requires these features.
>>
>> Thank you very much for doing this work. I believe this will be very useful for
>> MPAM and other architectures.
>
> Yes, thanks to Reinette for working on the generic schema proof of
> concept. This will be helpful for supporting the RISC-V CBQRI (capacity
> and bandwidth QoS) spec.
>
>> I plumbed in support for the MB_MIN resource schema which also works under light
>> testing. The only fs resctrl code change I needed was:
>>
>> --- a/include/linux/resctrl.h
>> +++ b/include/linux/resctrl.h
>> @@ -483,6 +483,9 @@ static inline u32 resctrl_get_default_ctrlval(struct
>> resctrl_ctrl *ctrl)
>> case RESCTRL_CTRL_BITMAP:
>> return BIT_MASK(ctrl->cache.cbm_len) - 1;
>> case RESCTRL_CTRL_SCALAR:
>> + if (ctrl->name == RESCTRL_CTRL_NAME_MIN)
>> + return ctrl->membw.min_bw;
>> +
>> return ctrl->membw.max_bw;
>> }
>>
>>
>> At least on MPAM systems, we use a default of 0 for minimum bandwidth controls
>> as the maximum bandwidth controls only take effect if their value is higher than
>> the minimum bandwidth value. I have specialised this on the ctrl->name which
>> breaks your ctrl->type based classification but that's fixable by just adding a
>> default field to membw.
>
> This should be useful for RISC-V.
>
> RESCTRL_CTRL_NAME_MIN maps well to CBQRI Rbwb (reserved bandwidth
> blocks). The sum of Rbwb across all control groups must be less than
> MRBWB (maximum number of reserved bandwidth blocks). As a result, MB_MIN
> needs to default to 1 so that the sum does not violate that rule. In my
> RFC series, I added default_to_min to resctrl_membw [1] but this
> solution looks cleaner.
>
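The "default field in membw" idea can be sketched roughly as below. This is only an illustration under stated assumptions: the structures are cut-down, hypothetical stand-ins for the PoC ones, and `default_bw` is the proposed new field rather than anything that exists in the PoC today. The architecture would fill it in (0 for MPAM's minimum bandwidth, 1 for CBQRI Rbwb, max_bw for a throttle), so the fs code can keep classifying on the control type.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced stand-ins for the PoC structures. */
enum ctrl_type { CTRL_BITMAP, CTRL_SCALAR };

struct ctrl_cache { unsigned int cbm_len; };

struct ctrl_membw {
	uint32_t min_bw;
	uint32_t max_bw;
	uint32_t default_bw;	/* assumed new field, set by the arch code */
};

struct ctrl {
	enum ctrl_type type;
	struct ctrl_cache cache;
	struct ctrl_membw membw;
};

/* Pick the default from the control type only, never from its name. */
static uint32_t get_default_ctrlval(const struct ctrl *c)
{
	switch (c->type) {
	case CTRL_BITMAP:
		assert(c->cache.cbm_len > 0 && c->cache.cbm_len <= 32);
		/* All cbm_len bits set. */
		return (uint32_t)((1ULL << c->cache.cbm_len) - 1);
	case CTRL_SCALAR:
		return c->membw.default_bw;
	}
	return 0;
}
```

With this shape, MB_MIN on a CBQRI system and MB_MIN on an MPAM system differ only in the value the architecture stores in `default_bw`.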
>>> - No support for "read-modify-write" usage of schemata file. This is where we
>>> discussed (without agreement) on possibly introducing the "#" prefix to schemata
>>> file entries. This PoC does not support this prefix and the current assumption/expectation
>>> is that when user space changes a configuration only the new control values are
>>> written to schemata file. I thus do not have a plan to support this so please
>>> share opinions in this regard if you have some.
>>
>> There is now less motivation from the MPAM side for this than when this was
>> initially discussed. In pre-upstream versions of the MPAM patches a change in
>> the MB resource control value would change both the mpam h/w mbw_min and mbw_max
>> values but now (on non-broken h/w) we just change the mbw_max. (mbw_min kept at 0).
>>
>> However, it would be useful not to be limited by percentages. In my quick
>> experimentation with your patches I used a percentage value for MB_MIN but it
>> would be best to move away from this. For new controls I think we can mandate
>> that user space has to discover the resolution from the info directly but how
>> can we retrofit this? For MPAM, MB and MB_MAX would control the same thing.
>> Could we just add MB_MAX with a h/w friendly scale and then reflect changes in
>> MB_MAX in MB and vice versa, with MB taking precedence if both are set? Old
>> software can continue setting MB; new software can move to MB_MAX and take advantage of
>> the improved control. (I don't think we should expose the MPAM hardware value
>> directly as it has confusion over whether all 1s is 100% or not and we'd like to
>> have something generic and friendly to the user.)
>
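To make the MB / MB_MAX mirroring concrete, here is a minimal sketch of how the two views of one control could be kept in step. Everything in it is an assumption for illustration: the helper names, the idea that MB_MAX counts in units of 1/scale of full bandwidth, and the round-to-nearest conversion. The only part taken from the proposal above is that MB wins when both are written.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a legacy MB percentage to the finer MB_MAX scale (rounded). */
static uint32_t mb_to_mb_max(uint32_t percent, uint32_t scale)
{
	assert(percent <= 100 && scale > 0);
	return (uint32_t)(((uint64_t)percent * scale + 50) / 100);
}

/* Reflect an MB_MAX value back into the legacy MB percentage (rounded). */
static uint32_t mb_max_to_mb(uint32_t mb_max, uint32_t scale)
{
	assert(scale > 0 && mb_max <= scale);
	return (uint32_t)(((uint64_t)mb_max * 100 + scale / 2) / scale);
}

/*
 * Resolve one schemata write that may carry MB, MB_MAX, both or neither.
 * MB takes precedence if both are present; with neither, keep the
 * current value.
 */
static uint32_t resolve_mb_max(int have_mb, uint32_t mb,
			       int have_mb_max, uint32_t mb_max,
			       uint32_t scale, uint32_t current)
{
	if (have_mb)
		return mb_to_mb_max(mb, scale);
	if (have_mb_max)
		return mb_max;
	return current;
}
```

One consequence of the rounding is that a value written through MB_MAX may not read back identically after a round trip through MB, which is one more reason for new software to stay on MB_MAX.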
> The facility for non-percentage values is important for RISC-V as CBQRI does
> not include a percentage throttle. It has two controls for bandwidth:
>
> - Rbwb: number of reserved bandwidth blocks [1, 2^13]
> - Mweight: weighted share of the remaining bandwidth [0, 255]
>   - 0: disables work-conserving sharing
>   - 1..255: compete for the leftover pool
>   - It makes sense for it to default to max (255) so that there won't be
>     any unused bandwidth
>
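The ranges and the sum rule quoted in this thread could be checked along the following lines. This is a sketch, not CBQRI driver code: the function names are hypothetical, and the strict "less than MRBWB" comparison simply follows the wording used earlier in the thread, so whether equality is allowed should be confirmed against the CBQRI spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RBWB_MIN	1u
#define RBWB_MAX	(1u << 13)	/* range quoted above: [1, 2^13] */
#define MWEIGHT_MAX	255u		/* range quoted above: [0, 255] */

static bool rbwb_valid(uint32_t rbwb)
{
	return rbwb >= RBWB_MIN && rbwb <= RBWB_MAX;
}

static bool mweight_valid(uint32_t mweight)
{
	return mweight <= MWEIGHT_MAX;
}

/*
 * Check every group's Rbwb and the sum across all control groups.
 * "sum < mrbwb" mirrors the thread's wording (an assumption here).
 */
static bool rbwb_sum_ok(const uint32_t *rbwb, size_t n, uint32_t mrbwb)
{
	uint64_t sum = 0;
	size_t i;

	assert(rbwb != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (!rbwb_valid(rbwb[i]))
			return false;
		sum += rbwb[i];
	}
	return sum < mrbwb;
}
```

This also shows why a default of 1 per group is the natural choice for MB_MIN: it is the smallest valid Rbwb, so newly created groups consume as little of the MRBWB budget as possible.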
> I think Mweight could be aligned with MPAM's proportional stride.

Yes, I hope so. There are a few differences which would have to be considered.

MPAM doesn't have a concept of only applying the weights once the reserved min
bandwidth is consumed. The interaction with min bandwidth is currently
unspecified. I don't think there are any designs where proportional bandwidth
and min b/w are on the same component and so it's only a theoretical/future problem.

For MPAM proportional stride, the higher the stride the lower the weight. We'll
have to make sure that whatever user configuration scale we provide works well
for both. If two PARTIDs have stride 2x and a third has stride x then the two
PARTIDs with stride 2x together get the same bandwidth as the third. Whereas, to
get the same in RISC-V the two PARTIDs would have weight y and the third 2y.

It's not specified for MPAM exactly what happens when you disable proportional
stride for a given PARTID.

The MPAM proportional control is work-conserving (the table in B.b RWKZBJ has
been confirmed as a spec mistake) and only considers the current contenders
for bandwidth. From my reading of the CBQRI spec this is the same for RISC-V.
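
To illustrate the stride/weight inversion, here is a small sketch that turns a set of strides into the equivalent weights. It assumes that a PARTID's share is inversely proportional to its stride, which is how I read the example above; the helper names are made up, and overflow of the least common multiple is ignored for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
	while (b) {
		uint64_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Fill weight[i] with the smallest integers proportional to
 * 1 / stride[i]. Returns false if any stride is zero.
 */
static bool strides_to_weights(const uint32_t *stride, uint64_t *weight,
			       size_t n)
{
	uint64_t lcm = 1;
	size_t i;

	assert(stride != NULL && weight != NULL);
	for (i = 0; i < n; i++) {
		if (stride[i] == 0)
			return false;
		lcm = lcm / gcd_u64(lcm, stride[i]) * stride[i];
	}
	for (i = 0; i < n; i++)
		weight[i] = lcm / stride[i];
	return true;
}
```

For strides {2, 2, 1} this gives weights {1, 1, 2}, i.e. y, y and 2y with y = 1, matching the example above.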
Thanks,
Ben
>
> Here is the patch I created to add Mweight support:
>
> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> index d95ab8ad36e2..3537071e3ab0 100644
> --- a/fs/resctrl/ctrlmondata.c
> +++ b/fs/resctrl/ctrlmondata.c
> @@ -304,6 +304,7 @@ static const char * const resctrl_ctrl_name[] = {
> [RESCTRL_CTRL_NAME_DEF] = "",
> [RESCTRL_CTRL_NAME_MIN] = "MIN",
> [RESCTRL_CTRL_NAME_MAX] = "MAX",
> + [RESCTRL_CTRL_NAME_WGHT] = "WGHT",
> };
>
> const char *resctrl_ctrl_name_str(enum resctrl_ctrl_name name)
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 72fb7256270e..09efcef9ce66 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -348,12 +348,14 @@ struct resctrl_mon {
> * has the same name as the resource.
> * @RESCTRL_CTRL_NAME_MIN: "MIN"
> * @RESCTRL_CTRL_NAME_MAX: "MAX"
> + * @RESCTRL_CTRL_NAME_WGHT: "WGHT"
> */
> enum resctrl_ctrl_name {
> RESCTRL_CTRL_NAME_DEF,
> RESCTRL_CTRL_NAME_MIN,
> RESCTRL_CTRL_NAME_MAX,
> - RESCTRL_CTRL_NAME_LAST = RESCTRL_CTRL_NAME_MAX
> + RESCTRL_CTRL_NAME_WGHT,
> + RESCTRL_CTRL_NAME_LAST = RESCTRL_CTRL_NAME_WGHT
> };
>
>>> - Controls are independent for now. This means that, for example, if a resource
>>> supports a "MIN" and "MAX" control then this implementation would allow user to
>>> set the "maximum" control values to be less than the "minimum" control values.
>>
>> I think this is ok as long as adding support for new controls in resctrl doesn't
>> change the existing behaviour. In MPAM we dodged this by introducing MB as only
>> affecting the h/w mbw_max and not mbw_min (as mentioned above).
>
> There is no equivalent to MB (percentage throttle) in RISC-V so I would
> want it to be valid to have MB_MIN (minimum reservation) without MB.
>
> I rebased my RISC-V CBQRI v6 series on top of this proof of concept and
> was able to validate it works okay in QEMU:
>
> MB_WGHT:72=255
> MB_MIN:72=756
> L2:64=fff;65=fff
> L3:75=ffff
>
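For anyone scripting against output like the above, a user-space sketch of splitting one schemata line into its name and per-domain values follows. The function and struct names are hypothetical. The numeric base is left to the caller because, as in the sample above, scalar controls are shown in decimal and cache bitmaps in hex.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct dom_val {
	unsigned long id;	/* domain id, always decimal */
	unsigned long val;	/* control value, base chosen by caller */
};

/*
 * Parse "NAME:id=val;id=val". Copies NAME into name and fills out[].
 * Returns the number of domains parsed, or -1 on malformed input.
 */
static int parse_schema_line(const char *line, int base,
			     char *name, size_t name_sz,
			     struct dom_val *out, size_t max)
{
	const char *colon, *p;
	size_t len;
	int n = 0;

	assert(line != NULL && name != NULL && out != NULL);
	colon = strchr(line, ':');
	if (!colon)
		return -1;
	len = (size_t)(colon - line);
	if (len == 0 || len >= name_sz)
		return -1;
	memcpy(name, line, len);
	name[len] = '\0';

	p = colon + 1;
	while (*p) {
		char *end;

		if ((size_t)n >= max)
			return -1;
		out[n].id = strtoul(p, &end, 10);
		if (end == p || *end != '=')
			return -1;
		p = end + 1;
		out[n].val = strtoul(p, &end, base);
		if (end == p)
			return -1;
		n++;
		if (*end == ';')
			p = end + 1;
		else if (*end == '\0' || *end == '\n')
			break;
		else
			return -1;
	}
	return n;
}
```

Called with base 10 on "MB_MIN:72=756" it yields one domain (72, 756); called with base 16 on "L2:64=fff;65=fff" it yields two domains, each with value 0xfff.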
> Thanks,
> Drew
>
> [1] https://lore.kernel.org/all/20260601-ssqosid-cbqri-rqsc-v7-0-v6-6-baf00f50028a@xxxxxxxxxx/