Re: [PATCH v1 1/4] x86,fs/resctrl: Make resctrl_arch_is_evt_configurable() aware of mbm_assign_mode

From: Ben Horgan

Date: Thu Mar 05 2026 - 05:06:47 EST


Hi Reinette,

On 3/4/26 22:50, Reinette Chatre wrote:
> Hi Ben,
>
> On 3/4/26 1:01 PM, Ben Horgan wrote:
>> Hi Reinette,
>>
>> On 3/4/26 19:23, Reinette Chatre wrote:
>>> Hi Ben,
>>>
>>> On 3/4/26 9:37 AM, Ben Horgan wrote:
>>>> On 3/4/26 17:02, Reinette Chatre wrote:
>>>>> On 3/4/26 3:07 AM, Ben Horgan wrote:
>>>>>> On 3/3/26 18:09, Reinette Chatre wrote:
>>>>>>> On 3/3/26 4:29 AM, Ben Horgan wrote:
>>>>>>>> On Mon, Mar 02, 2026 at 03:11:48PM -0800, Reinette Chatre wrote:
>>>>>>>>> Hi Ben,
>>>>>>>>>
>>>>>>>>> On 2/25/26 12:19 PM, Ben Horgan wrote:
>>>>>>>>>> The features BMEC and ABMC provide separate interfaces to configuring which
>>>>>>>>>> bandwidth types a counter tracks. Currently
>>>>>>>>>> resctrl_arch_is_evt_configurable() only ever returns true if BMEC is
>>>>>>>>>> supported.
>>>>>>>>>>
>>>>>>>>>> ABMC is useful even when BMEC is supported as it also provides counter
>>>>>>>>>> assignment which reduces the number of hardware monitors a system
>>>>>>>>>> requires. It is an architectural detail that ABMC provides counter
>>>>>>>>>
>>>>>>>>> Since the goal is to support MPAM I'd suggest that the first focus be on what
>>>>>>>>> resctrl fs supports and exposes and how it does or does not work for MPAM.
>>>>>>>>>
>>>>>>>>>> configurability without requiring the prior feature, BMEC. On MPAM systems
>>>>>>>>>> these two features are independent and the bandwidth types are limited to a
>>>>>>>>>> choice of only read or write.
>>>>>>>>>
>>>>>>>>> Does MPAM support exactly these two features? Specifically, does MPAM support
>>>>>>>>> a feature that allows user to configure events globally per domain and another
>>>>>>>>> feature that allows user to configure events per PMG?
>>>>>>>>
>>>>>>>> No, the bandwidth type configuration in MPAM is per counter and so effectively
>>>>>>>> per (PARTID, PMG) pair. In supporting hardware, the configuration is made in the
>>>>>>>> RWBW field of MSMON_CFG_MBWU_FLT and allows counting of just read, just write,
>>>>>>>> or both.
>>>>>>>
>>>>>>> Thank you for confirming.
>>>>>>>
>>>>>>> Since BMEC event configuration is per domain I do not believe BMEC is relevant to MPAM.
>>>>>>>
>>>>>>>
>>>>>>>>> These different features are how I understand assignable counters and BMEC to
>>>>>>>>
>>>>>>>> We are each approaching this from a different view point. I've just been looking at
>>>>>>>> ABMC as a way of dealing with systems where there are fewer hardware counters than
>>>>>>>> (PARTID, PMG) pairs (num_rmid) by requiring a counter to be assigned to a
>>>>>>>> CTRL_MON or MON group in order to be usable. resctrl otherwise expects a counter
>>>>>>>> per CTRL_MON/MON group. Sharing bandwidth counters doesn't work
>>>>>>>
>>>>>>> No, resctrl does not expect a counter per CTRL_MON/MON group - in assignable
>>>>>>> counter mode the counter assignment is per monitoring group AND event as a pair:
>>>>>>> (CTRL_MON/MON group, event).
>>>>>>
>>>>>> Yes but these counters aren't necessarily fungible. For MPAM the
>>>>>> mbm_local_bytes and mbm_total_bytes are necessarily backed by different
>>>>>> hardware counters. A MPAM bandwidth counters just counts all traffic on
>>>>>> a link with the only configurability being for read/write. The counters
>>>>>> are just placed at different point in the topology to get the different
>>>>>> events.
>>>>>
>>>>> The distinction between "different hardware counters for mbm_local_bytes and
>>>>> mbm_total_bytes" and "The counters are just placed at different point in the
>>>>> topology" is not clear to me". The former implies different counters for the
>>>>> two events while the latter implies the same counters are used for both events
>>>>> but perhaps accumulated/displayed differently?
>>>>
>>>> For a given RIS, mpam device hardware unit of which an MSC may consist
>>>> of 1 or more, there are MPAMF_MBWUMON_IDR.NUM_MON hardware bandwidth
>>>> counters which measure traffic passing a specific point with no
>>>> filtering for where it's going. The filtering of this counter is
>>>> set up in MSMON_CFG_MBWU_FLT which only allows pmg/partid/(read/write).
>>>
>>> Thank you for the details. Is the expectation that user should be able to
>>> program all these counters via resctrl? If an MSC consists of multiple RIS
>>> with different counters then things get complicated very fast. Could it be
>>> constrained to only expose the maximum number of counters supported by
>>> all RIS at a particular scope? This would match what the existing
>>> num_mbm_cntrs file supports.
>>
>> Not individually, no, they will generally just be one per cache slice or
>> memory controller and all be programmed together as a component.
>
> Is this where the risk of double counting comes in? That is, adding up the
> memory bandwidth at the cache to the memory bandwidth at memory controller
> for a total memory bandwidth count?
>

Not double counting, so much. The problem is more about using these at
the same time. We were initially thinking that if the memory controller
topology matched that of the l3 caches then we could have
mbm_local_bytes at one and mbm_total_bytes at the other, but we realised
we weren't counting the right things. (Where 'topology matches' means
that there is a pairing between numa nodes and l3 caches where, within
each pair, they have the same affine cpus.) This would have led to
having more than one pool of hardware counters for memory bandwidth
that are, effectively, at the l3 cache. Going forward there are ideas
about placing the MSC in different places in the design which are
logically the l3 cache but mean that different bandwidth types could be
counted. This would need firmware description help (device tree/acpi),
so it is very much future work.
For the moment the only abuse we do around this in the MPAM driver is
that if there is a single l3 and a single numa node then we say that an
MSC counting traffic at the entry to the memory is the same as one at
the exit from the l3 (assuming l3 is the last level cache).
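
To make the condition above concrete, here is a rough sketch of the
check. It is illustrative only: struct topo_summary and
mem_msc_usable_as_l3() are made-up names for this mail, not the actual
MPAM driver code.

```c
#include <stdbool.h>

/* Hypothetical summary of the discovered topology (not a driver struct). */
struct topo_summary {
	unsigned int num_l3_caches;
	unsigned int num_numa_nodes;
	bool l3_is_last_level_cache;
};

/*
 * Returns true when an MSC counting traffic at the entry to memory may
 * be treated as if it were counting at the exit from the l3: a single
 * l3, a single numa node, and the l3 being the last level cache.
 */
static bool mem_msc_usable_as_l3(const struct topo_summary *t)
{
	return t->num_l3_caches == 1 &&
	       t->num_numa_nodes == 1 &&
	       t->l3_is_last_level_cache;
}
```

With anything other than one l3 and one numa node the sketch returns
false, so the memory-side MSC would not be treated as an l3 one.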

> ...
>
>
>> So, to try and bring this back to what we can be done now for MPAM to
>> fit into the counter mode assignment interface. Just support
>> mbm_total_bytes and then num_mbm_cntrs is correct (nothing to do). Make
>> the event_filter file always display all the bandwidth types and make
>> that the only value that be the only value it accepts (instead of hiding
>> the event_filter file). If you agree I'll respin with that.
>
> From resctrl side this sounds fine. I don't have any insight into what, if any,
> kind of gymnastics the MPAM driver needs to do to make the discovered MSCs with
> their varying scope and internal vs external counts fit into this. If initial
> implementation indeed forces some components into categories that are not a good
> match then when resctrl later does get support for diverse components there may
> be surprises to user space along the way. For example, user space may not see the
> same memory bandwidth numbers reported by the same events on the same system as
> the interface evolves.

Indeed, we have already weeded a few things out of the MPAM driver for
similar reasons. If we start with MPAM only supporting a
non-configurable mbm_total_bytes with ABMC I think we're ok. I'll drop
the non-ABMC bandwidth counter support from the MPAM driver as, even if
we've got enough counters, one per (CTRL_MON/MON, evt), we can use ABMC.
Also, when event configuration (read/write filtering) using user-defined
(or new) events is added, the number of counters that counts as
'enough' becomes higher. That will mean that the software controller is
not usable, but for now I think we can just fail when that mount option,
mba_MBps, is used. Later we can consider using non-ABMC bandwidth
counters when the software controller is requested.
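
As a rough illustration of why 'enough counters' moves: the budget is
one counter per (CTRL_MON/MON group, event) pair, so it scales with the
number of events. cntrs_needed() is a made-up helper for this mail and
the group and event counts are arbitrary, not taken from any platform.

```c
/*
 * Hypothetical helper: hardware counters needed for every monitoring
 * group to count every event without counter assignment.
 */
static unsigned int cntrs_needed(unsigned int num_groups,
				 unsigned int num_events)
{
	return num_groups * num_events;
}
```

For example, 32 groups with only mbm_total_bytes need 32 counters, while
the same 32 groups with three events (say total plus a read-only and a
write-only event) would need 96.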

>
> "make that the only value that be the only value it accepts" - are you saying that
> whatever is displayed when user views the "event_filter" file is what the
> user can write to the "event_filter" file? I find this a challenging interface
> for user space to use. The expectation is that the user can write any supported
> memory transaction to that file and when writing fails it can only be because
> of an invalid memory transaction. How can user space know that events are not
> configurable at all? It sounds as though user space is expected to try configuring
> the event with a memory transaction and then, presumably, check last_cmd_status?
>
> Could this not be simplified by making the "event_filter" file read-only on
> MPAM systems?

Yes, we'll need some finer-grained control over which sets of bandwidth
types can be configured further down the line, but going with read-only
when there is only one fixed set seems good to me.
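
Something along these lines, as a sketch only. event_filter_mode() and
the MODE_* macros are invented for this mail and are not existing
resctrl code; the real change would go through however resctrl sets up
its file permissions.

```c
#include <stdbool.h>

/* Hypothetical permission values for the event_filter file. */
#define MODE_RO 0444	/* readable, never writable */
#define MODE_RW 0644	/* readable, writable by owner */

/*
 * Pick the event_filter file mode: writable only when the set of
 * bandwidth types an event counts can actually be changed.
 */
static unsigned int event_filter_mode(bool evt_configurable)
{
	return evt_configurable ? MODE_RW : MODE_RO;
}
```

User space can then tell from the file permissions alone that the
filter is fixed, without attempting a write and checking
last_cmd_status.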

>
> Reinette
>

Thanks,

Ben