Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

From: Reinette Chatre
Date: Fri Jul 12 2024 - 18:04:11 EST


Hi Babu,

On 7/3/24 2:48 PM, Babu Moger wrote:
# Linux Implementation

Linux resctrl subsystem provides the interface to count maximum of two
memory bandwidth events per group, from a combination of available total
and local events. Keeping the current interface, users can enable a maximum
of 2 ABMC counters per group. User will also have the option to enable only
one counter to the group. If the system runs out of assignable ABMC
counters, kernel will display an error. Users need to disable an already
enabled counter to make space for new assignments.

The implementation appears to be converging on an interface that can
be generic enough to be used by other features discussed along the way.
The "Linux implementation" summary can thus add:

Create a generic interface aimed to support user space assignment
of scarce counters used for monitoring. First usage of interface
is by ABMC with option to expand usage to "soft-RMID" and MPAM
counters in future.


# Examples

a. Check if ABMC support is available
#mount -t resctrl resctrl /sys/fs/resctrl/

#cat /sys/fs/resctrl/info/L3_MON/mbm_mode
[abmc]
legacy

Linux kernel detected ABMC feature and it is enabled.

How about renaming "abmc" to "mbm_cntrs"? This will match the num_mbm_cntrs
info file and be the final step to make this generic so that another architecture
can more easily support assigning hardware counters without needing to call
the feature AMD's "abmc".

Expanding on this it may be possible to add a new "sw_mbm_cntrs" feature that
will be the "soft-RMID" feature while also reflecting the "mbm_cntrs" name
so that when user space enables that feature its properties can be found in
"num_mbm_cntrs".

The "abmc" kernel parameter remains but that does seem separate from this
resctrl fs feature since it is explicitly tied to X86_FEATURE_ABMC surely
making it architecture specific.


b. Check how many ABMC counters are available.

#cat /sys/fs/resctrl/info/L3_MON/num_cntrs
32

This is now num_mbm_cntrs


c. Create few resctrl groups.

# mkdir /sys/fs/resctrl/mon_groups/child_default_mon_grp
# mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp
# mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp/mon_groups/child_non_default_mon_grp


d. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_control
to list and modify the group's monitoring states. File provides single place
to list monitoring states of all the resctrl groups. It makes it easier for
user space to learn how the counters are used without needing to traverse
all the groups thus reducing the number of filesystem calls.

The list follows the following format:

"<CTRL_MON group>/<MON group>/<domain_id>=<flags>"

Format for specific type of groups:

* Default CTRL_MON group:
"//<domain_id>=<flags>"

* Non-default CTRL_MON group:
"<CTRL_MON group>//<domain_id>=<flags>"

* Child MON group of default CTRL_MON group:
"/<MON group>/<domain_id>=<flags>"

* Child MON group of non-default CTRL_MON group:
"<CTRL_MON group>/<MON group>/<domain_id>=<flags>"

Flags can be one of the following:

t MBM total event is enabled.
l MBM local event is enabled.
tl Both total and local MBM events are enabled.
_ None of the MBM events are enabled

The language needs to be changed here (and in the many copied places) to
be specific about what setting the flag accomplishes. For example, in
"legacy" mode user space can be expected to find all events enabled, no?
Needing a new feature to set a flag to accomplish something that is
possible in legacy mode can thus cause confusion.

If I understand the implementation, reading "mbm_control" will fail
if the system is ABMC capable but ABMC is disabled. Why can "mbm_control" not
always be displayed to user space? For example, "mbm_control" could
always be available to user space and provide specific information to
user space, such as:
t MBM total event is enabled but may not always be counted.
T MBM total event is enabled and being counted.

On AMD systems resource groups will have "t" associated with monitor
groups when ABMC is disabled, "T" when ABMC is enabled and a counter is assigned.
On Intel systems monitor groups will always have "T".

For "soft-RMID" the flag could possibly continue to be "T"?

I am trying to find ways to communicate to user space consistently
and clearly and any insights will be appreciated. We really do not want
to add this interface and then find that it just causes confusion.

It is not quite obvious to me when the new files should be visible and
what they should present to the user. "mbm_mode" is now always visible.
Should "num_mbm_cntrs" not also always be visible? Right now "num_mbm_cntrs"
appears to be associated only with ABMC. Should it not also, for example,
be the file that "soft-RMID" may use to share how many counters are
available? Its contents will thus be dynamic based on which "MBM mode" is
active, raising the question of what it should contain when "legacy" mode is
enabled. Should "num_mbm_cntrs" perhaps show "0" to user space when
"legacy" mode is active?



Examples:

# cat /sys/fs/resctrl/info/L3_MON/mbm_control
non_default_ctrl_mon_grp//0=tl;1=tl;
non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
//0=tl;1=tl;
/child_default_mon_grp/0=tl;1=tl;

There are four groups and all the groups have local and total
event enabled on domain 0 and 1.

"local and total event" is vague, can it be made specific with, for example,
"local and total MBM events"


=tl means both total and local events are enabled.

Same here (and all copied places in this series)


"//" - This is a default CTRL_MON group

"non_default_ctrl_mon_grp//" - This is non-default CTRL_MON group

"/child_default_mon_grp/" - This is Child MON group of the defult group

Same typos as in previous version of cover letter.


"non_default_ctrl_mon_grp/child_non_default_mon_grp/" - This is child
MON group of the non-default group
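
To make the list format above concrete, here is a minimal user-space sketch of
how one line of "mbm_control" output could be split into its parts. It only
assumes the "<CTRL_MON group>/<MON group>/<domain_id>=<flags>;" layout shown in
the examples; the names GroupState and parse_mbm_control_line are hypothetical
and not part of this series.

```python
from typing import Dict, NamedTuple


class GroupState(NamedTuple):
    """Monitoring state of one resctrl group (hypothetical helper type)."""
    ctrl_mon: str            # "" means the default CTRL_MON group
    mon: str                 # "" means the CTRL_MON group itself
    domains: Dict[int, str]  # domain id -> flags, e.g. {0: "tl", 1: "_"}


def parse_mbm_control_line(line: str) -> GroupState:
    """Parse one line such as 'grp/child/0=tl;1=_;'."""
    # The first two '/' separate CTRL_MON group, MON group and assignments.
    ctrl_mon, mon, assignments = line.strip().split("/", 2)
    domains: Dict[int, str] = {}
    for item in assignments.split(";"):
        if not item:  # skip the empty piece after the trailing ';'
            continue
        domain_id, flags = item.split("=", 1)
        domains[int(domain_id)] = flags
    return GroupState(ctrl_mon, mon, domains)


if __name__ == "__main__":
    for text in ("//0=tl;1=tl;", "/child_default_mon_grp/0=tl;1=l;"):
        print(parse_mbm_control_line(text))
```

An empty first field identifies the default CTRL_MON group and an empty second
field identifies the CTRL_MON group itself, which mirrors the four cases listed
above.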

e. Update the group assignment states using the interface file /sys/fs/resctrl/info/L3_MON/mbm_control.

The write format is similar to the above list format with addition of
op-code for the assignment operation.

* Default CTRL_MON group:
"//<domain_id><op-code><flags>"

* Non-default CTRL_MON group:
"<CTRL_MON group>//<domain_id><op-code><flags>"

* Child MON group of default CTRL_MON group:
"/<MON group>/<domain_id><op-code><flags>"

* Child MON group of non-default CTRL_MON group:
"<CTRL_MON group>/<MON group>/<domain_id><op-code><flags>"

Op-code can be one of the following:

= Update the assignment to match the flag.
+ Assign a new state.
- Unassign a new state.

Please be consistent with terminology. Above switches between "flag"
and "state" while it then continues below using "event". Also,
"Unassign a _new_ state" is unexpected, it should probably be an
_existing_ (not "new") state/flag/event?


Flags can be one of the following:

t MBM total event.
l MBM local event.
tl Both total and local MBM events.
_ None of the MBM events. Only works with '=' op-code.

Initial group status:
# cat /sys/fs/resctrl/info/L3_MON/mbm_control
non_default_ctrl_mon_grp//0=tl;1=tl;
non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
//0=tl;1=tl;
/child_default_mon_grp/0=tl;1=tl;

To update the default group to enable only total event on domain 0:
# echo "//0=t" > /sys/fs/resctrl/info/L3_MON/mbm_control

Assignment status after the update:
# cat /sys/fs/resctrl/info/L3_MON/mbm_control
non_default_ctrl_mon_grp//0=tl;1=tl;
non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
//0=t;1=tl;
/child_default_mon_grp/0=tl;1=tl;

To update the MON group child_default_mon_grp to remove total event on domain 1:
# echo "/child_default_mon_grp/1-t" > /sys/fs/resctrl/info/L3_MON/mbm_control

Assignment status after the update:
# cat /sys/fs/resctrl/info/L3_MON/mbm_control
non_default_ctrl_mon_grp//0=tl;1=tl;
non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
//0=t;1=tl;
/child_default_mon_grp/0=tl;1=l;

To update the MON group non_default_ctrl_mon_grp/child_non_default_mon_grp to
remove both local and total events on domain 1:
# echo "non_default_ctrl_mon_grp/child_non_default_mon_grp/1=_" > /sys/fs/resctrl/info/L3_MON/mbm_control

Assignment status after the update:
non_default_ctrl_mon_grp//0=tl;1=tl;
non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
//0=t;1=tl;
/child_default_mon_grp/0=tl;1=l;

To update the default group to add the local event on domain 0:
# echo "//0+l" > /sys/fs/resctrl/info/L3_MON/mbm_control

Assignment status after the update:
# cat /sys/fs/resctrl/info/L3_MON/mbm_control
non_default_ctrl_mon_grp//0=tl;1=tl;
non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
//0=tl;1=tl;
/child_default_mon_grp/0=tl;1=l;
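
The op-code semantics walked through in (e) can be modelled in a few lines.
This is only a sketch of how the examples read, not the kernel implementation:
apply_op is a hypothetical name, and the model ignores the case where the
system runs out of assignable counters.

```python
def apply_op(current: str, op: str, flags: str) -> str:
    """Return the new per-domain flags after applying '<op><flags>'.

    current and flags use the letters from the cover letter:
    't' (total), 'l' (local), or '_' (none).
    """
    if not flags or not set(flags) <= set("tl_"):
        raise ValueError(f"unknown flags: {flags!r}")
    if "_" in flags and (op != "=" or flags != "_"):
        # '_' is documented as only working with the '=' op-code.
        raise ValueError("'_' only works alone with the '=' op-code")

    cur = set(current) - {"_"}
    req = set(flags) - {"_"}
    if op == "=":
        new = req          # replace the assignment
    elif op == "+":
        new = cur | req    # assign additional events
    elif op == "-":
        new = cur - req    # unassign existing events
    else:
        raise ValueError(f"unknown op-code: {op!r}")

    # Emit flags in the order used by the listing, '_' when nothing is left.
    return "".join(f for f in "tl" if f in new) or "_"


if __name__ == "__main__":
    # Replays the four updates from the examples.
    print(apply_op("tl", "=", "t"))
    print(apply_op("tl", "-", "t"))
    print(apply_op("tl", "=", "_"))
    print(apply_op("t", "+", "l"))
```

Treating the flags as a set keeps "+" and "-" idempotent, so repeating a
request leaves the state unchanged in this model.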


f. Read the event mbm_total_bytes and mbm_local_bytes of the default group.
There is no change in reading the events with ABMC. If the event is unassigned
when reading, then the read will come back as "Unassigned".

# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
779247936
# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
765207488

g. Users will have the option to go back to legacy mbm_mode if required.
This can be done using the following command. Note that switching the
mbm_mode will reset all the mbm counters of all resctrl groups.

mbm -> MBM (throughout)


# echo "legacy" > /sys/fs/resctrl/info/L3_MON/mbm_mode
# cat /sys/fs/resctrl/info/L3_MON/mbm_mode
abmc
[legacy]

h. Check the bandwidth configuration for the group. Note that bandwidth
configuration has a domain scope. Total event defaults to 0x7F (to
count all the events) and local event defaults to 0x15 (to count all
the local numa events). The event bitmap decoding is available at
https://www.kernel.org/doc/Documentation/x86/resctrl.rst
in section "mbm_total_bytes_config", "mbm_local_bytes_config":

#cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
0=0x7f;1=0x7f

#cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
0=0x15;1=0x15

j. Change the bandwidth source for domain 0 for the total event to count only reads.
Note that this change affects total events on domain 0.

#echo 0=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
#cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
0=0x33;1=0x7F
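
For readers checking the bitmap values used in (h) and (j), the following
sketch parses the "<domain_id>=<mask>" output and decodes a mask into event
names. The bit descriptions are paraphrased from memory of the resctrl
documentation and should be verified against it; parse_config, decode and
MBM_EVENT_BITS are hypothetical names used only for illustration.

```python
from typing import Dict, List

# Assumed bit meanings, paraphrased from the resctrl documentation
# ("mbm_total_bytes_config", "mbm_local_bytes_config"); verify before relying
# on them.
MBM_EVENT_BITS = {
    0: "reads to memory in the local NUMA domain",
    1: "reads to memory in the non-local NUMA domain",
    2: "non-temporal writes to the local NUMA domain",
    3: "non-temporal writes to the non-local NUMA domain",
    4: "reads to slow memory in the local NUMA domain",
    5: "reads to slow memory in the non-local NUMA domain",
    6: "dirty victims from the QOS domain to all types of memory",
}


def parse_config(text: str) -> Dict[int, int]:
    """Parse '0=0x33;1=0x7F' into {0: 0x33, 1: 0x7f}."""
    result: Dict[int, int] = {}
    for item in text.strip().split(";"):
        if item:
            domain_id, mask = item.split("=", 1)
            result[int(domain_id)] = int(mask, 16)  # accepts the 0x prefix
    return result


def decode(mask: int) -> List[str]:
    """List the event names selected by the bits set in mask."""
    return [name for bit, name in sorted(MBM_EVENT_BITS.items())
            if mask & (1 << bit)]


if __name__ == "__main__":
    for domain_id, mask in parse_config("0=0x33;1=0x7F").items():
        print(domain_id, hex(mask), decode(mask))
```

Under these assumed bit meanings, 0x15 selects only local events and 0x33
selects only reads, which agrees with the descriptions in (h) and (j).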

k. Now read the total event again. The first read will come back with "Unavailable"
status. The subsequent read of mbm_total_bytes will display only the read events.

#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
Unavailable
#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
314101

l. Unmount the resctrl

#umount /sys/fs/resctrl/


Reinette