Re: [PATCH v8 25/25] x86/resctrl: Introduce interface to modify assignment states of the groups

From: Reinette Chatre
Date: Tue Oct 15 2024 - 23:43:59 EST


Hi Babu,

On 10/9/24 10:39 AM, Babu Moger wrote:
> Introduce the interface to assign MBM events in mbm_cntr_assign mode.
>
> Events can be enabled or disabled by writing to file
> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
> Format is similar to the list format with addition of opcode for the
> assignment operation.
> "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
>
> Format for specific type of groups:
>
> * Default CTRL_MON group:
> "//<domain_id><opcode><flags>"
>
> * Non-default CTRL_MON group:
> "<CTRL_MON group>//<domain_id><opcode><flags>"
>
> * Child MON group of default CTRL_MON group:
> "/<MON group>/<domain_id><opcode><flags>"
>
> * Child MON group of non-default CTRL_MON group:
> "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
>
> Domain_id '*' will apply the flags on all the domains.
>
> Opcode can be one of the following:
>
> = Update the assignment to match the flags
> + Assign a new MBM event without impacting existing assignments.
> - Unassign a MBM event from currently assigned events.
>
> Assignment flags can be one of the following:
> t MBM total event
> l MBM local event
> tl Both total and local MBM events
> _ None of the MBM events. Valid only with '=' opcode. This flag cannot
> be combined with other flags.
>
> Signed-off-by: Babu Moger <babu.moger@xxxxxxx>
> ---
> v8: Moved unassign as the first action during the assign modification.
> Assign none "_" takes priority. Cannot be mixed with other flags.
> Updated the documentation and .rst file format. htmldoc looks ok.
>
> v7: Simplified the parsing (strsep(&token, "//") in rdtgroup_mbm_assign_control_write().
> Added mutex lock in rdtgroup_mbm_assign_control_write() while processing.
> Renamed rdtgroup_find_grp to rdtgroup_find_grp_by_name.
> Fixed rdtgroup_str_to_mon_state to return error for invalid flags.
> Simplified the calls rdtgroup_assign_cntr by merging few functions earlier.
> Removed ABMC reference in FS code.
> Reinette commented about handling the combination of flags like 'lt_' and '_lt'.
> Not sure if we need to change the behaviour here. Processed them sequencially right now.
> Users have the liberty to pass the flags. Restricting it might be a problem later.
>
> v6: Added support assign all if domain id is '*'
> Fixed the allocation of counter id if it not assigned already.
>
> v5: Interface name changed from mbm_assign_control to mbm_control.
> Fixed opcode and flags combination.
> '=_" is valid.
> "-_" amd "+_" is not valid.
> Minor message update.
> Renamed the function with prefix - rdtgroup_.
> Corrected few documentation mistakes.
> Rebase related changes after SNC support.
>
> v4: Added domain specific assignments. Fixed the opcode parsing.
>
> v3: New patch.
> Addresses the feedback to provide the global assignment interface.
> https://lore.kernel.org/lkml/c73f444b-83a1-4e9a-95d3-54c5165ee782@xxxxxxxxx/
> ---
> Documentation/arch/x86/resctrl.rst | 115 +++++++++++-
> arch/x86/kernel/cpu/resctrl/internal.h | 10 ++
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 233 ++++++++++++++++++++++++-
> 3 files changed, 356 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index b85d3bc3e301..77bb0b095127 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -336,7 +336,8 @@ with the following files:
> t MBM total event is assigned.
> l MBM local event is assigned.
> tl Both total and local MBM events are assigned.
> - _ None of the MBM events are assigned.
> + _ None of the MBM events are assigned. Only works with opcode '=' for write
> + and cannot be combined with other flags.
>
> Examples:
> ::
> @@ -354,6 +355,118 @@ with the following files:
> There are four resctrl groups. All the groups have total and local MBM events
> assigned on domain 0 and 1.
>
> + Assignment state can be updated by writing to the interface.
> +
> + Format is similar to the list format with addition of opcode for the
> + assignment operation.
> +
> + "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
> +
> + Format for each type of groups:
> +
> + * Default CTRL_MON group:
> + "//<domain_id><opcode><flags>"
> +
> + * Non-default CTRL_MON group:
> + "<CTRL_MON group>//<domain_id><opcode><flags>"
> +
> + * Child MON group of default CTRL_MON group:
> + "/<MON group>/<domain_id><opcode><flags>"
> +
> + * Child MON group of non-default CTRL_MON group:
> + "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
> +
> + Domain_id '*' will apply the flags on all the domains.
> +
> + Opcode can be one of the following:
> + ::
> +
> + = Update the assignment to match the MBM event.
> + + Assign a new MBM event without impacting existing assignments.
> + - Unassign a MBM event from currently assigned events.
> +
> + Examples:
> + Initial group status:
> + ::
> +
> + # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> + non_default_ctrl_mon_grp//0=tl;1=tl;
> + non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
> + //0=tl;1=tl;
> + /child_default_mon_grp/0=tl;1=tl;
> +
> + To update the default group to assign only total MBM event on domain 0:
> + ::
> +
> + # echo "//0=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +
> + Assignment status after the update:
> + ::
> +
> + # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> + non_default_ctrl_mon_grp//0=tl;1=tl;
> + non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
> + //0=t;1=tl;
> + /child_default_mon_grp/0=tl;1=tl;
> +
> + To update the MON group child_default_mon_grp to remove total MBM event on domain 1:
> + ::
> +
> + # echo "/child_default_mon_grp/1-t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +
> + Assignment status after the update:
> + ::
> +
> + $ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control

Please be consistent by always using "# cat", not sometimes "$ cat" as above.

> + non_default_ctrl_mon_grp//0=tl;1=tl;
> + non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
> + //0=t;1=tl;
> + /child_default_mon_grp/0=tl;1=l;
> +
> + To update the MON group non_default_ctrl_mon_grp/child_non_default_mon_grp to unassign
> + both local and total MBM events on domain 1:
> + ::
> +
> + # echo "non_default_ctrl_mon_grp/child_non_default_mon_grp/1=_" >
> + /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +
> + Assignment status after the update:
> + ::
> +

Missing "# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control"

> + non_default_ctrl_mon_grp//0=tl;1=tl;
> + non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
> + //0=t;1=tl;
> + /child_default_mon_grp/0=tl;1=l;
> +
> + To update the default group to add a local MBM event domain 0.
> + ::
> +
> + # echo "//0+l" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +
> + Assignment status after the update:
> + ::
> +
> + # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> + non_default_ctrl_mon_grp//0=tl;1=tl;
> + non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
> + //0=tl;1=tl;
> + /child_default_mon_grp/0=tl;1=l;
> +
> + To update the non default CTRL_MON group non_default_ctrl_mon_grp to unassign all the
> + MBM events on all the domains.
> + ::
> +
> + # echo "non_default_ctrl_mon_grp//*=_" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> +
> + Assignment status after the update:
> + ::
> +
> + #cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control

Please be consistent with spacing "# cat" vs "#cat". This is very noticeable when
viewing the formatted docs.

> + non_default_ctrl_mon_grp//0=_;1=_;
> + non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
> + //0=tl;1=tl;
> + /child_default_mon_grp/0=tl;1=l;
> +
> "max_threshold_occupancy":
> Read/write file provides the largest value (in
> bytes) at which a previously used LLC_occupancy
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index a6f40d3115f4..e8d6a430dc4a 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -74,6 +74,16 @@
> */
> #define MBM_EVENT_ARRAY_INDEX(_event) ((_event) - 2)
>
> +/*
> + * Assignment flags for mbm_cntr_assign feature
> + */
> +enum {
> + ASSIGN_NONE = 0,
> + ASSIGN_TOTAL = BIT(QOS_L3_MBM_TOTAL_EVENT_ID),
> + ASSIGN_LOCAL = BIT(QOS_L3_MBM_LOCAL_EVENT_ID),
> + ASSIGN_INVALID,
> +};
> +
> /**
> * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that
> * aren't marked nohz_full
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index cf92ceb0f05e..6095146e3ba4 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1040,6 +1040,236 @@ static int rdtgroup_mbm_assign_control_show(struct kernfs_open_file *of,
> return 0;
> }
>
> +static int rdtgroup_str_to_mon_state(char *flag)
> +{
> + int i, mon_state = ASSIGN_NONE;
> +
> + for (i = 0; i < strlen(flag); i++) {
> + switch (*(flag + i)) {
> + case 't':
> + mon_state |= ASSIGN_TOTAL;
> + break;
> + case 'l':
> + mon_state |= ASSIGN_LOCAL;
> + break;
> + case '_':
> + return ASSIGN_NONE;
> + default:
> + return ASSIGN_INVALID;
> + }
> + }
> +
> + return mon_state;
> +}
> +
> +static struct rdtgroup *rdtgroup_find_grp_by_name(enum rdt_group_type rtype,
> + char *p_grp, char *c_grp)
> +{
> + struct rdtgroup *rdtg, *crg;
> +
> + if (rtype == RDTCTRL_GROUP && *p_grp == '\0') {
> + return &rdtgroup_default;
> + } else if (rtype == RDTCTRL_GROUP) {
> + list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list)
> + if (!strcmp(p_grp, rdtg->kn->name))
> + return rdtg;
> + } else if (rtype == RDTMON_GROUP) {
> + list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
> + if (!strcmp(p_grp, rdtg->kn->name)) {
> + list_for_each_entry(crg, &rdtg->mon.crdtgrp_list,
> + mon.crdtgrp_list) {
> + if (!strcmp(c_grp, crg->kn->name))
> + return crg;
> + }
> + }
> + }
> + }
> +
> + return NULL;
> +}
> +
> +static int rdtgroup_process_flags(struct rdt_resource *r,
> + enum rdt_group_type rtype,
> + char *p_grp, char *c_grp, char *tok)
> +{
> + int op, mon_state, assign_state, unassign_state;
> + char *dom_str, *id_str, *op_str;
> + struct rdt_mon_domain *d;
> + struct rdtgroup *rdtgrp;
> + unsigned long dom_id;
> + int ret, found = 0;
> +
> + rdtgrp = rdtgroup_find_grp_by_name(rtype, p_grp, c_grp);
> +
> + if (!rdtgrp) {
> + rdt_last_cmd_puts("Not a valid resctrl group\n");
> + return -EINVAL;
> + }
> +
> +next:
> + if (!tok || tok[0] == '\0')
> + return 0;
> +
> + /* Start processing the strings for each domain */
> + dom_str = strim(strsep(&tok, ";"));
> +
> + op_str = strpbrk(dom_str, "=+-");
> +
> + if (op_str) {
> + op = *op_str;
> + } else {
> + rdt_last_cmd_puts("Missing operation =, +, - character\n");
> + return -EINVAL;
> + }
> +
> + id_str = strsep(&dom_str, "=+-");
> +
> + /* Check for domain id '*' which means all domains */
> + if (id_str && *id_str == '*') {
> + d = NULL;
> + goto check_state;
> + } else if (!id_str || kstrtoul(id_str, 10, &dom_id)) {
> + rdt_last_cmd_puts("Missing domain id\n");
> + return -EINVAL;
> + }
> +
> + /* Verify if the dom_id is valid */
> + list_for_each_entry(d, &r->mon_domains, hdr.list) {
> + if (d->hdr.id == dom_id) {
> + found = 1;
> + break;
> + }
> + }
> +
> + if (!found) {
> + rdt_last_cmd_printf("Invalid domain id %ld\n", dom_id);
> + return -EINVAL;
> + }
> +
> +check_state:
> + mon_state = rdtgroup_str_to_mon_state(dom_str);
> +
> + if (mon_state == ASSIGN_INVALID) {
> + rdt_last_cmd_puts("Invalid assign flag\n");
> + goto out_fail;
> + }
> +
> + assign_state = 0;
> + unassign_state = 0;
> +
> + switch (op) {
> + case '+':
> + if (mon_state == ASSIGN_NONE) {
> + rdt_last_cmd_puts("Invalid assign opcode\n");
> + goto out_fail;
> + }
> + assign_state = mon_state;
> + break;
> + case '-':
> + if (mon_state == ASSIGN_NONE) {
> + rdt_last_cmd_puts("Invalid assign opcode\n");
> + goto out_fail;
> + }
> + unassign_state = mon_state;
> + break;
> + case '=':
> + assign_state = mon_state;
> + unassign_state = (ASSIGN_TOTAL | ASSIGN_LOCAL) & ~assign_state;
> + break;
> + default:
> + break;
> + }
> +
> + if (unassign_state & ASSIGN_TOTAL) {
> + ret = rdtgroup_unassign_cntr_event(r, rdtgrp, d, QOS_L3_MBM_TOTAL_EVENT_ID);
> + if (ret)
> + goto out_fail;
> + }
> +
> + if (unassign_state & ASSIGN_LOCAL) {
> + ret = rdtgroup_unassign_cntr_event(r, rdtgrp, d, QOS_L3_MBM_LOCAL_EVENT_ID);
> + if (ret)
> + goto out_fail;
> + }
> +
> + if (assign_state & ASSIGN_TOTAL) {
> + ret = rdtgroup_assign_cntr_event(r, rdtgrp, d, QOS_L3_MBM_TOTAL_EVENT_ID);
> + if (ret)
> + goto out_fail;
> + }
> +
> + if (assign_state & ASSIGN_LOCAL) {
> + ret = rdtgroup_assign_cntr_event(r, rdtgrp, d, QOS_L3_MBM_LOCAL_EVENT_ID);
> + if (ret)
> + goto out_fail;
> + }
> +
> + goto next;
> +
> +out_fail:

Is it possible to print a message to the command status to give some details about which
request failed? I am wondering about a scenario where a user changes multiple domains of
multiple groups, since the operation does not undo changes, it will fail without information
to user space about which setting triggered the failure and which settings succeeded.
This is similar to what is done when user attempts to move several tasks ... the error will
indicate which task triggered failure so that user space knows what completed successfully.

> +
> + return -EINVAL;
> +}
> +
> +static ssize_t rdtgroup_mbm_assign_control_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes, loff_t off)
> +{
> + struct rdt_resource *r = of->kn->parent->priv;
> + char *token, *cmon_grp, *mon_grp;
> + enum rdt_group_type rtype;
> + int ret;
> +
> + /* Valid input requires a trailing newline */
> + if (nbytes == 0 || buf[nbytes - 1] != '\n')
> + return -EINVAL;
> +
> + buf[nbytes - 1] = '\0';
> +
> + cpus_read_lock();
> + mutex_lock(&rdtgroup_mutex);
> +
> + if (!resctrl_arch_mbm_cntr_assign_enabled(r)) {
> + rdt_last_cmd_puts("mbm_cntr_assign mode is not enabled\n");

Writing to last_cmd_status_buf here ...

> + mutex_unlock(&rdtgroup_mutex);
> + cpus_read_unlock();
> + return -EINVAL;
> + }
> +
> + rdt_last_cmd_clear();

... but initializing buffer here.
Sidenote: This was an issue before. If you receive comments about
items in patches, please do check if those comments apply to other patches also.

> +
> + while ((token = strsep(&buf, "\n")) != NULL) {
> + if (strstr(token, "/")) {
> + /*
> + * The write command follows the following format:
> + * “<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>”
> + * Extract the CTRL_MON group.
> + */
> + cmon_grp = strsep(&token, "/");
> +
> + /*
> + * Extract the MON_GROUP.
> + * strsep returns empty string for contiguous delimiters.
> + * Empty mon_grp here means it is a RDTCTRL_GROUP.
> + */
> + mon_grp = strsep(&token, "/");
> +
> + if (*mon_grp == '\0')
> + rtype = RDTCTRL_GROUP;
> + else
> + rtype = RDTMON_GROUP;
> +
> + ret = rdtgroup_process_flags(r, rtype, cmon_grp, mon_grp, token);
> + if (ret)
> + break;
> + }
> + }
> +
> + mutex_unlock(&rdtgroup_mutex);
> + cpus_read_unlock();
> +
> + return ret ?: nbytes;
> +}
> +
> #ifdef CONFIG_PROC_CPU_RESCTRL
>
> /*
> @@ -2328,9 +2558,10 @@ static struct rftype res_common_files[] = {
> },
> {
> .name = "mbm_assign_control",
> - .mode = 0444,
> + .mode = 0644,
> .kf_ops = &rdtgroup_kf_single_ops,
> .seq_show = rdtgroup_mbm_assign_control_show,
> + .write = rdtgroup_mbm_assign_control_write,
> },
> {
> .name = "cpus_list",

On a high level this looks ok but this code needs to be more robust. This will parse
data from user space that may include all kinds of input ... think malicious user or
a buggy script. I am not able to test this code but I tried to work through what will
happen under some wrong input and found some issues. For example, if user space provides
input like '//\n' then rdtgroup_process_flags() will be called with token == NULL. This will
result in rdtgroup_process_flags() returning "success", but fortunately do nothing, for
this invalid input. A more severe example is with input like '//0=\n', from what I can tell
this will result in rdtgroup_str_to_mon_state() called with dom_str==NULL that will treat
this as ASSIGN_NONE and proceed as if user provided '//0=_'.
This was just some scenarios with basic input that could be typos, no real stress tests.
I stopped here though since I believe it is already clear this needs to be more robust.
Please do test this interface by exercising it with invalid input and corner cases.

Reinette