Re: [PATCH v11 23/23] x86/resctrl: Introduce interface to modify assignment states of the groups

From: Moger, Babu
Date: Wed Feb 19 2025 - 19:34:59 EST


Hi Dave,

On 2/19/2025 10:07 AM, Dave Martin wrote:
Hi,

On Wed, Jan 22, 2025 at 02:20:31PM -0600, Babu Moger wrote:
When mbm_cntr_assign mode is enabled, users can designate which of the MBM
events in the CTRL_MON or MON groups should have counters assigned.

Provide an interface for assigning MBM events by writing to the file:
/sys/fs/resctrl/info/L3_MON/mbm_assign_control. Using this interface,
events can be assigned or unassigned as needed.

Format is similar to the list format with addition of opcode for the
assignment operation.
"<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"

[...]

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 6e29827239e0..299839bcf23f 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1050,6 +1050,244 @@ static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,

[...]

+static ssize_t resctrl_mbm_assign_control_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct rdt_resource *r = of->kn->parent->priv;
+ char *token, *cmon_grp, *mon_grp;
+ enum rdt_group_type rtype;
+ int ret;
+
+ /* Valid input requires a trailing newline */
+ if (nbytes == 0 || buf[nbytes - 1] != '\n')
+ return -EINVAL;
+
+ buf[nbytes - 1] = '\0';
+
+ cpus_read_lock();
+ mutex_lock(&rdtgroup_mutex);
+
+ rdt_last_cmd_clear();
+
+ if (!resctrl_arch_mbm_cntr_assign_enabled(r)) {
+ rdt_last_cmd_puts("mbm_cntr_assign mode is not enabled\n");
+ mutex_unlock(&rdtgroup_mutex);
+ cpus_read_unlock();
+ return -EINVAL;
+ }
+
+ while ((token = strsep(&buf, "\n")) != NULL) {
+ /*
+ * The write command follows the following format:
+ * “<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>”
+ * Extract the CTRL_MON group.
+ */
+ cmon_grp = strsep(&token, "/");
+

As when reading this file, I think that the data can grow larger than a
page and get split into multiple write() calls.

I don't currently think the file needs to be redesigned, but there are
some concerns about how userspace will work with it that need to be
sorted out.

Every monitoring group can contribute a line to this file:

CTRL_GROUP / MON_GROUP / DOMAIN = [t][l] [ ; DOMAIN = [t][l] ]* LF

so, 2 * (NAME_MAX + 1) + NUM_DOMAINS * 5 - 1 + 1

NAME_MAX on Linux is 255, so with, say, up to 16 domains, that's about
600 bytes per monitoring group in the worst case.

We don't need to have many control and monitoring groups for this to
grow potentially over 4K.


We could simply place a limit on how much userspace is allowed to write
to this file in one go, although this restriction feels difficult for
userspace to follow -- but maybe this is workable in the short term, on
current systems (?)

Otherwise, since we expect this interface to be written using scripting
languages, I think we need to be prepared to accept fully-buffered
I/O. That means that the data may be cut at random places, not
necessarily at newlines. (For smaller files such as schemata this is
not such an issue, since the whole file is likely to be small enough to
fit into the default stdio buffers -- this is how sysfs gets away with
it IIUC.)

For fully-buffered I/O, we may have to cache an incomplete line in
between write() calls. If there is a dangling incomplete line when the
file is closed then it is hard to tell userspace, because people often
don't bother to check the return value of close(), fclose() etc.
However, since it's an ABI violation for userspace to end this file
with a partial line, I think it's sufficient to report that via
last_cmd_status. (Making close() return -EIO still seems a good idea
though, just in case userspace is listening.)

Seems like we can add a check in resctrl_mbm_assign_control_write() to compare nbytes > PAGE_SIZE.

But do we really need this? I have no way of testing this. Help me understand.

All these file operations go thru generic call kernfs_fop_write_iter(). Doesn't it take care of buffer check and overflow?



I hacked up something a bit like this so that schemata could be written
interactively from the shell, so I can try to port that onto this series
as an illustration, if it helps.

Cheers
---Dave


Thanks
Babu