[RFD] resctrl: reassigning a running container's CTRL_MON group

From: Peter Newman
Date: Fri Oct 07 2022 - 06:39:54 EST

Hi Reinette, Fenghua,

I'd like to talk about the tasks file interface in CTRL_MON and MON
groups.

For some background, we are using the memory-bandwidth monitoring and
allocation features of resctrl to maintain QoS on external memory
bandwidth for latency-sensitive containers to help enable batch
containers to use up leftover CPU/memory resources on a machine. We
also monitor the external memory bandwidth usage of all hosted
containers to identify ones which are misusing their latency-sensitive
CoS assignment and downgrade them to the batch CoS.

The trouble is, container manager developers working with the tasks
interface have complained that it's not usable for them: because a
container's task list keeps changing while they iterate over it, moving
all of its tasks over can take many (or an unbounded number of) passes.
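
To illustrate the complaint, here is a minimal userspace sketch of the
multi-pass move. It is not any container manager's actual code: the
function names and the max_passes bound are made up for illustration,
and the caller is assumed to pass the paths of the two groups' tasks
files. Each pass reads every TID listed in the source group and writes
it to the destination group, one TID per write.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * One pass: read each TID listed in src_tasks and write it to dst_tasks,
 * one TID per write(). Returns the number of TIDs seen in src_tasks, or
 * -1 if either file cannot be opened.
 */
static int move_tasks_one_pass(const char *src_tasks, const char *dst_tasks)
{
	FILE *src = fopen(src_tasks, "r");
	if (!src)
		return -1;

	int dst = open(dst_tasks, O_WRONLY);
	if (dst < 0) {
		fclose(src);
		return -1;
	}

	int seen = 0;
	int tid;
	while (fscanf(src, "%d", &tid) == 1) {
		seen++;
		/* A failed write (e.g. the task already exited) is skipped. */
		(void)dprintf(dst, "%d\n", tid);
	}

	close(dst);
	fclose(src);
	return seen;
}

/*
 * Repeat passes until the source group is observed empty. Returns the
 * number of passes used, 0 if max_passes was not enough, -1 on error.
 * (Hypothetical helper: max_passes is an arbitrary cutoff, since nothing
 * bounds the number of passes while the container keeps forking.)
 */
static int move_all_tasks(const char *src_tasks, const char *dst_tasks,
			  int max_passes)
{
	for (int pass = 1; pass <= max_passes; pass++) {
		int seen = move_tasks_one_pass(src_tasks, dst_tasks);
		if (seen < 0)
			return -1;
		if (seen == 0)
			return pass;
	}
	return 0;
}
```

Tasks created in the source group between the read and the end of a pass
are only picked up by a later pass, which is why the loop has no natural
upper bound.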

Our solution for them is to remove the need for moving tasks between
CTRL_MON groups. Because we are mainly using MB throttling to implement
QoS, we only need two classes of service. Therefore we've modified
resctrl to reuse existing CLOSIDs for CTRL_MON groups with identical
configurations, allowing us to create a CTRL_MON group for every
container. Instead of moving the tasks over, we only need to update
their CTRL_MON group's schemata. Another benefit for us is that we do
not need to also move all of the tasks over to a new monitoring group in
the batch CTRL_MON group, and the usage counts remain intact.

The CLOSID management rules would roughly be:

1. If an update would cause a CTRL_MON group's config to match that of
   an existing group, the CTRL_MON group's CLOSID should change to that
   of the existing group, where the definition of "match" is: all
   control values match in all domains for all resources, as well as
   the CPU masks matching.

2. If an update to a CTRL_MON group sharing a CLOSID with another group
   causes that group to no longer match any others, a new CLOSID must
   be allocated.

3. An update to a CTRL_MON group using a non-shared CLOSID which
   continues to not match any others follows the current resctrl
   behavior.
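
The three rules above could be sketched roughly as follows. This is a
simplified standalone model, not the modified resctrl code: the struct
and function names are hypothetical, it assumes a single resource with
NUM_DOMAINS domains, a 64-bit CPU mask, and NUM_CLOSIDS hardware
CLOSIDs, and it hands out the lowest CLOSID no other group is using.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_DOMAINS 2	/* assumption: domains of the one modeled resource */
#define NUM_CLOSIDS 4	/* assumption: CLOSIDs supported by the hardware */

struct group_cfg {
	uint32_t ctrl[NUM_DOMAINS];	/* control value per domain */
	uint64_t cpu_mask;
};

struct ctrl_mon_group {
	bool in_use;
	int closid;
	struct group_cfg cfg;
};

/* "Match": all control values in all domains and the CPU masks are equal. */
static bool cfg_match(const struct group_cfg *a, const struct group_cfg *b)
{
	for (int d = 0; d < NUM_DOMAINS; d++)
		if (a->ctrl[d] != b->ctrl[d])
			return false;
	return a->cpu_mask == b->cpu_mask;
}

/* Number of in-use groups other than 'self' currently using 'closid'. */
static int closid_sharers(const struct ctrl_mon_group *g, int n, int self,
			  int closid)
{
	int count = 0;

	for (int i = 0; i < n; i++)
		if (i != self && g[i].in_use && g[i].closid == closid)
			count++;
	return count;
}

/* Lowest CLOSID not used by any group other than 'self', or -1 if none. */
static int closid_alloc(const struct ctrl_mon_group *g, int n, int self)
{
	for (int c = 0; c < NUM_CLOSIDS; c++)
		if (closid_sharers(g, n, self, c) == 0)
			return c;
	return -1;
}

/*
 * Apply new_cfg to group 'self'. Returns the group's resulting CLOSID,
 * or -1 (leaving the group unchanged) if a new CLOSID is needed but
 * none is free.
 */
static int group_update(struct ctrl_mon_group *g, int n, int self,
			const struct group_cfg *new_cfg)
{
	/* Rule 1: the new config matches another group, so share its CLOSID. */
	for (int i = 0; i < n; i++) {
		if (i != self && g[i].in_use && cfg_match(&g[i].cfg, new_cfg)) {
			g[self].cfg = *new_cfg;
			g[self].closid = g[i].closid;
			return g[self].closid;
		}
	}

	/* Rule 2: was sharing, no longer matches anyone: allocate a CLOSID. */
	if (closid_sharers(g, n, self, g[self].closid) > 0) {
		int c = closid_alloc(g, n, self);

		if (c < 0)
			return -1;
		g[self].closid = c;
	}

	/* Rule 3: non-shared CLOSID is kept; only the config changes. */
	g[self].cfg = *new_cfg;
	return g[self].closid;
}
```

In this model, downgrading a container is a single group_update() call
with the batch config: the group picks up the batch group's CLOSID
without any of its tasks being moved.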

Before I prepare any patches for review, I'm interested in any comments
or suggestions on the use case and solution.

Are there simpler strategies for reassigning a running container's tasks
to a different CTRL_MON group that we should be considering first?

Any concerns about the CLOSID-reusing behavior? The hope is existing
users who aren't creating identically-configured CTRL_MON groups would
be minimally impacted. Would it help if the proposed behavior were
opt-in at mount-time?