[PATCH v2 0/6] cgroup/cpuset: Support remote isolated partitions

From: Waiman Long
Date: Wed May 31 2023 - 12:35:38 EST

- [v1] https://lore.kernel.org/lkml/20230412153758.3088111-1-longman@xxxxxxxxxx/
- Drop the special "isolcpus" partition introduced in v1.
- Add the root-only "cpuset.cpus.reserve" control file for reserving
  CPUs used for remote isolated partitions.
- Update the test_cpuset_prs.sh test script and documentation.

This patch series introduces a new category of cpuset partition called
remote partitions. The existing partition category where the partition
roots have to be clustered around the root cgroup in a hierarchical way
is now referred to as adjacent partitions.

A remote partition can be formed far from the root cgroup with no
partition root parent. The only commonality is that the CPUs that are
used in the partition as specified in "cpuset.cpus" have to be present
in the "cpuset.cpus" of all its ancestors.
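
The ancestor constraint above can be illustrated with a small user-space
check. This is only a sketch of the rule as stated in this cover letter,
not the kernel's validation code. The helper names parse_cpulist() and
cpus_allowed_by_ancestors() are hypothetical, and an empty CPU list is
simply treated as an empty set.

```python
def parse_cpulist(text):
    """Parse a kernel-style CPU list such as "0-3,8" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate empty input or stray commas
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_allowed_by_ancestors(partition_cpus, ancestor_cpus):
    """Return True if every CPU wanted by the would-be remote partition
    also appears in the "cpuset.cpus" value of each ancestor.

    partition_cpus: CPU list string of the future partition root.
    ancestor_cpus:  CPU list strings of its ancestors, in any order.
    """
    wanted = parse_cpulist(partition_cpus)
    return all(wanted <= parse_cpulist(a) for a in ancestor_cpus)
```

For example, a cgroup asking for CPUs "2-3" passes the check when its
ancestors have "0-7" and "2-5", but fails when one ancestor has only
"3-5", because CPU 2 is missing there.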

It is relatively rare to have applications that require creation of
a separate scheduling domain (root). However, it is more common to
have applications that require the use of isolated CPUs (isolated),
e.g. DPDK. One can use the "isolcpus" or "nohz_full" boot command line
options to get that statically. The "isolated" partition is another
way to achieve that dynamically.

Modern container orchestration tools like Kubernetes use the cgroup
hierarchy to manage different containers, and they rely on other
middleware like systemd to help manage it. If a container needs to
use isolated CPUs, it is hard to get those with adjacent partitions,
as that requires the administrative parent cgroup to be a partition
root too, which tools like systemd may not be ready to manage.

With this patch series, a new root-cgroup-only "cpuset.cpus.reserve"
file is added to specify the set of CPUs that can be used in partitions
(whether remote or adjacent). To create a remote partition, the set
of CPUs to be used in that partition (the "cpuset.cpus" file of the
partition root) has to be reserved by manually adding them to that
control file first. Then that partition can be activated by writing
"isolated" into its "cpuset.cpus.partition". CPU reservation of adjacent
partitions is done automatically without touching "cpuset.cpus.reserve"
at all.
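
The two steps described above (reserve, then activate) could be scripted
roughly as follows. This is a sketch under stated assumptions, not a
tested recipe: the function names are hypothetical, the cgroup v2 mount
point is passed in by the caller, the cpuset controller is assumed to be
enabled along the path, and "cpuset.cpus.reserve" is assumed to accept a
plain CPU list with no earlier reservations that need preserving.

```python
import os


def write_ctrl(cgroup_dir, name, value):
    """Write a value to one cgroup control file."""
    with open(os.path.join(cgroup_dir, name), "w") as f:
        f.write(value)


def make_remote_isolated(cgroup_root, rel_path, cpus):
    """Turn the cgroup at rel_path into a remote isolated partition.

    cgroup_root: mount point of the cgroup v2 hierarchy (assumption:
                 commonly /sys/fs/cgroup).
    rel_path:    path of the future partition root, relative to the mount.
    cpus:        CPU list string, e.g. "2-3".
    """
    target = os.path.join(cgroup_root, rel_path)

    # CPUs the partition root will use.
    write_ctrl(target, "cpuset.cpus", cpus)

    # Step 1: reserve those CPUs in the root-only control file.
    # Assumption: a plain CPU list is accepted here.
    write_ctrl(cgroup_root, "cpuset.cpus.reserve", cpus)

    # Step 2: activate the partition.
    write_ctrl(target, "cpuset.cpus.partition", "isolated")
```

A caller would invoke it as, say,
make_remote_isolated("/sys/fs/cgroup", "a/b", "2-3") on a kernel with
this series applied. Taking the mount point as a parameter also allows
the write sequence to be exercised against a scratch directory.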

Currently, only remote isolated partitions are supported; we could
support a scheduling partition ("root") in the future if the need arises.
Additional isolation attributes like those with the "isolcpus" or "nohz"
boot command line options may be supported in the isolated partitions
in the future.

Waiman Long (6):
  cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE
  cgroup/cpuset: Improve temporary cpumasks handling
  cgroup/cpuset: Add cpuset.cpus.reserve for top cpuset
  cgroup/cpuset: Introduce remote isolated partition
  cgroup/cpuset: Documentation update for partition
  cgroup/cpuset: Extend test_cpuset_prs.sh to test remote partition

 Documentation/admin-guide/cgroup-v2.rst       |  92 ++-
 kernel/cgroup/cpuset.c                        | 749 +++++++++++++++---
 .../selftests/cgroup/test_cpuset_prs.sh       | 403 ++++++----
 3 files changed, 988 insertions(+), 256 deletions(-)