Re: [PATCH v3 3/7] sched/fair: Add cgroup_mode: max

From: Waiman Long

Date: Wed Jun 10 2026 - 12:00:57 EST

On 6/10/26 11:09 AM, Waiman Long wrote:

On 6/5/26 8:40 AM, Peter Zijlstra wrote:

In order to avoid the average CPU fraction avg(F_g_n) becoming tiny '1/N',
assume each cgroup is maximally concurrent and distribute 'N*weight', such
that:

    F_g_n' = N * F_g_n

Giving:

    avg(F_g_n') = N*avg(F_g_n) ~ N * 1/N = 1

And while this sounds like it solves things, remember what that ~ meant. There
is the corner case when a cgroup is minimally loaded, e.g. a single runnable
task, therefore limit the CPU fraction to that of a nice -20 task to avoid
getting too much load.

This last bit is what makes it different from a previous proposal to allow
raising cpu.weight to '100 * N', that would not limit the minimal concurrency
case and results in a very large F_g_n. And just like F_g_n << 1 is
problematic, so is F_g_n >> 1 for the exact same reasons (it would drown the
kthreads, but it also risks overflowing the load values).
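
To make the scaling and the clip concrete, here is a minimal userspace sketch.
It is not the code from this patch. It assumes the group's 'N*weight' is split
evenly over the CPUs that currently have runnable tasks, and that the result on
any one CPU is clipped to the load weight of a nice -20 task (88761). The helper
name max_mode_cpu_weight() and the 'busy' parameter are hypothetical.

```c
/* Load weight of a nice -20 task in the scheduler's weight table. */
#define NICE_M20_WEIGHT 88761UL

/*
 * Hypothetical helper, for illustration only.
 *
 * weight:  group weight (1024 corresponds to the default cpu.weight of 100)
 * nr_cpus: number of CPUs the group is allowed to use (N)
 * busy:    number of those CPUs that have runnable tasks, assumed to be
 *          evenly loaded (an assumption of this sketch); must be >= 1
 *
 * Returns the weight the group would contribute on one busy CPU.
 */
static unsigned long max_mode_cpu_weight(unsigned long weight,
                                         unsigned int nr_cpus,
                                         unsigned int busy)
{
    /* Distribute N * weight over the busy CPUs. */
    unsigned long per_cpu = weight * nr_cpus / busy;

    /* Clip so a minimally loaded group cannot exceed a nice -20 task. */
    return per_cpu < NICE_M20_WEIGHT ? per_cpu : NICE_M20_WEIGHT;
}
```

With weight 1024 and N = 64, a fully concurrent group (busy = 64) gets 1024 per
CPU, while a single runnable task (busy = 1) gets 65536. With N = 256 the
single-task case would be 262144 and is clipped to 88761.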

So while this might appear to be a better scheme than the current default
scheme, it doesn't really handle less than maximal concurrency nicely -- it
clips and introduces artificially large weights. So where the traditional SMP
mode works well when nr_tasks << nr_cpus, MAX doesn't work well in that regime
and vice-versa.

The meaning of "cpu.weight" would be: weight per allowed CPU.

Included for completeness (and infrastructure).

Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
---
include/linux/cpuset.h |    6 +++++
kernel/cgroup/cpuset.c |   15 ++++++++++++++
kernel/sched/debug.c   |    1
kernel/sched/fair.c    |   52 ++++++++++++++++++++++++++++++++++++++++++++-----
4 files changed, 69 insertions(+), 5 deletions(-)

--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -80,6 +80,7 @@ extern void lockdep_assert_cpuset_lock_h
extern void cpuset_cpus_allowed_locked(struct task_struct *p, struct cpumask *mask);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
+extern int cpuset_num_cpus(struct cgroup *cgroup);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
@@ -216,6 +217,11 @@ static inline bool cpuset_cpus_allowed_f
      return false;
}
+static inline int cpuset_num_cpus(struct cgroup *cgroup)
+{
+    return num_online_cpus();
+}
+
static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
{
      return node_possible_map;
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4116,6 +4116,21 @@ bool cpuset_cpus_allowed_fallback(struct
      return changed;
}
+int cpuset_num_cpus(struct cgroup *cgrp)
+{
+    int nr = num_online_cpus();
+    struct cpuset *cs;
+
+    if (is_in_v2_mode()) {
+        guard(rcu)();
+        cs = css_cs(cgroup_e_css(cgrp, &cpuset_cgrp_subsys));
+        if (cs)
+            nr = cpumask_weight(cs->effective_cpus);
+    }
+
+    return nr;
+}

I just have a question about cgroup v1 support. I am assuming that cgroup v1 without the cpuset_v2_mode mount option is not supported. To fully support cgroup v1, you may have to use guarantee_active_cpus() to get the actual set of CPUs that the task can run on. There is also a caveat about the arm64-specific task_cpu_possible_mask(): on certain arm64 CPUs, 32-bit binaries running on 64-bit cores are allowed only on a selected subset of the cores within the CPU.

This is probably not what you want to focus on right now, but it would be good to have a comment listing the items that are not fully supported here.
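
As a rough illustration of why the effective_cpus count alone can overestimate, here is a small userspace model. It is not kernel code and does not reproduce guarantee_active_cpus(); it only assumes that the CPUs a task can really use are the intersection of the cpuset's effective mask, the active CPUs, and the task's possible mask. The 64-bit bitmaps and the helper names are hypothetical.

```c
#include <stdint.h>

/* Count set bits in a 64-bit CPU bitmap (one bit per CPU). */
static int count_cpus(uint64_t mask)
{
    int n = 0;

    while (mask) {
        mask &= mask - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * Hypothetical model: the CPUs a task can use are those present in all
 * three masks. task_possible models the arm64 restriction where 32-bit
 * binaries may only run on a subset of the cores.
 */
static int usable_cpus(uint64_t effective, uint64_t active,
                       uint64_t task_possible)
{
    return count_cpus(effective & active & task_possible);
}
```

For example, with an effective mask of 0xFF (8 CPUs), an active mask of 0x7F, and a task possible mask of 0x0F, only 4 CPUs are usable even though the effective mask has a weight of 8.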

FYI, you may have to take the callback_lock to ensure the stability of the effective_cpus mask.

Cheers,
Longman