Re: [PATCH v3 3/7] sched/fair: Add cgroup_mode: max
From: Waiman Long
Date: Wed Jun 10 2026 - 12:00:57 EST
On 6/10/26 11:09 AM, Waiman Long wrote:
On 6/5/26 8:40 AM, Peter Zijlstra wrote:
In order to avoid the average CPU fraction avg(F_g_n) becoming tiny '1/N',
assume each cgroup is maximally concurrent and distribute 'N*weight', such
that:
F_g_n' = N * F_g_n
Giving:
avg(F_g_n') = N*avg(F_g_n) ~ N * 1/N = 1
And while this sounds like it solves things, remember what that ~ meant. There
is the corner case when a cgroup is minimally loaded, e.g. a single runnable
task, therefore limit the CPU fraction to that of a nice -20 task to avoid
getting too much load.
This last bit is what makes it different from a previous proposal to allow
raising cpu.weight to '100 * N', that would not limit the minimal concurrency
case and results in a very large F_g_n. And just like F_g_n << 1 is
problematic, so is F_g_n >> 1 for the exact same reasons (it would drown the
kthreads, but it also risks overflowing the load values).
So while this might appear to be a better scheme than the current default
scheme, it doesn't really handle less than maximal concurrency nicely -- it
clips and introduces artificially large weights. So where the traditional SMP
mode works well when nr_tasks << nr_cpus, MAX doesn't work well in that regime
and vice-versa.
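The scaling and the clip described above can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the code this patch adds to kernel/sched/fair.c: the helper name max_mode_cpu_weight() and its parameters are hypothetical, and the group weight is assumed to be split across CPUs in proportion to load. The two constants are the nice 0 and nice -20 entries of sched_prio_to_weight[].

```c
/* Values from the kernel's sched_prio_to_weight[] table. */
#define NICE_0_WEIGHT		1024LL
#define NICE_MINUS_20_WEIGHT	88761LL

/*
 * Hypothetical helper, not taken from the patch: the weight a group gets
 * on one CPU when its weight is scaled by the number of allowed CPUs and
 * the result is clipped to the weight of a nice -20 task.
 */
static long long max_mode_cpu_weight(long long group_weight, int nr_cpus,
				     long long cpu_load, long long group_load)
{
	long long weight;

	if (group_load <= 0)
		return 0;

	/* F_g_n' = N * F_g_n: distribute N * weight in proportion to load. */
	weight = group_weight * nr_cpus * cpu_load / group_load;

	/* Minimal concurrency: never exceed a nice -20 task. */
	if (weight > NICE_MINUS_20_WEIGHT)
		weight = NICE_MINUS_20_WEIGHT;

	return weight;
}
```

With a group weight of NICE_0_WEIGHT on 128 CPUs, a group running one equally loaded task per CPU ends up with 1024 on each CPU, while a group with a single runnable task would get 131072 unclipped and is limited to 88761.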
The meaning of "cpu.weight" would be: weight per allowed CPU.
Included for completeness (and infrastructure).
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
---
include/linux/cpuset.h | 6 +++++
kernel/cgroup/cpuset.c | 15 ++++++++++++++
kernel/sched/debug.c | 1
kernel/sched/fair.c | 52 ++++++++++++++++++++++++++++++++++++++++++++-----
4 files changed, 69 insertions(+), 5 deletions(-)
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -80,6 +80,7 @@ extern void lockdep_assert_cpuset_lock_h
extern void cpuset_cpus_allowed_locked(struct task_struct *p, struct cpumask *mask);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
+extern int cpuset_num_cpus(struct cgroup *cgroup);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
@@ -216,6 +217,11 @@ static inline bool cpuset_cpus_allowed_f
return false;
}
+static inline int cpuset_num_cpus(struct cgroup *cgroup)
+{
+	return num_online_cpus();
+}
+
static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
{
return node_possible_map;
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4116,6 +4116,21 @@ bool cpuset_cpus_allowed_fallback(struct
return changed;
}
+int cpuset_num_cpus(struct cgroup *cgrp)
+{
+	int nr = num_online_cpus();
+	struct cpuset *cs;
+
+	if (is_in_v2_mode()) {
+		guard(rcu)();
+		cs = css_cs(cgroup_e_css(cgrp, &cpuset_cgrp_subsys));
+		if (cs)
+			nr = cpumask_weight(cs->effective_cpus);
+	}
+
+	return nr;
+}
I just have a question about cgroup v1 support. I am assuming that cgroup v1 without the cpuset_v2_mode mount option is not supported. To fully support cgroup v1, you may have to use guarantee_active_cpus() to return the actual set of CPUs that the task can run on. There is also a caveat about the arm64-specific task_cpu_possible_mask(). That covers 32-bit binaries on 64-bit arm64 systems, which are allowed to run only on a selected subset of cores within the CPU.
This is probably not what you want to focus on right now, but it would be good to have a comment listing the items that are not fully supported here.
FYI, you may have to take the callback_lock to ensure the stability of the effective_cpus mask.
Cheers,
Longman