Re: [PATCH v3 7/7] sched/eevdf: Move to a single runqueue

From: Chen, Yu C

Date: Fri Jun 19 2026 - 23:54:51 EST


On 6/5/2026 8:40 PM, Peter Zijlstra wrote:
Change fair/cgroup to a single runqueue.

Infamously fair/cgroup isn't working for a number of people; typically
the complaint is latencies and/or overhead. The latency issue is due
to the intermediate entries that represent a combination of tasks and
thereby obfuscate the runnability of tasks.

The approach here is to leave the cgroup hierarchy as is; including
the intermediate enqueue/dequeue but move the actual EEVDF runqueue
outside. This means things like the shares_weight approximation are
fully preserved.

That is, given a hierarchy like:

              R
              |
              se--G1
                 /  \
           G2--se    se--G3
          /  \           |
    T1--se    se--T2    se--T3

This is fully maintained for load tracking, however the EEVDF parts of
cfs_rq/se go unused for the intermediates and are instead connected
like:

     _R_
    / | \
  T1  T2  T3

Since the effective weight of the entities is determined by the
hierarchy, this gets recomputed on enqueue, set_next_task and tick.

Notably, the effective weight (se->h_load) is computed from the
hierarchical fraction: se->load / cfs_rq->load.
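
As a rough illustration of the hierarchical fraction, here is a minimal
userspace sketch, not the kernel implementation. It assumes the effective
weight is the product of se->load / cfs_rq->load over every level from the
task up to the root. The names toy_se and toy_h_load() and the root_scale
parameter are hypothetical and exist only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical toy entity (not a kernel structure): its own weight, the
 * total weight queued on the runqueue it sits on, and the group entity
 * one level up.
 */
struct toy_se {
	uint64_t load;			/* stands in for se->load.weight */
	uint64_t cfs_rq_load;		/* stands in for cfs_rq->load.weight */
	const struct toy_se *parent;	/* NULL for an entity queued on the root */
};

/*
 * Assumed model: multiply the fraction load / cfs_rq_load at each level,
 * walking from the entity up to the root, starting from root_scale.
 */
static uint64_t toy_h_load(const struct toy_se *se, uint64_t root_scale)
{
	uint64_t h = root_scale;

	for (; se; se = se->parent) {
		if (!se->cfs_rq_load)	/* empty runqueue: nothing to scale by */
			return 0;
		h = h * se->load / se->cfs_rq_load;
	}
	return h;
}
```

For the hierarchy above with every weight set to 1024 and root_scale = 1024,
T1 shares G2 with T2 and G2 shares G1 with G3, so toy_h_load() gives 256 for
T1 and 512 for T3.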

Since EEVDF is now exclusively operating on rq->cfs, it needs to
consider cfs_rq->h_nr_queued rather than cfs_rq->nr_queued. Similarly,
only tasks can get delayed, simplifying some of the cgroup cleanup.

One place where additional information was required was
set_next_task() / put_prev_task(), where we need to track 'current'
both in the hierarchical sense (cfs_rq->h_curr) and in the flat sense
(cfs_rq->curr).

As a result of only having a single level to pick from, many of the
complications in pick_next_task() and preemption go away.

Since many of the hierarchical operations are still there, this won't
immediately fix the performance issues, but hopefully it will fix some
of the latency issues.

TODO: split struct cfs_rq / struct sched_entity
TODO: try and get rid of h_curr

Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>

A divide-by-zero crash is observed when running hackbench:

[14697.488452] CPU: 112 UID: 0 PID: 124791 Comm: hackbench Not tainted 7.1.0-rc2+
[14697.492627] RIP: 0010:propagate_entity_load_avg+0x35f/0x3e0
[14697.506799] <TASK>
[14697.507411] __dequeue_task+0x2b4/0xc70
[14697.508677] dequeue_task_fair+0x36/0x370
[14697.509047] dequeue_task+0x101/0x2f0
[14697.509426] __schedule+0x1b1/0x1a00
[14697.510868] anon_pipe_read+0x3da/0x450
[14697.511400] vfs_read+0x361/0x390
[14697.512053] __x64_sys_read+0x19/0x30

The divide-by-zero happens here:

	if (scale_load_down(gcfs_rq->load.weight)) {
		load_sum = div_u64(gcfs_rq->avg.load_sum,
				   scale_load_down(gcfs_rq->load.weight));
	}

gcfs_rq->load.weight holds an insanely large value. div_u64() takes a
u32 divisor, so the value is truncated to its lower 32 bits, which
happen to be 0, even though the 64-bit check just above it passed.
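
To make the truncation concrete, here is a minimal userspace sketch, not
kernel code. The only kernel fact it relies on is that the divisor parameter
of div_u64() is a u32. The helper names toy_truncate_divisor() and
toy_guarded_div() are made up for this illustration, and the weight is assumed
to be already scaled down.

```c
#include <stdint.h>

/* What a u32 divisor parameter does to a 64-bit weight. */
static uint32_t toy_truncate_divisor(uint64_t weight)
{
	return (uint32_t)weight;	/* keeps only the lower 32 bits */
}

/*
 * Mirrors the shape of the guarded division: the check sees the full
 * 64-bit weight, the division sees only its lower 32 bits. Instead of
 * trapping, it sets *would_trap and returns 0.
 */
static uint64_t toy_guarded_div(uint64_t load_sum, uint64_t weight,
				int *would_trap)
{
	uint32_t divisor = toy_truncate_divisor(weight);

	*would_trap = 0;
	if (!weight)			/* the guard: 64-bit value is zero */
		return 0;
	if (!divisor) {			/* guard passed, divisor is still 0 */
		*would_trap = 1;
		return 0;
	}
	return load_sum / divisor;
}
```

With weight = 1ULL << 32 the guard passes but the truncated divisor is 0, so
toy_guarded_div() sets would_trap to 1. With weight = 10 it simply returns
load_sum / 10.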

An AI-assisted investigation points to a u32 overflow in
update_tg_cfs_runnable() as the root cause; the flat pick path then
falls victim to it through tg_tasks():

u32 new_sum, divider;
...
new_sum = se->avg.runnable_avg * divider; <-- boom
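
The wrap-around can be reproduced in userspace with the sketch below. It is
only an illustration: the helper names are hypothetical, runnable_avg is
modelled as a 64-bit value (unsigned long on a 64-bit kernel), and the sample
numbers in the note that follows are assumptions chosen to cross the 32-bit
limit, not values taken from the crash.

```c
#include <stdint.h>

/* Narrow variant: the product is stored in a 32-bit new_sum and wraps. */
static uint32_t toy_new_sum_u32(uint64_t runnable_avg, uint32_t divider)
{
	return (uint32_t)(runnable_avg * divider);	/* upper bits are lost */
}

/* Wide variant: the full product is kept in a 64-bit new_sum. */
static uint64_t toy_new_sum_u64(uint64_t runnable_avg, uint32_t divider)
{
	return runnable_avg * divider;
}
```

Taking runnable_avg = 102400 (roughly 100 runnable tasks at 1024 each) and
divider = 47742 (LOAD_AVG_MAX in mainline PELT), the true product is
4888780800, which exceeds 2^32. toy_new_sum_u32() returns 593813504, while
toy_new_sum_u64() returns the full 4888780800.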

The following sequence shows how this triggers the crash:

propagate_entity_load_avg()
update_tg_cfs_runnable() # u32 overflow corrupts runnable_sum

__update_load_avg_cfs_rq()
___update_load_avg() # computes insane runnable_avg
update_tg_load_avg() # propagates to tg->runnable_avg

update_cfs_group()
calc_concur_shares()
tg_tasks() # long-to-int truncation, negative nr
reweight_entity() # corrupted se->load.weight
update_load_add() # corrupted cfs_rq->load.weight

propagate_entity_load_avg()
update_tg_cfs_load()
div_u64() # divide-by-zero

Fix by widening new_sum from u32 to u64 (with this fix there is no need
to force tg_tasks() to return unsigned long).

Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Chen Yu <yu.c.chen@xxxxxxxxx>
---
kernel/sched/fair.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d991ea85873a..99ea51448981 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5305,7 +5305,8 @@ static inline void
 update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
 	long delta_sum, delta_avg = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg;
-	u32 new_sum, divider;
+	u64 new_sum;
+	u32 divider;

 	/* Nothing to update */
 	if (!delta_avg)
@@ -5319,7 +5320,7 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf

 	/* Set new sched_entity's runnable */
 	se->avg.runnable_avg = gcfs_rq->avg.runnable_avg;
-	new_sum = se->avg.runnable_avg * divider;
+	new_sum = (u64)se->avg.runnable_avg * divider;
 	delta_sum = (long)new_sum - (long)se->avg.runnable_sum;
 	se->avg.runnable_sum = new_sum;

--
2.45.2