Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
From: John Stultz
Date: Wed May 13 2026 - 00:52:05 EST
On Mon, May 11, 2026 at 5:07 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> Change fair/cgroup to a single runqueue.
>
> Infamously fair/cgroup isn't working for a number of people; typically
> the complaint is latencies and/or overhead. The latency issue is due
> to the intermediate entries that represent a combination of tasks and
> thereby obfuscate the runnability of tasks.
>
> The approach here is to leave the cgroup hierarchy as is; including
> the intermediate enqueue/dequeue but move the actual EEVDF runqueue
> outside. This means things like the shares_weight approximation are
> fully preserved.
>
> That is, given a hierarchy like:
>
>              R
>              |
>          se--G1
>             /  \
>       G2--se    se--G3
>      /  \           |
> T1--se  se--T2  se--T3
>
> This is fully maintained for load tracking, however the EEVDF parts of
> cfs_rq/se go unused for the intermediates and are instead connected
> like:
>
>      _R_
>     / | \
>   T1  T2  T3
>
> Since the effective weight of the entities is determined by the
> hierarchy, this gets recomputed on enqueue, set_next_task and tick.
>
> Notably, the effective weight (se->h_load) is computed from the
> hierarchical fraction: se->load / cfs_rq->load.
>
> Since EEVDF is now exclusively operating on rq->cfs, it needs to
> consider cfs_rq->h_nr_queued rather than cfs_rq->nr_queued. Similarly,
> only tasks can get delayed, simplifying some of the cgroup cleanup.
>
> One place where additional information was required was
> set_next_task() / put_prev_task(), where we need to track 'current'
> both in the hierarchical sense (cfs_rq->h_curr) and in the flat sense
> (cfs_rq->curr).
>
> As a result of only having a single level to pick from, many of the
> complications in pick_next_task() and preemption go away.
>
> Since many of the hierarchical operations are still there, this won't
> immediately fix the performance issues, but hopefully it will fix some
> of the latency issues.
>
> TODO: split struct cfs_rq / struct sched_entity
> TODO: try and get rid of h_curr
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
I know Vincent was having some perf troubles with this patch, but
booting on a 64 vCPU qemu environment, I'm seeing:
[ 5.688490] Oops: divide error: 0000 [#1] SMP NOPTI
[ 5.689457] CPU: 47 UID: 0 PID: 0 Comm: swapper/47 Not tainted 7.1.0-rc2-00026-g82a8ec6fb3f9 #38 PREEMPT(full)
[ 5.689457] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
[ 5.689457] RIP: 0010:wakeup_preempt_fair+0x1b7/0x430
[ 5.689457] Code: 74 0b 48 8b 52 28 48 39 d0 48 0f 47 c2 48 8b b9
90 00 00 00 48 8b b1 08 01 00 00 48 81 ff 00 00 10 00 74 09 48 c1 e0
14 31 9
[ 5.689457] RSP: 0000:ffffc9000021fd70 EFLAGS: 00010046
[ 5.689457] RAX: 000002ab98000000 RBX: ffff8881b8e2db40 RCX: ffffffff83022a80
[ 5.689457] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 5.689457] RBP: 0000000000000001 R08: ffff88810cb14380 R09: ffffffff83022b00
[ 5.689457] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
[ 5.689457] R13: 0000000000000000 R14: ffff88810cb14300 R15: ffff8881b8e2da00
[ 5.689457] FS: 0000000000000000(0000) GS:ffff888235c2e000(0000) knlGS:0000000000000000
[ 5.689457] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5.689457] CR2: 0000000000000000 CR3: 000000000304c001 CR4: 0000000000370ef0
[ 5.689457] Call Trace:
[ 5.689457] <TASK>
[ 5.689457] wakeup_preempt+0xa8/0xd0
[ 5.689457] attach_one_task+0xec/0x150
[ 5.689457] __schedule+0x1ad8/0x21c0
[ 5.689457] schedule_idle+0x22/0x40
[ 5.689457] cpu_startup_entry+0x29/0x30
[ 5.689457] start_secondary+0xf7/0x100
[ 5.689457] common_startup_64+0x13e/0x148
[ 5.689457] </TASK>
[ 5.689457] Dumping ftrace buffer:
[ 5.689457] (ftrace buffer empty)
[ 5.689457] ---[ end trace 0000000000000000 ]---
[ 5.689457] RIP: 0010:wakeup_preempt_fair+0x1b7/0x430
[ 5.689457] Code: 74 0b 48 8b 52 28 48 39 d0 48 0f 47 c2 48 8b b9
90 00 00 00 48 8b b1 08 01 00 00 48 81 ff 00 00 10 00 74 09 48 c1 e0
14 31 9
[ 5.689457] RSP: 0000:ffffc9000021fd70 EFLAGS: 00010046
[ 5.689457] RAX: 000002ab98000000 RBX: ffff8881b8e2db40 RCX: ffffffff83022a80
[ 5.689457] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 5.689457] RBP: 0000000000000001 R08: ffff88810cb14380 R09: ffffffff83022b00
[ 5.689457] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
[ 5.689457] R13: 0000000000000000 R14: ffff88810cb14300 R15: ffff8881b8e2da00
[ 5.689457] FS: 0000000000000000(0000) GS:ffff888235c2e000(0000) knlGS:0000000000000000
[ 5.689457] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5.689457] CR2: 0000000000000000 CR3: 000000000304c001 CR4: 0000000000370ef0
[ 5.689457] Kernel panic - not syncing: Fatal exception
Which I bisected down to this last patch in the series.
faddr2line gave me:
__calc_delta at kernel/sched/fair.c:290
(inlined by) calc_delta_fair at kernel/sched/fair.c:300
(inlined by) update_protect_slice at kernel/sched/fair.c:1070
(inlined by) wakeup_preempt_fair at kernel/sched/fair.c:9193
This usually trips as the ww_mutex selftest starts at bootup.
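As an illustration only: the faddr2line chain ends in a weight-scaled delta computation, and the Code bytes include a compare against 0x100000 (NICE_0_LOAD on 64-bit) followed by a shift left by 20, while RDI, RSI and RDX are all zero in the dump. That is consistent with, though it does not prove, a zero load weight reaching a division. The userspace sketch below models that general shape (delta * NICE_0_LOAD / weight) with an explicit zero-weight guard. The names sketch_calc_delta_fair and SKETCH_NICE_0_LOAD are hypothetical and are not taken from the series; the body is an assumption about the kind of arithmetic involved, not the code at kernel/sched/fair.c:290.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror NICE_0_LOAD on 64-bit kernels (1 << 20). */
#define SKETCH_NICE_0_LOAD (1ULL << 20)

/*
 * Hypothetical model of a weight-scaled delta:
 *     result = delta * NICE_0_LOAD / weight
 *
 * Returns false instead of dividing when weight is zero (the case that
 * would raise a divide error on x86), or when the shift would overflow
 * 64 bits. On success the scaled value is stored in *out.
 */
static bool sketch_calc_delta_fair(uint64_t delta, uint64_t weight,
				   uint64_t *out)
{
	if (weight == 0)
		return false;		/* would be a divide error */

	if (weight != SKETCH_NICE_0_LOAD) {
		if (delta > (UINT64_MAX >> 20))
			return false;	/* delta << 20 would overflow */
		delta = (delta << 20) / weight;
	}

	*out = delta;
	return true;
}
```

With a weight equal to SKETCH_NICE_0_LOAD the delta passes through unchanged, with twice that weight it is halved, and with a zero weight the helper reports failure rather than trapping.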
Unfortunately I still see it with the add-on changes you proposed in
response to K Prateek's feedback here.
I'll try to narrow it down further tomorrow.
thanks
-john