Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue

From: John Stultz

Date: Wed May 13 2026 - 21:38:12 EST


On Tue, May 12, 2026 at 10:00 PM John Stultz <jstultz@xxxxxxxxxx> wrote:
> On Tue, May 12, 2026 at 9:51 PM John Stultz <jstultz@xxxxxxxxxx> wrote:
> >
> > On Mon, May 11, 2026 at 5:07 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > >
> > > Change fair/cgroup to a single runqueue.
> > >
> ...
> >
> > I know Vincent was having some perf troubles with this patch, but
> > booting on a 64 vCPU qemu environment, I'm seeing:
> >
> > [ 5.688490] Oops: divide error: 0000 [#1] SMP NOPTI
> > [ 5.689457] CPU: 47 UID: 0 PID: 0 Comm: swapper/47 Not tainted
> > 7.1.0-rc2-00026-g82a8ec6fb3f9 #38 PREEMPT(full)
> > [ 5.689457] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > BIOS 1.17.0-debian-1.17.0-1 04/01/2014
> > [ 5.689457] RIP: 0010:wakeup_preempt_fair+0x1b7/0x430
> > [ 5.689457] Code: 74 0b 48 8b 52 28 48 39 d0 48 0f 47 c2 48 8b b9
> > 90 00 00 00 48 8b b1 08 01 00 00 48 81 ff 00 00 10 00 74 09 48 c1 e0
> > 14 31 9
> > [ 5.689457] RSP: 0000:ffffc9000021fd70 EFLAGS: 00010046
> > [ 5.689457] RAX: 000002ab98000000 RBX: ffff8881b8e2db40 RCX: ffffffff83022a80
> > [ 5.689457] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > [ 5.689457] RBP: 0000000000000001 R08: ffff88810cb14380 R09: ffffffff83022b00
> > [ 5.689457] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
> > [ 5.689457] R13: 0000000000000000 R14: ffff88810cb14300 R15: ffff8881b8e2da00
> > [ 5.689457] FS: 0000000000000000(0000) GS:ffff888235c2e000(0000)
> > knlGS:0000000000000000
> > [ 5.689457] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 5.689457] CR2: 0000000000000000 CR3: 000000000304c001 CR4: 0000000000370ef0
> > [ 5.689457] Call Trace:
> > [ 5.689457] <TASK>
> > [ 5.689457] wakeup_preempt+0xa8/0xd0
> > [ 5.689457] attach_one_task+0xec/0x150
> > [ 5.689457] __schedule+0x1ad8/0x21c0
> > [ 5.689457] schedule_idle+0x22/0x40
> > [ 5.689457] cpu_startup_entry+0x29/0x30
> > [ 5.689457] start_secondary+0xf7/0x100
> > [ 5.689457] common_startup_64+0x13e/0x148
> > [ 5.689457] </TASK>
> > [ 5.689457] Dumping ftrace buffer:
> > [ 5.689457] (ftrace buffer empty)
> > [ 5.689457] ---[ end trace 0000000000000000 ]---
> > [ 5.689457] RIP: 0010:wakeup_preempt_fair+0x1b7/0x430
> > [ 5.689457] Code: 74 0b 48 8b 52 28 48 39 d0 48 0f 47 c2 48 8b b9
> > 90 00 00 00 48 8b b1 08 01 00 00 48 81 ff 00 00 10 00 74 09 48 c1 e0
> > 14 31 9
> > [ 5.689457] RSP: 0000:ffffc9000021fd70 EFLAGS: 00010046
> > [ 5.689457] RAX: 000002ab98000000 RBX: ffff8881b8e2db40 RCX: ffffffff83022a80
> > [ 5.689457] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > [ 5.689457] RBP: 0000000000000001 R08: ffff88810cb14380 R09: ffffffff83022b00
> > [ 5.689457] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
> > [ 5.689457] R13: 0000000000000000 R14: ffff88810cb14300 R15: ffff8881b8e2da00
> > [ 5.689457] FS: 0000000000000000(0000) GS:ffff888235c2e000(0000)
> > knlGS:0000000000000000
> > [ 5.689457] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 5.689457] CR2: 0000000000000000 CR3: 000000000304c001 CR4: 0000000000370ef0
> > [ 5.689457] Kernel panic - not syncing: Fatal exception
> >
> > Which I bisected down to this last patch in the series.
> >
> > faddr2line gave me:
> > __calc_delta at kernel/sched/fair.c:290
> > (inlined by) calc_delta_fair at kernel/sched/fair.c:300
> > (inlined by) update_protect_slice at kernel/sched/fair.c:1070
> > (inlined by) wakeup_preempt_fair at kernel/sched/fair.c:9193
> >
> > This usually trips as the ww_mutex selftest starts at bootup.
> >
> > Unfortunately I still see it with the add-on changes you proposed to K
> > Prateek's feedback here.
> >
> > I'll try to narrow it down further tomorrow.
>
> As karma would have it, this does seem to depend on CONFIG_SCHED_PROXY_EXEC. :)
> I'm guessing the switch in calc_delta_fair() to use se->h_load is
> uncovering something proxy isn't handling properly with that value.
>

So looking at the callstack when I see the failure:
proxy_find_task()
proxy_force_return()
proxy_resched_idle() <- sets rq->donor to idle
attach_one_task()
wakeup_preempt()
wakeup_preempt_fair()
update_protect_slice() <- called with the donor's se
calc_delta_fair()
__calc_delta() <- div by zero

Basically we end up in wakeup_preempt_fair() with rq->donor ==
rq->idle because we earlier called proxy_resched_idle().

Without proxy, if we call wakeup_preempt_fair() when rq->donor (and
rq->curr) is rq->idle, we usually end up taking the `if
(test_tsk_need_resched(rq->curr))` early exit and we don't hit this.

But with proxy, rq->curr isn't idle at this point, so we end up
continuing on. The se_is_idle(se) checks (where se is &donor->se)
don't catch this either, because rq->idle (maybe unintuitively)
has a SCHED_NORMAL policy.
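
To illustrate the distinction, here is a rough userspace sketch (not kernel code). All of the pol_* names are made up for the example, and the policy check only models what is described above: a check keyed on scheduling policy says "not idle" for the per-rq idle task, while a pointer comparison against rq->idle does identify it.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; these are NOT the kernel's definitions. */
enum pol_policy { POL_SCHED_NORMAL, POL_SCHED_IDLE };

struct pol_task {
	enum pol_policy policy;
};

struct pol_rq {
	struct pol_task *idle;	/* the per-rq idle task */
	struct pol_task *donor;	/* whose se gets passed down */
};

/* Models a policy-based "is idle" check: looks only at the policy. */
static bool pol_has_idle_policy(const struct pol_task *p)
{
	return p->policy == POL_SCHED_IDLE;
}

/* Models an identity-based check: is this task the rq's idle task? */
static bool pol_is_rq_idle(const struct pol_rq *rq, const struct pol_task *p)
{
	return p == rq->idle;
}
```

With donor pointing at an idle task whose policy is POL_SCHED_NORMAL, pol_has_idle_policy() returns false while pol_is_rq_idle() returns true, which is the gap described above.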

So we end up getting down to update_protect_slice() with rq->idle's
se, and since the idle h_load.weight is zero, __calc_delta() divides
by zero.

Not sure what the best approach might be, but adding:
	if (donor == rq->idle) {
		/* don't give rq->idle slice protection */
		preempt_action = PREEMPT_WAKEUP_SHORT;
		goto preempt;
	}

similar to the `if (cse_is_idle && !pse_is_idle)` check seems to resolve this.
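
For anyone following along, the effect of that guard can be sketched in userspace roughly as below. This is only a toy model under stated assumptions: the toy_* names and TOY_NICE_0_LOAD are invented for the example, and toy_calc_delta() mimics only the rough shape of the scaling (delta * NICE_0_LOAD / weight), not the real __calc_delta() arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the kernel's definitions. */
#define TOY_NICE_0_LOAD 1024ULL

struct toy_se {
	uint64_t h_load_weight;	/* zero for the idle entity in this model */
};

struct toy_rq {
	struct toy_se *donor;
	struct toy_se *idle;
};

/*
 * Rough shape of the weight scaling only.
 * The caller must guarantee se->h_load_weight != 0.
 */
static uint64_t toy_calc_delta(uint64_t delta, const struct toy_se *se)
{
	return delta * TOY_NICE_0_LOAD / se->h_load_weight;
}

/*
 * Returns false (and leaves *out untouched) when the donor is the idle
 * entity, so the zero weight is never used as a divisor. Otherwise it
 * stores the scaled slice in *out and returns true.
 */
static bool toy_update_protect_slice(const struct toy_rq *rq,
				     uint64_t slice, uint64_t *out)
{
	if (rq->donor == rq->idle)
		return false;	/* no slice protection for idle */

	*out = toy_calc_delta(slice, rq->donor);
	return true;
}
```

In this model, a donor equal to rq->idle is turned away before the division, while a donor with a nonzero weight gets its slice scaled as usual.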

Anyway, if you have thoughts on a better approach, I'd be happy to work
up a patch to add on top of this one.

thanks
-john