Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking

From: John Stultz

Date: Sat Mar 28 2026 - 01:44:49 EST


On Wed, Feb 18, 2026 at 11:58 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> It turns out that zero_vruntime tracking is broken when there is but a single
> task running. Current update paths are through __{en,de}queue_entity(), and
> when there is but a single task, pick_next_task() will always return that one
> task, and put_prev_set_next_task() will end up in neither function.
>
> This can cause entity_key() to grow indefinitely large and cause overflows,
> leading to much pain and suffering.
>
> Furthermore, doing update_zero_vruntime() from __{de,en}queue_entity(), which
> are called from {set_next,put_prev}_entity() has problems because:
>
> - set_next_entity() calls __dequeue_entity() before it does cfs_rq->curr = se.
> This means the avg_vruntime() will see the removal but not current, missing
> the entity for accounting.
>
> - put_prev_entity() calls __enqueue_entity() before it does cfs_rq->curr =
> NULL. This means the avg_vruntime() will see the addition *and* current,
> leading to double accounting.
>
> Both cases are incorrect/inconsistent.
>
> Noting that avg_vruntime is already called on each {en,de}queue, remove the
> explicit avg_vruntime() calls (which removes an extra 64bit division for each
> {en,de}queue) and have avg_vruntime() update zero_vruntime itself.
>
> Additionally, have the tick call avg_vruntime() -- discarding the result, but
> for the side-effect of updating zero_vruntime.

Hey all,

So in stress testing with my full proxy-exec series, I was
occasionally tripping over a situation where __pick_eevdf() returns
NULL, which quickly leads to a crash.

Initially I thought the bug was in my out-of-tree patches, but I
later found I could trip it with upstream as well, and I believe I
have bisected it down to this patch. Reproduction often takes 3-4
hours, though, and I usually quit testing after 5 hours, so it's
possible I have some false negatives and the problem could have
arisen earlier.

From a little bit of debugging (done with the full proxy-exec series;
I need to re-debug with vanilla), the usual symptom is that we run
into a situation where !entity_eligible(cfs_rq, curr), so curr gets
set to NULL (though in one case I saw cfs_rq->curr start out NULL),
and then we never set best, so the `if (!best || ...) best = curr;`
fallback doesn't save us and we return NULL and crash.

I still need to dig more into the eligibility values and dump the rq
to see why nothing is being found. I am running with
CONFIG_SCHED_PROXY_EXEC enabled, so there may yet be some collision
between this change and the already-upstream portions of Proxy Exec
(I'll have to do more testing to see if it reproduces without that
option enabled).

The backtrace usually comes from the stress-ng yield stressor (stress-ng-yield):

[ 3775.898617] BUG: kernel NULL pointer dereference, address: 0000000000000059
[ 3775.903089] #PF: supervisor read access in kernel mode
[ 3775.906068] #PF: error_code(0x0000) - not-present page
[ 3775.909102] PGD 0 P4D 0
[ 3775.910656] Oops: Oops: 0000 [#1] SMP NOPTI
[ 3775.913371] CPU: 36 UID: 0 PID: 269131 Comm: stress-ng-yield Tainted: G W 7.0.0-rc5-00001-g42a93b71138f #5 PREEMPT(full)
[ 3775.920304] Tainted: [W]=WARN
[ 3775.922100] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
[ 3775.927852] RIP: 0010:pick_task_fair+0x6f/0xb0
[ 3775.930466] Code: 85 ff 74 52 48 8b 47 48 48 85 c0 74 d6 80 78 58 00 74 d0 48 89 3c 24 e8 8f 9b ff ff 48 8b 3c 24 be 01 00 00 00 e8 51 74 ff ff <80> 78 59 00 74 c3 ba 21 00 00 00 48 89 c6 48 89 df e8 5b f1 ff ff
[ 3775.941027] RSP: 0018:ffffc9003827fde0 EFLAGS: 00010086
[ 3775.943949] RAX: 0000000000000000 RBX: ffff8881b972bc40 RCX: 0000000000000803
[ 3775.948179] RDX: 00000041acc1002a RSI: 000000b0cef5382a RDI: 000040138cc6cd49
[ 3775.952149] RBP: ffffc9003827fef8 R08: 0000000000000400 R09: 0000000000000002
[ 3775.956548] R10: 0000000000000024 R11: ffff8881b04a4d40 R12: ffff8881b04a4280
[ 3775.960480] R13: ffff8881b04a4280 R14: ffffffff82ce70a8 R15: ffff8881b972bc40
[ 3775.964713] FS: 00007f6ecb7a6b00(0000) GS:ffff888235beb000(0000) knlGS:0000000000000000
[ 3775.969468] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3775.972960] CR2: 0000000000000059 CR3: 000000019c32a003 CR4: 0000000000370ef0
[ 3775.977008] Call Trace:
[ 3775.978581] <TASK>
[ 3775.979841] pick_next_task_fair+0x3c/0x8e0
[ 3775.982408] ? lock_is_held_type+0xcd/0x130
[ 3775.984833] __schedule+0x20f/0x14d0
[ 3775.987287] ? do_sched_yield+0xa2/0xe0
[ 3775.989365] schedule+0x3d/0x130
[ 3775.991376] __do_sys_sched_yield+0xe/0x20
[ 3775.993889] do_syscall_64+0xf3/0x680
[ 3775.996229] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 3776.000459] RIP: 0033:0x7f6ecc0e18c7
[ 3776.002757] Code: 73 01 c3 48 8b 0d 49 d5 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 18 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 19 d5 0e 00 f7 d8 64 89 01 48

I'll continue digging next week on this, but wanted to share in case
anyone else sees something obvious first.

thanks
-john