Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking

From: Peter Zijlstra

Date: Mon Mar 30 2026 - 06:15:58 EST


On Fri, Mar 27, 2026 at 10:44:28PM -0700, John Stultz wrote:
> On Wed, Feb 18, 2026 at 11:58 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > It turns out that zero_vruntime tracking is broken when there is but a single
> > task running. Current update paths are through __{en,de}queue_entity(), and
> > when there is but a single task, pick_next_task() will always return that one
> > task, and put_prev_set_next_task() will end up in neither function.
> >
> > This can cause entity_key() to grow indefinitely large and cause overflows,
> > leading to much pain and suffering.
> >
> > Furthermore, doing update_zero_vruntime() from __{de,en}queue_entity(), which
> > are called from {set_next,put_prev}_entity() has problems because:
> >
> > - set_next_entity() calls __dequeue_entity() before it does cfs_rq->curr = se.
> > This means the avg_vruntime() will see the removal but not current, missing
> > the entity for accounting.
> >
> > - put_prev_entity() calls __enqueue_entity() before it does cfs_rq->curr =
> > NULL. This means the avg_vruntime() will see the addition *and* current,
> > leading to double accounting.
> >
> > Both cases are incorrect/inconsistent.
> >
> > Noting that avg_vruntime is already called on each {en,de}queue, remove the
> > explicit avg_vruntime() calls (which removes an extra 64bit division for each
> > {en,de}queue) and have avg_vruntime() update zero_vruntime itself.
> >
> > Additionally, have the tick call avg_vruntime() -- discarding the result, but
> > for the side-effect of updating zero_vruntime.
>
> Hey all,
>
> So in stress testing with my full proxy-exec series, I was
> occasionally tripping over the situation where __pick_eevdf() returns
> null which quickly crashes.

> The backtrace is usually due to stress-ng stress-ng-yield test:

Suppose we have 2 runnable tasks, both doing yield. Then one will be
eligible and one will not be, because the average position must be in
between these two entities.

Therefore, the eligible task will yield and have its vruntime advanced by a
full slice (yielding is all these tasks do, after all). This causes it to jump
over the other task: the other task becomes eligible and this one no longer
is. So we schedule.

Since both tasks stay runnable, there is no dequeue or enqueue. All we have is
the __enqueue_entity() and __dequeue_entity() from put_prev_task() /
set_next_task(). But per the fingered commit, those two no longer move
zero_vruntime along.

All that moves zero_vruntime is tick and full dequeue or enqueue.

This means that if the two tasks playing leapfrog can advance fast enough to
hit the overflow point inside one tick's worth of time, we're up a creek.

If this is indeed the case, then the below should cure things.

This also means that running a HZ=100 config will increase the chances
of hitting this vs HZ=1000.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9298f49f842c..c7daaf941b26 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9307,6 +9307,7 @@ static void yield_task_fair(struct rq *rq)
 	if (entity_eligible(cfs_rq, se)) {
 		se->vruntime = se->deadline;
 		se->deadline += calc_delta_fair(se->slice, se);
+		avg_vruntime(cfs_rq);
 	}
 }