Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking

From: Peter Zijlstra

Date: Tue Mar 31 2026 - 08:25:25 EST


On Tue, Mar 31, 2026 at 11:29:09AM +0200, Peter Zijlstra wrote:
> On Tue, Mar 31, 2026 at 02:19:54PM +0530, K Prateek Nayak wrote:
> > On 3/31/2026 12:44 PM, Peter Zijlstra wrote:
> > > On Tue, Mar 31, 2026 at 09:08:23AM +0200, Peter Zijlstra wrote:
> > >> On Tue, Mar 31, 2026 at 06:08:27AM +0530, K Prateek Nayak wrote:
> > >>
> > >>> The above doesn't recover after an avg_vruntime(). Btw I'm running:
> > >>>
> > >>> nice -n 19 stress-ng --yield 32 -t 1000000s&
> > >>> while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
> > >>
> > >> And you're running that on a 16 cpu machine / vm ?
> > >
> > > W00t, it went b00m. Ok, let me go add some tracing.
> >
> > I could only repro it on baremetal after a few hours but good to know it
> > exploded effortlessly on your end! Was this a 16vCPU VM with the same
> > recipe?
>
> Yep. It almost insta triggers. Trying to make sense of the traces now.

So the thing I'm seeing is that avg_vruntime() is behind where it
should be, not by much, but every time it goes *boom* it is just far
enough behind that no entity is eligible.

sched-messaging-2192 [039] d..2. 77.136100: pick_task_fair: cfs_rq(39:ff4a5bc7bebeb680): sum_w_vruntime(194325882) sum_weight(5120) zero_vruntime(105210161141318) avg_vruntime(105210161179272)
sched-messaging-2192 [039] d..2. 77.136100: pick_task_fair: T se(ff4a5bc79040c940): vruntime(105210161556539) deadline(105210164099443) weight(1048576) -- sched-messaging:2340
sched-messaging-2192 [039] d..2. 77.136101: pick_task_fair: T se(ff4a5bc794ce98c0): vruntime(105210161435669) deadline(105210164235669) weight(1048576) -- sched-messaging:2212
sched-messaging-2192 [039] d..2. 77.136101: pick_task_fair: T se(ff4a5bc7952d3100): vruntime(105210161580240) deadline(105210164380240) weight(1048576) -- sched-messaging:2381
sched-messaging-2192 [039] d..2. 77.136102: pick_task_fair: T se(ff4a5bc794c318c0): vruntime(105210161818264) deadline(105210164518004) weight(1048576) -- sched-messaging:2306
sched-messaging-2192 [039] d..2. 77.136103: pick_task_fair: T se(ff4a5bc796b4b100): vruntime(105210161831546) deadline(105210164631546) weight(1048576) -- sched-messaging:2551
sched-messaging-2192 [039] d..2. 77.136104: pick_task_fair: min_lag(-652274) max_lag(0) limit(38000000)
sched-messaging-2192 [039] d..2. 77.136104: pick_task_fair: picked NULL!!

If we compute avg_vruntime() manually, the per-task sum_w_vruntime
contributions are:

(105210161556539-105210161141318)*1024
425186304
(105210161435669-105210161141318)*1024
301415424
(105210161580240-105210161141318)*1024
449456128
(105210161818264-105210161141318)*1024
693192704
(105210161831546-105210161141318)*1024
706793472

Which combined is:

425186304+301415424+449456128+693192704+706793472
2576044032

NOTE: this is larger than the recorded sum_w_vruntime(194325882).

Divided by sum_weight and added to zero_vruntime, that gives:

2576044032/5120
503133.60000000000000000000
105210161141318+503133.60000000000000000000
105210161644451.60000000000000000000

Which is where avg_vruntime() *should* be, except it ends up at:

avg_vruntime(105210161179272), which then leaves no eligible entities.

Note that with the computed avg, the first 3 entities would be eligible.

This suggests I go build a parallel infrastructure to double-check when
and where this goes sideways.

... various attempts later ....


sched-messaging-1021 [009] d..2. 34.483159: update_curr: T<=> se(ff37d0bcd52718c0): vruntime(56921690782736, E) deadline(56921693563331) weight(1048576) -- sched-messaging:1021
sched-messaging-1021 [009] d..2. 34.483160: __avg_vruntime: cfs_rq(9:ff37d0bcfe46b680): delta(-48327) sum_w_vruntime(811471242) zero_vruntime(56921691575188)

sched-messaging-1021 [009] d..2. 34.483160: pick_task_fair: cfs_rq(9:ff37d0bcfe46b680): sum_w_vruntime(811471242) sum_weight(6159) zero_vruntime(56921691575188) avg_vruntime(56921691706941)
sched-messaging-1021 [009] d..2. 34.483160: pick_task_fair: T< se(ff37d0bcd5c6c940): vruntime(56921691276707, E) deadline(56921694076707) weight(1048576) -- sched-messaging:1276
sched-messaging-1021 [009] d..2. 34.483161: pick_task_fair: T se(ff37d0bcd56f98c0): vruntime(56921691917863) deadline(56921694079320) weight(1048576) -- sched-messaging:1201
sched-messaging-1021 [009] d..2. 34.483162: pick_task_fair: T se(ff37d0bcd5344940): vruntime(56921691340323, E) deadline(56921694140323) weight(1048576) -- sched-messaging:1036
sched-messaging-1021 [009] d..2. 34.483163: pick_task_fair: T se(ff37d0bcd56dc940): vruntime(56921691637185, E) deadline(56921694403038) weight(1048576) -- sched-messaging:1179
sched-messaging-1021 [009] d..2. 34.483164: pick_task_fair: T se(ff37d0bcd43eb100): vruntime(56921691629067, E) deadline(56921694429067) weight(1048576) -- sched-messaging:786
sched-messaging-1021 [009] d..2. 34.483164: pick_task_fair: T se(ff37d0bcd5d80080): vruntime(56921691810771) deadline(56921694610771) weight(1048576) -- sched-messaging:1291
sched-messaging-1021 [009] d..2. 34.483165: pick_task_fair: T se(ff37d0bcd027b100): vruntime(56921734696810) deadline(56921917287562) weight(15360) -- stress-ng-yield:693
sched-messaging-1021 [009] d..2. 34.483165: pick_task_fair: min_lag(-42989869) max_lag(430234) limit(38000000)
sched-messaging-1021 [009] d..2. 34.483166: pick_task_fair: swv(811471242)
sched-messaging-1021 [009] d..2. 34.483167: __dequeue_entity: cfs_rq(9:ff37d0bcfe46b680): sum_w_vruntime(1117115786) zero_vruntime(56921691575188)

set_next_task(1276):

swv -= key * weight

811471242 - (56921691276707-56921691575188)*1024
1117115786

OK

sched-messaging-1276 [009] d.h2. 34.483168: update_curr: T<=> se(ff37d0bcd5c6c940): vruntime(56921691285759, E) deadline(56921694076707) weight(1048576) -- sched-messaging:1276
sched-messaging-1276 [009] d.h2. 34.483169: __avg_vruntime: cfs_rq(9:ff37d0bcfe46b680): delta(22156) sum_w_vruntime(319064896) zero_vruntime(56921691597344)

swv -= sw * delta

1117115786 - 5135 * 22156
1003344726

WTF!?!

zv += delta

56921691575188 + 22156
56921691597344

OK


sched-messaging-1276 [009] d.h2. 34.483169: place_entity: T< se(ff37d0bcd52718c0): vruntime(56921690673139, E) deadline(56921693473139) weight(1048576) -- sched-messaging:1021
sched-messaging-1276 [009] d.h2. 34.483170: __enqueue_entity: cfs_rq(9:ff37d0bcfe46b680): sum_w_vruntime(-627321024) zero_vruntime(56921691597344)

swv += key * weight

Should be:

1003344726 + (56921690673139 - 56921691597344) * 1024
56958806 [*]

But is:

319064896 + (56921690673139 - 56921691597344) * 1024
-627321024

Consistent, but wrong

sched-messaging-1276 [009] d..2. 34.483173: update_curr: T<=> se(ff37d0bcd5c6c940): vruntime(56921691289762, E) deadline(56921694076707) weight(1048576) -- sched-messaging:1276
sched-messaging-1276 [009] d..2. 34.483173: __avg_vruntime: cfs_rq(9:ff37d0bcfe46b680): delta(571) sum_w_vruntime(180635073) zero_vruntime(56921691466161)

This would be dequeue(1276) update_entity_lag(), but the numbers make no sense...

swv -= sw * delta

-627321024 - 6159 * 571
-630837813 != 180635073


zv += delta

56921691597344 + 571
56921691597915 != 56921691466161

Also, the actual delta would be (zero_vruntime - prev zero_vruntime):

56921691466161-56921691597344
-131183

At which point we can reconstruct the swv value from where we left off [*]:

56958806 - -131183 * 6159
864914903


But the actual state makes no frigging sense....


sched-messaging-1276 [009] d..2. 34.483174: pick_task_fair: cfs_rq(9:ff37d0bcfe46b680): sum_w_vruntime(180635073) sum_weight(6159) zero_vruntime(56921691466161) avg_vruntime(56921691495489)
sched-messaging-1276 [009] d..2. 34.483174: pick_task_fair: T< se(ff37d0bcd52718c0): vruntime(56921690673139, E) deadline(56921693473139) weight(1048576) -- sched-messaging:1021
sched-messaging-1276 [009] d..2. 34.483175: pick_task_fair: T se(ff37d0bcd56f98c0): vruntime(56921691917863) deadline(56921694079320) weight(1048576) -- sched-messaging:1201
sched-messaging-1276 [009] d..2. 34.483175: pick_task_fair: T se(ff37d0bcd5344940): vruntime(56921691340323, E) deadline(56921694140323) weight(1048576) -- sched-messaging:1036
sched-messaging-1276 [009] d..2. 34.483176: pick_task_fair: T se(ff37d0bcd56dc940): vruntime(56921691637185) deadline(56921694403038) weight(1048576) -- sched-messaging:1179
sched-messaging-1276 [009] d..2. 34.483177: pick_task_fair: T se(ff37d0bcd43eb100): vruntime(56921691629067) deadline(56921694429067) weight(1048576) -- sched-messaging:786
sched-messaging-1276 [009] d..2. 34.483177: pick_task_fair: T se(ff37d0bcd5d80080): vruntime(56921691810771) deadline(56921694610771) weight(1048576) -- sched-messaging:1291
sched-messaging-1276 [009] d..2. 34.483178: pick_task_fair: T se(ff37d0bcd027b100): vruntime(56921734696810) deadline(56921917287562) weight(15360) -- stress-ng-yield:693
sched-messaging-1276 [009] d..2. 34.483178: pick_task_fair: min_lag(-43201321) max_lag(822350) limit(38000000)
sched-messaging-1276 [009] d..2. 34.483178: pick_task_fair: swv(864914903)
sched-messaging-1276 [009] d..2. 34.483179: pick_task_fair: FAIL



Generated with the below patch on top of -rc6.

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf948db905ed..5462aeac1c45 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -678,6 +678,11 @@ sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)

cfs_rq->sum_w_vruntime += key * weight;
cfs_rq->sum_weight += weight;
+
+ trace_printk("cfs_rq(%d:%px): sum_w_vruntime(%Ld) zero_vruntime(%Ld)\n",
+ rq_of(cfs_rq)->cpu, cfs_rq,
+ cfs_rq->sum_w_vruntime,
+ cfs_rq->zero_vruntime);
}

static void
@@ -688,6 +693,11 @@ sum_w_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)

cfs_rq->sum_w_vruntime -= key * weight;
cfs_rq->sum_weight -= weight;
+
+ trace_printk("cfs_rq(%d:%px): sum_w_vruntime(%Ld) zero_vruntime(%Ld)\n",
+ rq_of(cfs_rq)->cpu, cfs_rq,
+ cfs_rq->sum_w_vruntime,
+ cfs_rq->zero_vruntime);
}

static inline
@@ -698,6 +708,12 @@ void update_zero_vruntime(struct cfs_rq *cfs_rq, s64 delta)
*/
cfs_rq->sum_w_vruntime -= cfs_rq->sum_weight * delta;
cfs_rq->zero_vruntime += delta;
+
+ trace_printk("cfs_rq(%d:%px): delta(%Ld) sum_w_vruntime(%Ld) zero_vruntime(%Ld)\n",
+ rq_of(cfs_rq)->cpu, cfs_rq,
+ delta,
+ cfs_rq->sum_w_vruntime,
+ cfs_rq->zero_vruntime);
}

/*
@@ -712,7 +728,7 @@ void update_zero_vruntime(struct cfs_rq *cfs_rq, s64 delta)
* This means it is one entry 'behind' but that puts it close enough to where
* the bound on entity_key() is at most two lag bounds.
*/
-u64 avg_vruntime(struct cfs_rq *cfs_rq)
+static u64 __avg_vruntime(struct cfs_rq *cfs_rq, bool update)
{
struct sched_entity *curr = cfs_rq->curr;
long weight = cfs_rq->sum_weight;
@@ -743,9 +759,17 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
delta = curr->vruntime - cfs_rq->zero_vruntime;
}

- update_zero_vruntime(cfs_rq, delta);
+ if (update) {
+ update_zero_vruntime(cfs_rq, delta);
+ return cfs_rq->zero_vruntime;
+ }

- return cfs_rq->zero_vruntime;
+ return cfs_rq->zero_vruntime + delta;
+}
+
+u64 avg_vruntime(struct cfs_rq *cfs_rq)
+{
+ return __avg_vruntime(cfs_rq, true);
}

static inline u64 cfs_rq_max_slice(struct cfs_rq *cfs_rq);
@@ -1078,11 +1102,6 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect)
return best;
}

-static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
-{
- return __pick_eevdf(cfs_rq, true);
-}
-
struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
{
struct rb_node *last = rb_last(&cfs_rq->tasks_timeline.rb_root);
@@ -1279,6 +1298,8 @@ s64 update_curr_common(struct rq *rq)
return update_se(rq, &rq->donor->se);
}

+static void print_se(struct cfs_rq *cfs_rq, struct sched_entity *se, bool pick);
+
/*
* Update the current task's runtime statistics.
*/
@@ -1304,6 +1325,10 @@ static void update_curr(struct cfs_rq *cfs_rq)

curr->vruntime += calc_delta_fair(delta_exec, curr);
resched = update_deadline(cfs_rq, curr);
+ if (resched)
+ avg_vruntime(cfs_rq);
+
+ print_se(cfs_rq, curr, true);

if (entity_is_task(curr)) {
/*
@@ -3849,6 +3874,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
bool rel_vprot = false;
u64 vprot;

+ print_se(cfs_rq, se, true);
+
if (se->on_rq) {
/* commit outstanding execution time */
update_curr(cfs_rq);
@@ -3896,6 +3923,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
__enqueue_entity(cfs_rq, se);
cfs_rq->nr_queued++;
}
+
+ print_se(cfs_rq, se, true);
}

static void reweight_task_fair(struct rq *rq, struct task_struct *p,
@@ -5251,6 +5280,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (se->rel_deadline) {
se->deadline += se->vruntime;
se->rel_deadline = 0;
+ print_se(cfs_rq, se, true);
return;
}

@@ -5266,6 +5296,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* EEVDF: vd_i = ve_i + r_i/w_i
*/
se->deadline = se->vruntime + vslice;
+ print_se(cfs_rq, se, true);
}

static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
@@ -5529,31 +5560,6 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, bool first)
se->prev_sum_exec_runtime = se->sum_exec_runtime;
}

-static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags);
-
-/*
- * Pick the next process, keeping these things in mind, in this order:
- * 1) keep things fair between processes/task groups
- * 2) pick the "next" process, since someone really wants that to run
- * 3) pick the "last" process, for cache locality
- * 4) do not run the "skip" process, if something else is available
- */
-static struct sched_entity *
-pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
-{
- struct sched_entity *se;
-
- se = pick_eevdf(cfs_rq);
- if (se->sched_delayed) {
- dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
- /*
- * Must not reference @se again, see __block_task().
- */
- return NULL;
- }
- return se;
-}
-
static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);

static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
@@ -8942,6 +8948,123 @@ static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_f
resched_curr_lazy(rq);
}

+static __always_inline
+void print_se(struct cfs_rq *cfs_rq, struct sched_entity *se, bool pick)
+{
+ bool curr = (se == cfs_rq->curr);
+ bool el = entity_eligible(cfs_rq, se);
+ bool prot = protect_slice(se);
+ bool task = false;
+ char *comm = NULL;
+ int pid = -1;
+
+ if (entity_is_task(se)) {
+ struct task_struct *p = task_of(se);
+ task = true;
+ comm = p->comm;
+ pid = p->pid;
+ }
+
+ trace_printk("%c%c%c%c se(%px): vruntime(%Ld%s) deadline(%Ld) weight(%ld) -- %s:%d\n",
+ task ? 'T' : '@',
+ pick ? '<' : ' ',
+ curr && prot ? '=' : ' ',
+ curr ? '>' : ' ',
+ se, se->vruntime, el ? ", E" : "",
+ se->deadline, se->load.weight,
+ comm, pid);
+}
+
+static struct sched_entity *pick_debug(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *pick = __pick_eevdf(cfs_rq, true);
+ struct sched_entity *curr = cfs_rq->curr;
+ s64 min_lag = 0, max_lag = 0;
+ u64 runtime, weight, z_vruntime, avg;
+ u64 swv = 0;
+
+ s64 limit = 10*(sysctl_sched_base_slice + TICK_NSEC);
+
+ if (curr && !curr->on_rq)
+ curr = NULL;
+
+ runtime = cfs_rq->sum_w_vruntime;
+ weight = cfs_rq->sum_weight;
+ z_vruntime = cfs_rq->zero_vruntime;
+ barrier();
+ avg = __avg_vruntime(cfs_rq, false);
+
+ trace_printk("cfs_rq(%d:%px): sum_w_vruntime(%Ld) sum_weight(%Ld) zero_vruntime(%Ld) avg_vruntime(%Ld)\n",
+ rq_of(cfs_rq)->cpu, cfs_rq,
+ runtime, weight,
+ z_vruntime, avg);
+
+ for (struct rb_node *node = cfs_rq->tasks_timeline.rb_leftmost;
+ node; node = rb_next(node)) {
+ struct sched_entity *se = __node_2_se(node);
+ if (se == curr)
+ curr = NULL;
+ print_se(cfs_rq, se, pick == se);
+
+ swv += (se->vruntime - z_vruntime) * scale_load_down(se->load.weight);
+
+ s64 vlag = avg - se->vruntime;
+ min_lag = min(min_lag, vlag);
+ max_lag = max(max_lag, vlag);
+ }
+
+ if (curr) {
+ print_se(cfs_rq, curr, pick == curr);
+
+ s64 vlag = avg - curr->vruntime;
+ min_lag = min(min_lag, vlag);
+ max_lag = max(max_lag, vlag);
+ }
+
+ trace_printk(" min_lag(%Ld) max_lag(%Ld) limit(%Ld)\n", min_lag, max_lag, limit);
+ trace_printk(" swv(%Ld)\n", swv);
+
+ if (swv != runtime) {
+ trace_printk("FAIL\n");
+ tracing_off();
+ printk("FAIL FAIL FAIL!!!\n");
+ }
+
+// WARN_ON_ONCE(min_lag < -limit || max_lag > limit);
+
+ if (!pick) {
+ trace_printk("picked NULL!!\n");
+ tracing_off();
+ printk("FAIL FAIL FAIL!!!\n");
+ return __pick_first_entity(cfs_rq);
+ }
+
+ return pick;
+}
+
+/*
+ * Pick the next process, keeping these things in mind, in this order:
+ * 1) keep things fair between processes/task groups
+ * 2) pick the "next" process, since someone really wants that to run
+ * 3) pick the "last" process, for cache locality
+ * 4) do not run the "skip" process, if something else is available
+ */
+static struct sched_entity *
+pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *se;
+
+ se = pick_debug(cfs_rq);
+ if (se->sched_delayed) {
+ dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+ /*
+ * Must not reference @se again, see __block_task().
+ */
+ return NULL;
+ }
+ return se;
+}
+
static struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
{
struct sched_entity *se;
@@ -9129,6 +9252,7 @@ static void yield_task_fair(struct rq *rq)
if (entity_eligible(cfs_rq, se)) {
se->vruntime = se->deadline;
se->deadline += calc_delta_fair(se->slice, se);
+ avg_vruntime(cfs_rq);
}
}