Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime

From: K Prateek Nayak

Date: Thu Apr 02 2026 - 01:28:33 EST

On 3/30/2026 1:25 PM, K Prateek Nayak wrote:
> ------------[ cut here ]------------
> (w_vruntime >> 63) != (w_vruntime >> 62)
> WARNING: kernel/sched/fair.c:692 at __enqueue_entity+0x382/0x3a0, CPU#5: stress-ng/5062

Back to this: I still see this with the latest set of changes on
queue:sched/urgent but it doesn't go kaboom. Nonetheless, it suggests we
are closing in on the s64 limits of "sum_w_vruntime", which isn't
very comforting.

Here is one scenario where it was triggered when running:

stress-ng --yield=32 -t 10000000s&
while true; do perf bench sched messaging -p -t -l 100000 -g 16; done

on a 256-CPU machine after about an hour into the run:

__enqeue_entity: entity_key(-141245081754) weight(90891264) overflow_mul(5608800059305154560) vlag(57498) delayed?(0)
cfs_rq: zero_vruntime(3809707759657809) sum_w_vruntime(0) sum_weight(0) nr_queued(1)
cfs_rq->curr: entity_key(0) vruntime(3809707759657809) deadline(3809723966988476) weight(37)

The above comes from __enqueue_entity() after a place_entity(). Breaking
this down:

vlag_initial = 57498
vlag = (57498 * (37 + 90891264)) / 37 = 141,245,081,754

vruntime = 3809707759657809 - 141245081754 = 3,809,566,514,576,055
entity_key(se, cfs_rq) = -141,245,081,754

Now, multiplying the entity_key by its own weight results in
5,608,800,059,305,154,560 (same as what overflow_mul() suggests) but
in Python, without overflow, this would be: -12,837,944,014,404,397,056
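
For anyone who wants to replay the numbers, here is a small Python sketch
of the breakdown above. The constants are copied from the trace; to_s64()
is a helper made up for this illustration (not a kernel function) that
emulates wrapping to a signed 64-bit value.

```python
# Values copied from the trace above.
ZERO_VRUNTIME = 3809707759657809   # cfs_rq zero_vruntime (== curr's vruntime)
VLAG_INITIAL = 57498               # vlag(57498)
W_CURR = 37                        # weight(37) from the cfs_rq->curr line
W_SE = 90891264                    # weight of the entity being enqueued


def to_s64(x):
    """Hypothetical helper: wrap a Python int to a signed 64-bit value."""
    x &= (1 << 64) - 1
    return x - (1 << 64) if x >= (1 << 63) else x


# place_entity() scaling: lag * (load + w_se) / load, with load == W_CURR.
# Both operands are positive here, so // matches C's truncating division.
vlag = VLAG_INITIAL * (W_CURR + W_SE) // W_CURR
vruntime = ZERO_VRUNTIME - vlag
key = vruntime - ZERO_VRUNTIME     # entity_key(se, cfs_rq)

exact = key * W_SE                 # arbitrary-precision product
wrapped = to_s64(exact)            # what an s64 ends up holding

print(vlag, vruntime, key)
print(exact, wrapped)
```

The exact product lies below -2^63, so it does not fit in an s64; adding
2^64 to it gives precisely the value that overflow_mul() reported.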

Now, the fact that it doesn't crash suggests to me that the later
avg_vruntime() calculation restores normality and sum_w_vruntime
becomes -57498 (vlag_initial) * 90891264 (weight) =
-5,226,065,897,472 (assuming curr's vruntime is still the same), which
only requires 43 bits.
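
As a rough illustration, the condition from the warning can be evaluated
in Python on both values. This assumes the kernel expression is an
arithmetic shift on an s64, which Python's >> mimics for values already
in s64 range; warns() is a name made up for this sketch.

```python
def warns(w_vruntime):
    """Mirror of (w_vruntime >> 63) != (w_vruntime >> 62).

    Only meaningful for inputs already within the s64 range. The two
    shifts differ exactly when the top two bits of the s64 differ,
    i.e. when the value no longer fits in 63 signed bits.
    """
    return (w_vruntime >> 63) != (w_vruntime >> 62)


wrapped = 5608800059305154560      # overflow_mul() value from the trace
restored = -57498 * 90891264       # sum_w_vruntime after re-zeroing

print(warns(wrapped))              # True: bit 62 is set, bit 63 is clear
print(warns(restored))             # False: both shifts yield -1
print(abs(restored).bit_length())  # 43 bits of magnitude
```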

I also added the following at the bottom of dequeue_entity():

WARN_ON_ONCE(!cfs_rq->nr_queued && cfs_rq->sum_w_vruntime);

which was never triggered when the cfs_rq goes idle so it isn't like we
didn't account sum_w_vruntime properly. There was just a momentary
overflow so we are fine but will it always be that way?

One way to avoid the warning entirely would be to pull the zero_vruntime
close to avg_vruntime if we are enqueuing a very heavy entity.

The correct way to do this would be to compute the actual avg_vruntime()
and move the zero_vruntime to that point (but that requires at least one
multiply + divide + update_zero_vruntime()).

One seemingly cheap way by which I've been able to avoid the warning is
with:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 226509231e67..bc708bb8b5d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5329,6 +5329,7 @@ static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
u64 vslice, vruntime = avg_vruntime(cfs_rq);
+ bool update_zero = false;
s64 lag = 0;

if (!se->custom_slice)
@@ -5406,6 +5407,17 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
load += avg_vruntime_weight(cfs_rq, curr->load.weight);

lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
+ /*
+ * If the entity_key() * sum_weight of all the enqueued entities
+ * is more than the sum_w_vruntime, move the zero_vruntime
+ * point to the vruntime of the entity which prevents using
+ * more bits than necessary for sum_w_vruntime until the
+ * next avg_vruntime().
+ *
+ * XXX: Cheap enough check?
+ */
+ if (abs(lag) > abs(cfs_rq->sum_w_vruntime))
+ update_zero = true;
if (WARN_ON_ONCE(!load))
load = 1;
lag = div64_long(lag, load);
@@ -5413,6 +5425,9 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)

se->vruntime = vruntime - lag;

+ if (update_zero)
+ update_zero_vruntime(cfs_rq, -lag);
+
if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
se->deadline += se->vruntime;
se->rel_deadline = 0;
---
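
To see what the check does for the trace above, here is a toy Python
model. It assumes avg_vruntime() equals zero_vruntime at this point
(sum_weight is 0 and curr sits at key 0), and it assumes
update_zero_vruntime(cfs_rq, delta) adds delta to zero_vruntime and
subtracts sum_weight * delta from sum_w_vruntime. That model of the
helper is my assumption for illustration, not copied from the tree.

```python
ZERO = 3809707759657809            # zero_vruntime == curr's vruntime
W_CURR, W_SE = 37, 90891264        # weights from the trace
sum_w_vruntime = 0                 # from the cfs_rq line in the trace
sum_weight = 0

lag = 57498 * (W_CURR + W_SE)      # lag *= load + weight(se)
update_zero = abs(lag) > abs(sum_w_vruntime)
lag //= W_CURR                     # div64_long(lag, load); operands positive
se_vruntime = ZERO - lag           # se->vruntime = vruntime - lag

zero = ZERO
if update_zero:
    # Assumed effect of update_zero_vruntime(cfs_rq, -lag).
    delta = -lag
    sum_w_vruntime -= sum_weight * delta
    zero += delta

se_key = se_vruntime - zero        # key of the heavy entity
curr_key = ZERO - zero             # key of the light curr

print(update_zero, se_key, curr_key)
print(se_key * W_SE, curr_key * W_CURR)
```

In this model the heavy entity lands at key 0, and the same distance is
instead multiplied by curr's weight of 37, giving 5,226,068,024,898,
which is far from the s64 limit.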

But I'm sure it'll make people nervous since we basically move the
zero_vruntime to se->vruntime. It isn't too bad if:

abs(sum_w_vruntime - (lag * load)) < abs(lag * se->load.weight)

but we already know that the latter overflows so is there any other
cheaper indicator that we can use to detect the necessity to adjust the
avg_vruntime beforehand at place_entity()?

--
Thanks and Regards,
Prateek