Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
From: Peter Zijlstra
Date: Thu Apr 02 2026 - 06:25:01 EST
On Thu, Apr 02, 2026 at 10:58:18AM +0530, K Prateek Nayak wrote:
> On 3/30/2026 1:25 PM, K Prateek Nayak wrote:
> > ------------[ cut here ]------------
> > (w_vruntime >> 63) != (w_vruntime >> 62)
> > WARNING: kernel/sched/fair.c:692 at __enqueue_entity+0x382/0x3a0, CPU#5: stress-ng/5062
>
> Back to this: I still see this with the latest set of changes on
> queue:sched/urgent, but it doesn't go kaboom. Nonetheless, it suggests we
> are closing in on the s64 limits of "sum_w_vruntime", which isn't
> very comforting.
Yeah, we are pushing 64bit pretty hard :/ And if all we cared about
was x86_64 I'd have long since used the fact that imul has a 128bit
result and idiv actually divides a 128bit dividend. But even among 64bit
architectures that is somewhat rare :/
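As an aside, the WARN quoted above is an overflow canary: for an s64
value, the top two bits agree exactly when the value still fits in 62
bits plus sign. A minimal Python sketch of the idiom (Python's `>>` is
an arithmetic shift on negative ints, matching C signed shifts; the
wrap helper is my own, not kernel code):

```python
def s64(x):
    """Reduce an arbitrary Python int to a C-style signed 64-bit value."""
    return ((x + (1 << 63)) % (1 << 64)) - (1 << 63)

def outgrew_62_bits(x):
    """The check from the WARN above: shifts by 63 and 62 disagree
    exactly when x no longer fits in [-2**62, 2**62 - 1]."""
    x = s64(x)
    return (x >> 63) != (x >> 62)

assert not outgrew_62_bits(2**61)        # plenty of headroom left
assert outgrew_62_bits(2**62)            # bit 62 set: trips the warning
assert not outgrew_62_bits(-(2**62))     # lower edge still fits
assert outgrew_62_bits(-(2**62) - 1)     # one past the edge: trips
```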
> Here is one scenario where it was triggered when running:
>
> stress-ng --yield=32 -t 10000000s&
> while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
>
> on a 256-CPU machine, about an hour into the run:
>
> __enqueue_entity: entity_key(-141245081754) weight(90891264) overflow_mul(5608800059305154560) vlag(57498) delayed?(0)
> cfs_rq: zero_vruntime(3809707759657809) sum_w_vruntime(0) sum_weight(0) nr_queued(1)
> cfs_rq->curr: entity_key(0) vruntime(3809707759657809) deadline(3809723966988476) weight(37)
>
> The above comes from __enqueue_entity() after a place_entity(). Breaking
> this down:
>
> vlag_initial = 57498
> vlag = (57498 * (37 + 90891264)) / 37 = 141,245,081,754
>
> vruntime = 3809707759657809 - 141245081754 = 3,809,566,514,576,055
> entity_key(se, cfs_rq) = -141,245,081,754
>
> Now, multiplying the entity_key with its own weight results in
> 5,608,800,059,305,154,560 (same as what overflow_mul() suggests) but
> in Python, without overflow, this would be: -12,837,944,014,404,397,056
Oh gawd, this is a 'fun' case.
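To double-check that breakdown with arbitrary-precision arithmetic (a
Python sketch of the numbers above; raw weights are used directly,
i.e. avg_vruntime_weight() is assumed to be the identity here):

```python
def wrap_s64(x):
    """Truncate to a signed 64-bit result, as the kernel's s64 multiply does."""
    return ((x + (1 << 63)) % (1 << 64)) - (1 << 63)

vlag_initial  = 57498
se_weight     = 90891264
curr_weight   = 37                  # the only other load in the tree
zero_vruntime = 3809707759657809

# lag scaled up by (load + se_weight) / load, as in place_entity()
lag = vlag_initial * (curr_weight + se_weight) // curr_weight
assert lag == 141_245_081_754

vruntime = zero_vruntime - lag
assert vruntime == 3_809_566_514_576_055

entity_key = vruntime - zero_vruntime         # -141,245,081,754
product = entity_key * se_weight              # needs roughly 64.5 bits
assert product == -12_837_944_014_404_397_056
assert wrap_s64(product) == 5_608_800_059_305_154_560  # matches overflow_mul()
```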
> One way to avoid the warning entirely would be to pull the zero_vruntime
> close to avg_vruntime if we are enqueuing a very heavy entity.
>
> The correct way to do this would be to compute the actual avg_vruntime()
> and move the zero_vruntime to that point (but that requires at least one
> multiply + divide + update_zero_vruntime()).
>
> One seemingly cheap way I've found to avoid the warning is
> with:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 226509231e67..bc708bb8b5d0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5329,6 +5329,7 @@ static void
> place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> u64 vslice, vruntime = avg_vruntime(cfs_rq);
> + bool update_zero = false;
> s64 lag = 0;
>
> if (!se->custom_slice)
> @@ -5406,6 +5407,17 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> load += avg_vruntime_weight(cfs_rq, curr->load.weight);
>
> lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
> + /*
> + * If the entity_key() * sum_weight of all the enqueued entities
> + * is more than the sum_w_vruntime, move the zero_vruntime
> + * point to the vruntime of the entity which prevents using
> + * more bits than necessary for sum_w_vruntime until the
> + * next avg_vruntime().
> + *
> + * XXX: Cheap enough check?
> + */
> + if (abs(lag) > abs(cfs_rq->sum_w_vruntime))
> + update_zero = true;
> if (WARN_ON_ONCE(!load))
> load = 1;
> lag = div64_long(lag, load);
> @@ -5413,6 +5425,9 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>
> se->vruntime = vruntime - lag;
>
> + if (update_zero)
> + update_zero_vruntime(cfs_rq, -lag);
> +
> if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
> se->deadline += se->vruntime;
> se->rel_deadline = 0;
> ---
>
> But I'm sure it'll make people nervous since we basically move the
> zero_vruntime to se->vruntime. It isn't too bad if:
>
> abs(sum_w_vruntime - (lag * load)) < abs(lag * se->load.weight)
>
> but we already know that the latter overflows so is there any other
> cheaper indicator that we can use to detect the necessity to adjust the
> avg_vruntime beforehand at place_entity()?
So in general I think it would be fine to move zero_vruntime to the
heaviest entity in the tree. And if there are multiple equal heaviest
weights, any one of them should be fine.
By necessity, heavy entities are more tightly clustered -- the lag is
inversely proportional to weight, and the spread is proportional to the
lag bound.
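A toy illustration of that clustering (the lag bound and weights below
are hypothetical, chosen purely to show the 1/weight scaling):

```python
# For a fixed wall-clock lag bound, virtual lag scales as 1/weight,
# so heavy entities can only sit a short virtual distance from the avg.
lag_bound = 3_000_000                     # hypothetical lag bound, in ns

spread = {w: lag_bound // w for w in (15, 1024, 90891264)}
assert spread[15]       == 200_000        # light entity: wide spread
assert spread[1024]     == 2_929          # nice-0-ish: much tighter
assert spread[90891264] == 0              # very heavy: essentially pinned
```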
I suspect something simple like comparing the entity weight against the
sum_weight might be enough. If the pre-existing tree is, in aggregate,
heavier than the new element, the avg will not move very drastically.
However, if the new element is (significantly) heavier than the tree,
the avg will move significantly (as demonstrated here).
That is, something like the below... But with a comment ofc :-)
Does that make sense?
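The intuition can be checked numerically with the figures from the
report above (a Python sketch; the helper is hypothetical, not kernel
code):

```python
def avg_after_enqueue(tree_weight, tree_avg, new_weight, new_key):
    """Weighted average key after adding one entity to the tree."""
    return (tree_weight * tree_avg + new_weight * new_key) // (tree_weight + new_weight)

# Heavy newcomer vs. a light tree (the reported case): the average
# moves almost all the way to the new entity's key.
heavy_in = avg_after_enqueue(37, 0, 90891264, -141_245_081_754)
assert -141_245_081_754 < heavy_in < -141_245_000_000

# Light newcomer vs. a heavy tree: the average barely moves.
light_in = avg_after_enqueue(90891264, 0, 37, -141_245_081_754)
assert light_in == -57_498
```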
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9298f49f842c..7fbd9538fe30 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5329,6 +5329,7 @@ static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
u64 vslice, vruntime = avg_vruntime(cfs_rq);
+ bool update_zero = false;
s64 lag = 0;
if (!se->custom_slice)
@@ -5345,7 +5346,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
*/
if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
struct sched_entity *curr = cfs_rq->curr;
- long load;
+ long load, weight;
lag = se->vlag;
@@ -5405,14 +5406,21 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (curr && curr->on_rq)
load += avg_vruntime_weight(cfs_rq, curr->load.weight);
- lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
+ weight = avg_vruntime_weight(cfs_rq, se->load.weight);
+ lag *= load + weight;
if (WARN_ON_ONCE(!load))
load = 1;
lag = div64_long(lag, load);
+
+ if (weight > load)
+ update_zero = true;
}
se->vruntime = vruntime - lag;
+ if (update_zero)
+ update_zero_vruntime(cfs_rq, -lag);
+
if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
se->deadline += se->vruntime;
se->rel_deadline = 0;