Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
From: K Prateek Nayak
Date: Thu Apr 02 2026 - 06:57:20 EST
Hello Peter,
On 4/2/2026 3:52 PM, Peter Zijlstra wrote:
> On Thu, Apr 02, 2026 at 10:58:18AM +0530, K Prateek Nayak wrote:
>> On 3/30/2026 1:25 PM, K Prateek Nayak wrote:
>>> ------------[ cut here ]------------
>>> (w_vruntime >> 63) != (w_vruntime >> 62)
>>> WARNING: kernel/sched/fair.c:692 at __enqueue_entity+0x382/0x3a0, CPU#5: stress-ng/5062
>>
>> Back to this: I still see this with the latest set of changes on
>> queue:sched/urgent but it doesn't go kaboom. Nonetheless, it suggests we
>> are closing in on the s64 limitations of "sum_w_vruntime", which isn't
>> very comforting.
>
> Yeah, we are pushing 64bit pretty hard :/ And if all we would care about
> was x86_64 I'd have long since used the fact that imul has a 128bit
> result and idiv actually divides 128bit. But even among 64bit
> architectures that is somewhat rare :/
Guess we have to make do with what is more abundant. We haven't crashed
and burnt yet so it should be a fun debug for future us when we get
there :-)
>
>> Here is one scenario where it was triggered when running:
>>
>> stress-ng --yield=32 -t 10000000s&
>> while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
>>
>> on a 256CPUs machine after about an hour into the run:
>>
>> __enqueue_entity: entity_key(-141245081754) weight(90891264) overflow_mul(5608800059305154560) vlag(57498) delayed?(0)
>> cfs_rq: zero_vruntime(3809707759657809) sum_w_vruntime(0) sum_weight(0) nr_queued(1)
>> cfs_rq->curr: entity_key(0) vruntime(3809707759657809) deadline(3809723966988476) weight(37)
>>
>> The above comes from __enqueue_entity() after a place_entity(). Breaking
>> this down:
>>
>> vlag_initial = 57498
>> vlag = (57498 * (37 + 90891264)) / 37 = 141,245,081,754
>>
>> vruntime = 3809707759657809 - 141245081754 = 3,809,566,514,576,055
>> entity_key(se, cfs_rq) = -141,245,081,754
>>
>> Now, multiplying the entity_key by its own weight results in
>> 5,608,800,059,305,154,560 (same as what overflow_mul() suggests) but
>> in Python, without overflow, this would be: -12,837,944,014,404,397,056
>
> Oh gawd, this is a 'fun' case.
>
>> One way to avoid the warning entirely would be to pull the zero_vruntime
>> close to avg_vruntime if we are enqueuing a very heavy entity.
>>
>> The correct way to do this would be to compute the actual avg_vruntime()
>> and move the zero_vruntime to that point (but that requires at least one
>> multiply + divide + update_zero_vruntime()).
>>
>> One seemingly cheap way by which I've been able to avoid the warning is
>> with:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 226509231e67..bc708bb8b5d0 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5329,6 +5329,7 @@ static void
>> place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>> {
>> u64 vslice, vruntime = avg_vruntime(cfs_rq);
>> + bool update_zero = false;
>> s64 lag = 0;
>>
>> if (!se->custom_slice)
>> @@ -5406,6 +5407,17 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>> load += avg_vruntime_weight(cfs_rq, curr->load.weight);
>>
>> lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
>> + /*
>> + * If the entity_key() * sum_weight of all the enqueued entities
>> + * is more than the sum_w_vruntime, move the zero_vruntime
>> + * point to the vruntime of the entity which prevents using
>> + * more bits than necessary for sum_w_vruntime until the
>> + * next avg_vruntime().
>> + *
>> + * XXX: Cheap enough check?
>> + */
>> + if (abs(lag) > abs(cfs_rq->sum_w_vruntime))
>> + update_zero = true;
>> if (WARN_ON_ONCE(!load))
>> load = 1;
>> lag = div64_long(lag, load);
>> @@ -5413,6 +5425,9 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>
>> se->vruntime = vruntime - lag;
>>
>> + if (update_zero)
>> + update_zero_vruntime(cfs_rq, -lag);
>> +
>> if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
>> se->deadline += se->vruntime;
>> se->rel_deadline = 0;
>> ---
>>
>> But I'm sure it'll make people nervous since we basically move the
>> zero_vruntime to se->vruntime. It isn't too bad if:
>>
>> abs(sum_w_vruntime - (lag * load)) < abs(lag * se->load.weight)
>>
>> but we already know that the latter overflows so is there any other
>> cheaper indicator that we can use to detect the necessity to adjust the
>> avg_vruntime beforehand at place_entity()?
>
> So in general I think it would be fine to move zero_vruntime to the
> heaviest entity in the tree. And if there are multiple equal heaviest
> weights, any one of them should be fine.
>
> Per necessity heavy entities are more tightly clustered -- the lag is
> inversely proportional to weight, and the spread is proportional to the
> lag bound.
>
> I suspect something simple like comparing the entity weight against the
> sum_weight might be enough. If the pre-existing tree is, in aggregate,
> heavier than the new element, the avg will not move very drastically.
> However, if the new element is (significantly) heavier than the tree,
> the avg will move significantly (as demonstrated here).
>
> That is, something like the below... But with a comment ofc :-)
>
> Does that make sense?
Let me go queue an overnight test to see if I trip that warning or
not. I initially thought this might work, but then convinced myself
that testing the spread with "sum_w_vruntime" might prove to be
better; we'll know for sure tomorrow ;-)
--
Thanks and Regards,
Prateek