Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime

From: Peter Zijlstra

Date: Tue Apr 07 2026 - 08:01:16 EST

On Fri, Apr 03, 2026 at 09:32:22AM +0530, K Prateek Nayak wrote:
> On 4/2/2026 4:26 PM, K Prateek Nayak wrote:
> >> That is, something like the below... But with a comment ofc :-)
> >>
> >> Does that make sense?
> >
> > Let me go queue an overnight test to see if I trip that warning or
> > not.
>
> Didn't trip any warning and the machine is still up and running
> after 15 Hours so feel free to include:
>
> Tested-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
>
> Perhaps the comment can read something like:
>
> /*
> * A heavy entity can pull the avg_vruntime close to its
> * vruntime post enqueue but the zero_vruntime point is
> * only updated at the next update_deadline() / enqueue
> * / dequeue.
> *
> * Until then, the sum_w_vruntime grows quadratically,
> * proportional to the entity's weight (w_i) as:
> *
> * sum_w_vruntime -= (lag_i * (W + w_i) / W) * w_i
> *
> * If w_i > W, it is beneficial to pull the
> * zero_vruntime towards the entity's vruntime (V_i)
> * since the sum_w_vruntime would only grow by
> * (lag_i * W) which consumes fewer bits than leaving
> * the zero_vruntime at the pre-enqueue avg_vruntime.
> */
> if (weight > load)
> update_zero = true;
>
> Feel free to reword as you see fit :-)

I've made it like so. You did all the hard work after all. Thanks!

---
Subject: sched/fair: Avoid overflow in enqueue_entity()
From: K Prateek Nayak <kprateek.nayak@xxxxxxx>
Date: Tue Apr 7 13:36:17 CEST 2026

Here is one scenario which was triggered when running:

stress-ng --yield=32 -t 10000000s&
while true; do perf bench sched messaging -p -t -l 100000 -g 16; done

on a 256-CPU machine, about an hour into the run:

__enqeue_entity: entity_key(-141245081754) weight(90891264) overflow_mul(5608800059305154560) vlag(57498) delayed?(0)
cfs_rq: zero_vruntime(3809707759657809) sum_w_vruntime(0) sum_weight(0) nr_queued(1)
cfs_rq->curr: entity_key(0) vruntime(3809707759657809) deadline(3809723966988476) weight(37)

The above comes from __enqueue_entity() after a place_entity(). Breaking
this down:

vlag_initial = 57498
vlag = (57498 * (37 + 90891264)) / 37 = 141,245,081,754

vruntime = 3809707759657809 - 141245081754 = 3,809,566,514,576,055
entity_key(se, cfs_rq) = -141,245,081,754

Now, multiplying the entity_key by its own weight results in
5,608,800,059,305,154,560 (same as what overflow_mul() suggests) but
in Python, without overflow, this would be: -12,837,944,014,404,397,056
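
The breakdown above can be re-checked with a few lines of Python. This is only a sketch of the arithmetic, not kernel code: the inputs are copied from the debug output, and wrap_s64() is a hypothetical helper that models a 64-bit two's-complement wraparound.

```python
# Inputs copied from the debug output quoted above.
S64_MIN = -(1 << 63)

def wrap_s64(x):
    """Hypothetical helper: reduce x modulo 2**64, read back as signed 64-bit."""
    x &= (1 << 64) - 1
    return x - (1 << 64) if x >> 63 else x

vlag_initial = 57498
load = 37                          # weight(37) of cfs_rq->curr
weight = 90891264                  # weight of the entity being placed
zero_vruntime = 3809707759657809

# place_entity() scales the stored lag by (load + weight) / load.
vlag = vlag_initial * (load + weight) // load
vruntime = zero_vruntime - vlag
entity_key = vruntime - zero_vruntime

exact = entity_key * weight        # arbitrary-precision product
wrapped = wrap_s64(exact)          # what a 64-bit multiply is left with

print(vlag, vruntime, entity_key)
print(exact, wrapped, exact < S64_MIN)
```

The exact product is below the smallest s64 value, so the 64-bit result differs from it by exactly 2^64.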

Avoid the overflow (without doing the division for avg_vruntime()), by moving
zero_vruntime to the new entity when it is heavier.
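
To see why moving zero_vruntime helps, here is a rough Python model of the weighted sum for the two entities from the trace, evaluated once with v0 left at the old position and once with v0 at the new entity. The helper names and the list-of-tuples layout are illustrative assumptions; the kernel keeps a running sum and its division rounding may differ from Python's floor division.

```python
def sum_w_vruntime(entities, v0):
    """Sum of (v_i - v0) * w_i over (vruntime, weight) pairs."""
    return sum((v - v0) * w for v, w in entities)

def avg_vruntime(entities, v0):
    """Weighted average vruntime reconstructed from the v0-relative sum."""
    total_w = sum(w for _, w in entities)
    return v0 + sum_w_vruntime(entities, v0) // total_w

def fits_s64(x):
    return -(1 << 63) <= x <= (1 << 63) - 1

curr = (3809707759657809, 37)            # light entity already queued
heavy = (3809566514576055, 90891264)     # heavy entity being enqueued
entities = [curr, heavy]

v0_old = curr[0]     # zero_vruntime left where it was
v0_new = heavy[0]    # zero_vruntime moved to the heavy entity

sum_old = sum_w_vruntime(entities, v0_old)
sum_new = sum_w_vruntime(entities, v0_new)

print(sum_old, fits_s64(sum_old))
print(sum_new, fits_s64(sum_new))
print(avg_vruntime(entities, v0_old) == avg_vruntime(entities, v0_new))
```

In this model the average is the same for both choices of v0; only the magnitude of the intermediate sum changes, and with v0 at the heavy entity it stays far inside 64 bits.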

Fixes: 4823725d9d1d ("sched/fair: Increase weight bits for avg_vruntime")
Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
[peterz: suggested 'weight > load' condition]
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
---
kernel/sched/fair.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5352,6 +5352,7 @@ static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
u64 vslice, vruntime = avg_vruntime(cfs_rq);
+ bool update_zero = false;
s64 lag = 0;

if (!se->custom_slice)
@@ -5368,7 +5369,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
*/
if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
struct sched_entity *curr = cfs_rq->curr;
- long load;
+ long load, weight;

lag = se->vlag;

@@ -5428,14 +5429,41 @@ place_entity(struct cfs_rq *cfs_rq, stru
if (curr && curr->on_rq)
load += avg_vruntime_weight(cfs_rq, curr->load.weight);

- lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
+ weight = avg_vruntime_weight(cfs_rq, se->load.weight);
+ lag *= load + weight;
if (WARN_ON_ONCE(!load))
load = 1;
lag = div64_long(lag, load);
+
+ /*
+ * A heavy entity (relative to the tree) will pull the
+ * avg_vruntime close to its vruntime position on enqueue. But
+ * the zero_vruntime point is only updated at the next
+ * update_deadline()/place_entity()/update_entity_lag().
+ *
+ * Specifically (see the comment near avg_vruntime_weight()):
+ *
+ * sum_w_vruntime = \Sum (v_i - v0) * w_i
+ *
+ * Note that if v0 is near a light entity, both terms will be
+ * small for the light entity, while in that case both terms
+ * are large for the heavy entity, leading to risk of
+ * overflow.
+ *
+ * OTOH if v0 is near the heavy entity, then the difference is
+ * larger for the light entity, but the factor is small, while
+ * for the heavy entity the difference is small but the factor
+ * is large. Avoiding the multiplication overflow.
+ */
+ if (weight > load)
+ update_zero = true;
}

se->vruntime = vruntime - lag;

+ if (update_zero)
+ update_zero_vruntime(cfs_rq, -lag);
+
if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
se->deadline += se->vruntime;
se->rel_deadline = 0;
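
The lag scaling and the new 'weight > load' check in the hunk above can be modelled in Python as follows. This is a sketch under stated assumptions, not the kernel implementation: place_lag() is a hypothetical name, the weights are taken as already scaled by avg_vruntime_weight(), and the division truncates toward zero as C signed division does.

```python
def place_lag(vlag, load, weight):
    """Hypothetical model: return (scaled_lag, update_zero).

    load   -- combined weight already on the queue (including curr)
    weight -- weight of the entity being placed
    """
    lag = vlag * (load + weight)
    if load == 0:                 # mirrors the WARN_ON_ONCE(!load) fallback
        load = 1
    # C-style signed division: truncate toward zero.
    q = abs(lag) // load
    lag = q if lag >= 0 else -q
    # A heavier-than-queue entity asks for zero_vruntime to be moved.
    update_zero = weight > load
    return lag, update_zero

# Heavy entity joining a light queue (values from the trace).
print(place_lag(57498, 37, 90891264))
# Light entity joining a heavy queue: lag barely changes, no update.
print(place_lag(57498, 90891264, 37))
```

With the trace values the first call requests the update and the second does not, which matches the intent of only moving zero_vruntime when the new entity dominates the queue.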