Re: [PATCH 0/4] sched: Various reweight_entity() fixes
From: Peter Zijlstra
Date: Thu Feb 12 2026 - 07:01:04 EST
On Thu, Feb 12, 2026 at 01:13:30PM +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> On 2/11/2026 9:58 PM, Peter Zijlstra wrote:
> > On Wed, Feb 11, 2026 at 12:15:48PM +0100, Vincent Guittot wrote:
> >
> >> Regarding the use of calc_delta_fair() in update_entity_lag(), we use
> >> calc_delta_fair() for updating vruntime, deadline, vprot and vlag and
> >> I wonder how this diff of granularity compared to avg_vruntime can be
> >> an issue for sched_entity with a small weight
> >
> > It will effectively inflate their weight.
> >
> > The below seems to 'work' -- it builds, boots and builds a kernel.
> >
> > We could perhaps look at doing that reciprocal thing on unsigned long,
> > but meh.
>
> So I was testing peterz:sched/core at commit 84230c0ac1cf ("sched/fair:
> Use full weight to __calc_delta()") and it still crashes with an
> exorbitantly large number of copies of "yes" running on 6 cores.
Ah yes, our next great adventure :-)
> Here is what I've found out from today's crash on dumping the offending
> cfs_rq:
>
> ...
> Enqueue cfs_rq: depth(1) weight(2144337920) nr_queued(2045) sum_w_vruntime(18343811230990336) sum_weight(2144337920) zero_vruntime(18337852274) sum_shift(0) avg_vruntime(18337852274)
> Dequeue cfs_rq: depth(1) weight(2144337920) nr_queued(2045) sum_w_vruntime(1303379968) sum_weight(2143289344) zero_vruntime(18354971669) sum_shift(0) avg_vruntime(18354971669)
This is about where things go sideways... (see argument below)
> Enqueue cfs_rq: depth(1) weight(2152726528) nr_queued(2053) sum_w_vruntime(-31233567079006208) sum_weight(2152726528) zero_vruntime(18360916028) sum_shift(0) avg_vruntime(18360916028)
> Dequeue cfs_rq: depth(1) weight(2152726528) nr_queued(2053) sum_w_vruntime(-25773095996358656) sum_weight(2151677952) zero_vruntime(18366916795) sum_shift(0) avg_vruntime(18366916795)
> Enqueue cfs_rq: depth(1) weight(2152726528) nr_queued(2053) sum_w_vruntime(-195993505307295744) sum_weight(2152726528) zero_vruntime(18437453448) sum_shift(0) avg_vruntime(18437453448)
> Dequeue cfs_rq: depth(1) weight(2152726528) nr_queued(2053) sum_w_vruntime(-355777007188967424) sum_weight(2151677952) zero_vruntime(18520289238) sum_shift(0) avg_vruntime(18520289238)
> Enqueue cfs_rq: depth(1) weight(2152726528) nr_queued(2053) sum_w_vruntime(-1523774576650092544) sum_weight(2152726528) zero_vruntime(19054245803) sum_shift(0) avg_vruntime(19054245803)
> Dequeue cfs_rq: depth(1) weight(2152726528) nr_queued(2053) sum_w_vruntime(-3015239905359953920) sum_weight(2151677952) zero_vruntime(19756286051) sum_shift(0) avg_vruntime(19756286051)
This is definitely weird; all these tasks should be more or less
equally spread out around avg_vruntime, within +- of the lag bound. This
means that sum_w_vruntime should be on the order of the walltime-delta.
To see this, let us recall that:

           dt_i
    dv_i = ----
           w_i

         dt
    dV = -- ; where: W = \Sum w_i and dt = \Sum dt_i
         W
(Virtual) lag is then defined as the difference between V and v_i. In
the fluid model the lag is 0 -- since everybody always advances equally.
In the discrete model this is translated to \Sum lag_i := 0. Anyway, it
can be shown that \Sum lag_i := 0 is equivalent to:
        \Sum w_i * v_i
    V = --------------
              W
In differential form that gives:

                    dt_i
         \Sum w_i * ----
                    w_i      \Sum dt_i   dt
    dV = ----------------- = --------- = --
                W                W       W
Now: sum_w_vruntime := dt-dt_curr, sum_weight := W-w_curr
[ one way of looking at it is that sum_w_vruntime + curr->vruntime
carries the remainder of the division, while zero_vruntime carries the
whole part ]
Anyway, your trace shows both zero_vruntime and avg_vruntime increasing,
this means dt is positive. However, at the same time it shows
sum_w_vruntime being increasingly negative.
The only way this can be is for curr->vruntime to be increasingly
positive such that the sum ends up being a 'small' positive number.
IOW, we're not running the right tasks (and wrecking the lag bounds in
the process).
>
> # I'm suspecting something goes sideways at this point looking at
> # the big jump that comes after in avg_vruntime()
> # "6222537296247259136" has bit 63 set so could it be a wraparound
> # in sum_w_vruntime?
>
> Enqueue cfs_rq: depth(1) weight(2152726528) nr_queued(2053) sum_w_vruntime(6222537296247259136) sum_weight(2152726528) zero_vruntime(24024889432) sum_shift(0) avg_vruntime(24024889432)
> Dequeue cfs_rq: depth(1) weight(2152726528) nr_queued(2053) sum_w_vruntime(-5928597400870453248) sum_weight(2151677952) zero_vruntime(21110281285) sum_shift(0) avg_vruntime(21110281285)
>
> cfs_rq of failed pick:
> cfs_rq: depth(0) weight(2176843776) nr_queued(2076) sum_w_vruntime(-90907067772043264) sum_weight(2175795200) zero_vruntime(26921355273)
> se: weight(1048576) vruntime(843832351) slice(2800000) deadline(846630448) curr?(1) task?(1) delayed?(0) se_depth(1)
> se: weight(1048576) vruntime(839841550) slice(2800000) deadline(842635020) curr?(0) task?(1) delayed?(0) se_depth(1)
> ...
>
> # Many entities have weight 1048576, and vruntime around 850,000
>
> ....
> se: weight(1048576) vruntime(843832916) slice(2800000) deadline(846629681) curr?(0) task?(1) delayed?(0) se_depth(1)
> se: weight(1048576) vruntime(843833563) slice(2800000) deadline(846630939) curr?(0) task?(1) delayed?(0) se_depth(1)
> se: weight(1048576) vruntime(843840684) slice(2800000) deadline(846633192) curr?(0) task?(1) delayed?(0) se_depth(1)
> se: weight(1048576) vruntime(18344670326) slice(2800000) deadline(18347465508) curr?(0) task?(1) delayed?(0) se_depth(1)
> se: weight(1048576) vruntime(18344679852) slice(2800000) deadline(18347475866) curr?(0) task?(1) delayed?(0) se_depth(1)
> se: weight(1048576) vruntime(18344682061) slice(2800000) deadline(18347477764) curr?(0) task?(1) delayed?(0) se_depth(1)
> ...
>
> # Many entities have weight 1048576, and vruntime around 18,344,682,061
>
>
> I'm running with CONFIG_HZ=250. In vruntime_eligible() we have for curr:
>
> entity_key(curr) = -26077522922LL
> weight = 1048576UL
>
>
> For the cfs_rq we get:
>
> avg = sum_w_vruntime + (entity_key * weight) = -118251332447502336LL
> load = sum_weight + weight = 2176843776UL
>
>
> But on the way to check eligibility we compute:
>
> entity_key * load = -56766693466253033472 (in python)
> entity_key * load = -1426461245124378624LL (in C)
>
>
> vruntime_eligible() returns false always as a result of the overflow and
> we land in a pickle. Not sure if sum_w_vruntime itself can suffer from
> a wrap around that leads to all this.
>
> Parsing the stats based on the dump in Python gives me:
>
> Total weight: 2176843776 (checks out)
> Total weighted vruntime: 3145101113938673664
>
> Avg (floating point): 1444798725.849711
> Avg (signed 64-bit): 1444798725
> Avg (unsigned 64-bit): 1444798725
>
> which is a bit far off from the dumped stats of cfs_rq.
So eligibility is lag > 0, or v_i < V. Thus we have:
v_i < V_0 + dt/W
Because divisions are expensive, we've rearranged things:
(v_i - V_0) < dt/W
(v_i - V_0) * W < dt
Now, (v_i - V_0) is entity_key() and we've argued that:
(v_i - V_0) * w_i ~ 44 bits
But now we do * W, which is n*w_i (assuming, as is the case here, that
all our tasks are of equal weight). Still, this would allow n to be of
the order of 20 bits, far larger than the 2k it is.
But yes, if (v_i - V_0) far exceeds the limits imposed by the lag
bounds, this will go sideways.
Let me go puzzle...
For you, slice+TICK_NSEC should be something like: 6800000, and given
everything is weight '1', we would expect vruntime to always be:
avg_vruntime +- 6800000
But that is clearly not happening.