[PATCH] sched/fair: clamp rescaled vlag in reweight_entity() to bound entity_key()
From: Rik van Riel
Date: Wed Jun 10 2026 - 10:56:18 EST
A 252-CPU machine running an EEVDF kernel hard-locked up. The trigger was
the s64 overflow guard in __sum_w_vruntime_add():
    WARNING: CPU: 181 ... at kernel/sched/fair.c __enqueue_entity+0x1fc
    ...
    __enqueue_entity / put_prev_entity / put_prev_task_fair / __schedule
The WARN fired during a reschedule. The CPU then tried to wake up the
printk worker while already holding the runqueue lock, and deadlocked.
The root cause of this scheduler bug is in the reweight path, not in the
enqueue that tripped the WARN.
reweight_entity() preserves an entity's lag across a weight change by
scaling vlag in rescale_entity():
    se->vlag = vl' = vl * old_weight / new_weight;
and then, for an on_rq entity, recomputes:
    se->vruntime = avruntime - se->vlag;
On a large weight decrease (w' << w) this inflates vlag without bound;
nothing re-clamps it to the per-entity lag limit that entity_lag() and
update_entity_lag() enforce everywhere else. The deadline is rescaled
and re-based separately, so only vruntime drifts.
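The inflation described above can be sketched in userspace C. This is a
simplified model, not the kernel code: rescale_vlag() and
place_vruntime() are hypothetical names, and plain integer division
stands in for the kernel's own arithmetic helpers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the vlag scaling step: vl' = vl * w / w'.
 * No overflow checking; inputs are assumed small enough for s64.
 */
static int64_t rescale_vlag(int64_t vlag, int64_t old_weight, int64_t new_weight)
{
	assert(old_weight > 0 && new_weight > 0);
	return vlag * old_weight / new_weight;
}

/* Hypothetical model of the on_rq placement: vruntime = avruntime - vlag. */
static int64_t place_vruntime(int64_t avruntime, int64_t vlag)
{
	return avruntime - vlag;
}
```

With these helpers, a 1 ms lag reweighted from 1048576 down to 1024 grows
by a factor of 1024, and the placed vruntime lands that far below
avruntime. Nothing in this step limits the growth.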
For a group entity reweighted via update_cfs_group() this lets
se->vruntime drift arbitrarily far below cfs_rq->zero_vruntime. A later
__enqueue_entity() then computes
    key        = entity_key(cfs_rq, se) = se->vruntime - cfs_rq->zero_vruntime
    weight     = avg_vruntime_weight(cfs_rq, se->load.weight)
    w_vruntime = key * weight
and trips the s64 overflow guard in __sum_w_vruntime_add():
    WARN_ON_ONCE((w_vruntime >> 63) != (w_vruntime >> 62));
leaving the avg-vruntime accounting and the rb-tree ordering corrupt,
which in turn wedges the load balancer (every CPU spinning in
sched_balance_rq()) into a hard lockup.
Observed on a 252-CPU machine on a depth-7 group sched_entity:
    cfs_rq->zero_vruntime = -503694424797
    se->vruntime          = -15964975487901
    key                   = -15461281063104 (|key| ~ 2^43.8)
    se->load.weight       = 438118, sum_shift = 0
    key * weight          = -6773865536804998272 ~ -(2^62.55) -> WARN
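As a sanity check, the guard condition can be replayed on the observed
values in userspace. This is only a sketch: guard_trips(), key_of() and
weighted_key() are hypothetical names, the weight is used as-is (matching
sum_shift = 0 above), and the code relies on arithmetic right shift of
negative values, which gcc and clang provide but ISO C leaves
implementation-defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Same test as the quoted WARN_ON_ONCE(): bits 63 and 62 disagree
 * exactly when the value lies outside [-2^62, 2^62).
 * Assumes arithmetic right shift for negative values.
 */
static bool guard_trips(int64_t w_vruntime)
{
	return (w_vruntime >> 63) != (w_vruntime >> 62);
}

/* Hypothetical model of entity_key(): vruntime relative to zero_vruntime. */
static int64_t key_of(int64_t vruntime, int64_t zero_vruntime)
{
	return vruntime - zero_vruntime;
}

/* Plain product; the caller must keep it within the s64 range. */
static int64_t weighted_key(int64_t key, int64_t weight)
{
	assert(weight > 0);
	return key * weight;
}
```

Feeding in the numbers above, key_of(-15964975487901, -503694424797)
gives -15461281063104, and its product with 438118 still fits in an s64
but lies below -2^62, so guard_trips() returns true.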
Re-clamp vlag to the same bound entity_lag() uses, after the new weight
is installed:
    limit = calc_delta_fair(cfs_rq_max_slice(cfs_rq) + TICK_NSEC, se);
    se->vlag = clamp(se->vlag, -limit, limit);
Since calc_delta_fair(t, se) * se->load.weight <= t << NICE_0_LOAD_SHIFT
(equality but for the floor in __calc_delta), this bounds key * weight by
(max_slice + TICK_NSEC) << NICE_0_LOAD_SHIFT independent of the entity's
weight. For the case above (max_slice = 3 ms, TICK_NSEC = 1 ms,
NICE_0_LOAD = 1 << 20):
    limit = (4000000 << 20) / 438118 = 9573457 (~2^23.2)
    |key| clamped from 2^43.8 -> 2^23.2 (20.6 bits, ~1.6e6x)
    |key * weight| <= 9573457 * 438118 = 4194303833926 ~ 2^41.9
i.e. ~20 bits of headroom below the 2^62 guard, so the overflow and the
resulting lockup cannot occur. For legitimate reweights the rescaled
vlag is already within the bound, so the clamp is a no-op.
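The arithmetic above can be reproduced with a small userspace sketch.
It is a model under stated assumptions, not the kernel implementation:
lag_limit() and clamp_s64() are hypothetical names, NICE_0_LOAD_SHIFT is
taken to be 20 as in the example, and an exact floored division replaces
the kernel's __calc_delta() arithmetic.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NICE_0_LOAD_SHIFT 20	/* assumed, as in the example above */

/*
 * Hypothetical model of calc_delta_fair(): delta * NICE_0_LOAD / weight,
 * floored. delta_ns is assumed small enough that the shift cannot overflow.
 */
static int64_t lag_limit(uint64_t delta_ns, uint64_t weight)
{
	assert(weight > 0);
	return (int64_t)((delta_ns << SKETCH_NICE_0_LOAD_SHIFT) / weight);
}

/* Clamp v into [lo, hi]; stands in for the kernel's clamp() macro. */
static int64_t clamp_s64(int64_t v, int64_t lo, int64_t hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}
```

For delta = 4000000 ns and weight = 438118, lag_limit() yields 9573457,
and limit * weight = 4194303833926 stays at or below 4000000 << 20
whatever the weight, because the weight cancels against the division.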
Fixes: 4823725d9d1d ("sched/fair: Increase weight bits for avg_vruntime")
Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
Assisted-by: Claude:claude-opus-4-8
---
PS: do we want some LOGLEVEL_SCHED equivalent WARN_ON for inside
the scheduler, so we can warn without a deadlock?
kernel/sched/fair.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1b23e73f48b0..49b48c5f5746 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4669,6 +4669,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
bool curr = cfs_rq->curr == se;
bool rel_vprot = false;
u64 avruntime = 0;
+ s64 limit;
if (se->on_rq) {
/* commit outstanding execution time */
@@ -4693,6 +4694,23 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
update_load_set(&se->load, weight);
+ /*
+ * rescale_entity() scaled vlag by old_weight/new_weight to preserve
+ * lag across the reweight (vl' = vl * w/w'). On a large weight
+ * decrease this can inflate vlag well past the legal lag bound. Left
+ * unclamped, the resulting se->vruntime = avruntime - vlag (computed
+ * just below for an on_rq entity, or via place_entity() on the next
+ * enqueue for an off_rq one) drifts far from cfs_rq->zero_vruntime, and
+ * a subsequent __enqueue_entity() then overflows entity_key() * weight
+ * in __sum_w_vruntime_add(). Re-clamp to the per-entity lag limit for
+ * the new weight, exactly as entity_lag() does on every fresh lag.
+ * Note calc_delta_fair(t, se) * se->load.weight <= t << NICE_0_LOAD_SHIFT
+ * (equality but for the floor in __calc_delta), so this bounds
+ * key * weight regardless of the entity's weight.
+ */
+ limit = calc_delta_fair(cfs_rq_max_slice(cfs_rq) + TICK_NSEC, se);
+ se->vlag = clamp(se->vlag, -limit, limit);
+
do {
u32 divider = get_pelt_divider(&se->avg);
se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
--
2.53.0-Meta