Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
From: K Prateek Nayak
Date: Mon Mar 30 2026 - 04:03:14 EST
On 2/23/2026 5:21 PM, Peter Zijlstra wrote:
>> We should always have
>> key < 110ms (max slice+max tick) * nice_0 (2^20) / weight (2)
>> key < 2^46
>>
>> We can use 50 bits to get margin
>>
>> Weight is always less than 27 bits and key*weight gives us 110ms (max
>> slice+max tick) * nice_0 (2^20), so we should never add more than 2^47
>> to ->sum_w_vruntime
>>
>> so a WARN_ONCE(cfs_rq->sum_w_vruntime > 2^63) should be enough
>
> Ha, I was >< close to pushing out these patches when I saw this.
>
> The thing is signed, so bit 63 is the sign bit, but I suppose we can
> test bit 62 like so:
>
> Let me go build and boot that.
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -679,9 +679,13 @@ static inline void
> __sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight);
> - s64 key = entity_key(cfs_rq, se);
> + s64 w_vruntime, key = entity_key(cfs_rq, se);
>
> - cfs_rq->sum_w_vruntime += key * weight;
> + w_vruntime = key * weight;
> +
> + WARN_ON_ONCE((w_vruntime >> 63) != (w_vruntime >> 62));
I was trying to reproduce the crash that John mentioned on Patch 1, and
although I couldn't reproduce that crash (yet), I tripped this warning
when running the stress-ng yield test (32 copies x 256 children + sched
messaging, 16 groups) on my dual-socket system (2 x 64C/128T):
------------[ cut here ]------------
(w_vruntime >> 63) != (w_vruntime >> 62)
WARNING: kernel/sched/fair.c:692 at __enqueue_entity+0x382/0x3a0, CPU#5: stress-ng/5062
Modules linked in: ...
CPU: 5 UID: 1000 PID: 5062 Comm: stress-ng Not tainted 7.0.0-rc5-topo-test+ #40 PREEMPT(full)
Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
RIP: 0010:__enqueue_entity+0x382/0x3a0
Code: 4c 89 4b 48 4c 89 4b 50 e9 61 fe ff ff 83 f9 3f 0f 87 b8 27 e5 ff 49 d3 ec b8 02 00 00 00 49 39 c4 4c 0f 42 e0 e9 16 ff ff ff <0f> 0b e9 d8 fc ff ff 0f 0b e9 e1 fe ff ff 0f 0b 66 66 2e 0f 1f 84
RSP: 0018:ffffcf6b8ea88c18 EFLAGS: 00010002
RAX: bf38ba3b09dc2400 RBX: ffff8d546f832680 RCX: ffffffffffffffff
RDX: fffffffffffffffe RSI: ffff8d1587818080 RDI: ffff8d546f832680
RBP: ffff8d1587818080 R08: 000000000000002d R09: 00000000ffffffff
R10: 0000000000000001 R11: ffffcf6b8ea88ff8 R12: 00000000056ae400
R13: ffff8d1587818080 R14: 0000000000000001 R15: ffff8d546f832680
FS: 00007f8c742c9740(0000) GS:ffff8d54b0c82000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2e09381358 CR3: 00000040db039002 CR4: 0000000000f70ef0
PKRU: 55555554
Call Trace:
<IRQ>
enqueue_task_fair+0x1a3/0xe50
? srso_alias_return_thunk+0x5/0xfbef5
? place_entity+0x21/0x160
enqueue_task+0x88/0x1b0
ttwu_do_activate+0x74/0x1c0
try_to_wake_up+0x277/0x840
...
asm_sysvec_call_function_single+0x1a/0x20
RIP: 0010:do_sched_yield+0x73/0xa0
Code: 89 df 48 8b 80 e8 02 00 00 48 8b 40 18 e8 75 a9 fd 00 65 ff 05 9e 94 fc 02 66 90 48 8d 7b 48 e8 d3 96 fd 00 fb 0f 1f 44 00 00 <65> ff 0d 86 94 fc 02 5b e9 10 12 fd 00 83 83 70 0d 00 00 01 eb bb
RSP: 0018:ffffcf6b9b7bbd78 EFLAGS: 00000282
RAX: ffffffffbbbb4560 RBX: ffff8d546f832580 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8d546f8325c8
RBP: ffffcf6b9b7bbf38 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d15d1070000
R13: 0000000000000018 R14: 0000000000000000 R15: 0000000000000018
? __pfx_yield_task_fair+0x10/0x10
? do_sched_yield+0x6d/0xa0
__do_sys_sched_yield+0xe/0x20
...
Since this wasn't supposed to trip, I'm assuming we are somehow in
wrap-around territory again :-(
I don't see anything particularly interesting in the sched/debug
entry after the fact:
cfs_rq[5]:/user.slice
.left_deadline : 26249498461.397509
.left_vruntime : 26249498270.250843
.zero_vruntime : 26249456859.395628
.sum_w_vruntime : 2547538312135680 (51 bits)
.sum_weight : 61440
.sum_shift : 0
.avg_vruntime : 26249498338.158417
.right_vruntime : 26249498381.633124
.spread : 111.382281
.nr_queued : 5
.h_nr_runnable : 5
.h_nr_queued : 5
.h_nr_idle : 0
...
I still haven't figured out how this happens; I'll start running with
some debug prints next.
On a tangential note, now that we only yield based on eligibility, would
something like the below make sense?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 226509231e67..55ab1f58d703 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9265,9 +9265,10 @@ static void yield_task_fair(struct rq *rq)
struct sched_entity *se = &curr->se;
/*
- * Are we the only task in the tree?
+ * Single task is always eligible on the cfs_rq.
+ * Don't pull the vruntime needlessly.
*/
- if (unlikely(rq->nr_running == 1))
+ if (unlikely(cfs_rq->nr_queued == 1))
return;
clear_buddies(cfs_rq, se);
--
Thanks and Regards,
Prateek