Re: [PATCH 0/4] sched: Various reweight_entity() fixes
From: K Prateek Nayak
Date: Fri Feb 13 2026 - 01:06:54 EST
Hello Peter,
On 2/13/2026 12:59 AM, Peter Zijlstra wrote:
> On Thu, Feb 12, 2026 at 06:16:11PM +0100, Peter Zijlstra wrote:
>> On Thu, Feb 12, 2026 at 12:59:43PM +0100, Peter Zijlstra wrote:
>>> On Thu, Feb 12, 2026 at 01:13:30PM +0530, K Prateek Nayak wrote:
>>
>>>> Enqueue cfs_rq: depth(1) weight(2144337920) nr_queued(2045) sum_w_vruntime(18343811230990336) sum_weight(2144337920) zero_vruntime(18337852274) sum_shift(0) avg_vruntime(18337852274)
>>>> Dequeue cfs_rq: depth(1) weight(2144337920) nr_queued(2045) sum_w_vruntime(1303379968) sum_weight(2143289344) zero_vruntime(18354971669) sum_shift(0) avg_vruntime(18354971669)
>>
>> After staring at waaay too many traces of my own, which confirm what
>> you're seeing, but didn't want to make any sense either...
>>
>> ... I thinks I found it ...
>>
>> Note that sum_weight, that has _just_ flipped bit 31 and thus turned
>> negative *if* it were a 32bit number. Every time, exactly at this point
>> things started to go sideways.
>>
>> So I went looking for where this might be and I found the below.
>>
>> My latest run has passed 0x80000000/(1024*1024) = 2048 tasks and is still
>> running. Numbers are still sane, fingers crossed.
>>
>
> I made it all the way to 3k tasks on the one CPU and all is well.
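For reference, the bit-31 arithmetic quoted above can be sanity-checked with a small userspace sketch. The helper names are made up for illustration and are not kernel functions; the only assumption taken from the traces is the per-task weight of 1048576.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of equal-weight tasks at which the summed
 * weight reaches 2^31, i.e. the point where bit 31 becomes set.
 * Only exact when per_task_weight divides 2^31.
 */
static uint64_t tasks_to_reach_bit31(uint64_t per_task_weight)
{
	return (UINT64_C(1) << 31) / per_task_weight;
}

/*
 * Hypothetical helper: would this sum look negative if some code path
 * treated it as a signed 32-bit value? Tests bit 31 directly instead of
 * relying on an implementation-defined narrowing cast.
 */
static int looks_negative_as_s32(uint64_t sum_weight)
{
	return (int)((sum_weight >> 31) & 1);
}
```

With a weight of 1048576 per task, 2048 tasks sum to exactly 0x80000000.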
So the current peterz:sched/core refused to boot to a prompt for me (hopefully
it is just you forgetting to push out the latest branch ;-)
I have the following situation:
Pick failed to find eligible entities. Dumping cfs_rq
cfs_rq: depth(0) weight(3145728) nr_queued(3) sum_w_vruntime(-3074459814163644416) sum_weight(3145728) zero_vruntime(18446688365342623754) sum_shift(0)
cfs_rq after avg_vruntime(): sum_w_vruntime(1048576) sum_weight(3145728) zero_vruntime(18446687387998169890) avg_vruntime(18446687387998169890)
se: weight(1048576) vruntime(18446691297375985345) slice(700000) deadline(18446691297376644150) curr?(1) task?(1) se_depth(0) eligible?(1)
se: weight(1048576) vruntime(18446682501282949076) slice(700000) deadline(18446682501283649076) curr?(0) task?(1) se_depth(0) eligible?(0)
se: weight(1048576) vruntime(18446688365342622487) slice(700000) deadline(18446688365342972487) curr?(0) task?(1) se_depth(0) eligible?(0)
se: weight(1048576) vruntime(18446691297368938108) slice(700000) deadline(18446691297369619108) curr?(0) task?(1) se_depth(0) eligible?(1)
Note: sum_w_vruntime grows pretty large beforehand, but after calling
avg_vruntime() things are back to normal.
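The two cfs_rq lines in the dump above are consistent with a simple re-basing step, sketched here in userspace C. This is only an illustration built from the printed fields, not the kernel code: the struct and function names are hypothetical, curr's contribution is ignored, and the floor division is an assumption that happens to reproduce the dump.

```c
#include <stdint.h>

/* Hypothetical mirror of the three fields printed in the dump. */
struct rq_sums {
	int64_t  sum_w_vruntime;	/* sum of weight * (vruntime - zero_vruntime) */
	int64_t  sum_weight;
	uint64_t zero_vruntime;
};

/* Division rounding towards negative infinity; assumes d > 0. */
static int64_t floor_div(int64_t n, int64_t d)
{
	int64_t q = n / d;

	if (n % d < 0)
		q--;
	return q;
}

/*
 * Move zero_vruntime onto the weighted average and shrink the sum by the
 * same amount, so that only the division remainder stays in the sum.
 */
static void rebase_on_avg(struct rq_sums *s)
{
	int64_t delta;

	if (!s->sum_weight)
		return;

	delta = floor_div(s->sum_w_vruntime, s->sum_weight);
	s->sum_w_vruntime -= delta * s->sum_weight;
	s->zero_vruntime += (uint64_t)delta;	/* unsigned wrap-around is intended */
}
```

Fed with the "before" values from the dump, this yields sum_w_vruntime(1048576) and zero_vruntime(18446687387998169890), matching the "after" line.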
Since this is so early in the boot, I traced the heck out of
it, and one of the observations was how sum_w_vruntime keeps
growing (in magnitude) from the initial zero_vruntime:
Legend: '|' is a place_entity(), "->" is enqueue, "<-" is dequeue.
Lines containing "*" show the state "before" the action;
lines without "*" show the state "after" the action.
Lines with only "cfs_rq" show the state before and after avg_vruntime() in entity_tick().
vruntime: CPU(0) se vruntime(18446744073708503040) vlag(0) deadline(0) | cfs_rq sum_w_vruntime(0) sum_weight(0) zero_vruntime(18446744073708503040)
vruntime: CPU(0) *se vruntime(18446744073708503040) vlag(0) deadline(18446744073708853040) -> cfs_rq sum_w_vruntime(0) sum_weight(0) zero_vruntime(18446744073708503040)
vruntime: CPU(0) se vruntime(18446744073708503040) vlag(0) deadline(18446744073708853040) -> cfs_rq sum_w_vruntime(0) sum_weight(1048576) zero_vruntime(18446744073708503040)
vruntime: CPU(0) se vruntime(18446744073708503040) vlag(0) deadline(0) | cfs_rq sum_w_vruntime(0) sum_weight(1048576) zero_vruntime(18446744073708503040)
vruntime: CPU(0) *se vruntime(18446744073708503040) vlag(0) deadline(18446744073708853040) -> cfs_rq sum_w_vruntime(0) sum_weight(1048576) zero_vruntime(18446744073708503040)
vruntime: CPU(0) se vruntime(18446744073708503040) vlag(0) deadline(18446744073708853040) -> cfs_rq sum_w_vruntime(0) sum_weight(2097152) zero_vruntime(18446744073708503040)
vruntime: CPU(0) *se vruntime(18446744073708503040) vlag(0) deadline(18446744073708853040) <- cfs_rq sum_w_vruntime(0) sum_weight(2097152) zero_vruntime(18446744073708503040)
vruntime: CPU(0) se vruntime(18446744073708503040) vlag(0) deadline(18446744073708853040) <- cfs_rq sum_w_vruntime(0) sum_weight(1048576) zero_vruntime(18446744073708503040)
vruntime: CPU(0) *cfs_rq: cfs_rq sum_w_vruntime(0) sum_weight(1048576) zero_vruntime(18446744073708503040)
At this point:
curr->vruntime = 951329; /* should be the same as the computed avg */
u64 zero_vruntime = (u64)(-1048576LL); /* From init */
s64 sum_w_vruntime = 0;
s64 delta = curr->vruntime - zero_vruntime; /* = 1999905, net positive. */
sum_w_vruntime -= delta * sum_weight; /* 1999905 * 1048576 */
/* sum_w_vruntime == -2097052385280. Checks out! */
vruntime: CPU(0) cfs_rq: cfs_rq sum_w_vruntime(-2097052385280) sum_weight(1048576) zero_vruntime(951329) avg_vruntime(951329)
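The step above only produces a small positive delta because the u64 subtraction wraps around. A rough userspace sketch of that arithmetic follows; the function names are hypothetical and this is not the kernel implementation, just the calculation written out with well-defined C conversions.

```c
#include <stdint.h>

/*
 * Signed distance from b to a for two u64 timestamps.
 * The subtraction wraps modulo 2^64; the result is then mapped to s64
 * without relying on an implementation-defined cast.
 */
static int64_t u64_delta(uint64_t a, uint64_t b)
{
	uint64_t d = a - b;

	if (d > (uint64_t)INT64_MAX)
		return -(int64_t)(~d) - 1;
	return (int64_t)d;
}

/*
 * Value of sum_w_vruntime after zero_vruntime moves from old_zero to
 * new_zero, assuming every queued key shifts by the same delta.
 */
static int64_t sum_after_move(int64_t sum_w_vruntime, int64_t sum_weight,
			      uint64_t old_zero, uint64_t new_zero)
{
	return sum_w_vruntime - u64_delta(new_zero, old_zero) * sum_weight;
}
```

With old_zero = (u64)(-1048576LL), new_zero = 951329 and sum_weight = 1048576, the delta is 1999905 and the resulting sum is -2097052385280, as in the trace.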
After this point all deltas are positive and we keep
subtracting from sum_w_vruntime.
vruntime: CPU(0) *cfs_rq: cfs_rq sum_w_vruntime(-2097052385280) sum_weight(1048576) zero_vruntime(951329)
vruntime: CPU(0) cfs_rq: cfs_rq sum_w_vruntime(-4194182365184) sum_weight(1048576) zero_vruntime(2951308) avg_vruntime(2951308)
vruntime: CPU(0) *cfs_rq: cfs_rq sum_w_vruntime(-4194182365184) sum_weight(1048576) zero_vruntime(2951308)
vruntime: CPU(0) cfs_rq: cfs_rq sum_w_vruntime(-6291298713600) sum_weight(1048576) zero_vruntime(4951274) avg_vruntime(4951274)
vruntime: CPU(0) *cfs_rq: cfs_rq sum_w_vruntime(-6291298713600) sum_weight(1048576) zero_vruntime(4951274)
vruntime: CPU(0) cfs_rq: cfs_rq sum_w_vruntime(-8388566056960) sum_weight(1048576) zero_vruntime(6951384) avg_vruntime(6951384)
vruntime: CPU(0) *cfs_rq: cfs_rq sum_w_vruntime(-8388566056960) sum_weight(1048576) zero_vruntime(6951384)
vruntime: CPU(0) cfs_rq: cfs_rq sum_w_vruntime(-10485571256320) sum_weight(1048576) zero_vruntime(8951244) avg_vruntime(8951244)
vruntime: CPU(0) *cfs_rq: cfs_rq sum_w_vruntime(-10485571256320) sum_weight(1048576) zero_vruntime(8951244)
vruntime: CPU(0) cfs_rq: cfs_rq sum_w_vruntime(-12582749470720) sum_weight(1048576) zero_vruntime(10951269) avg_vruntime(10951269)
vruntime: CPU(0) *cfs_rq: cfs_rq sum_w_vruntime(-12582749470720) sum_weight(1048576) zero_vruntime(10951269)
vruntime: CPU(0) cfs_rq: cfs_rq sum_w_vruntime(-14680032542720) sum_weight(1048576) zero_vruntime(12951394) avg_vruntime(12951394)
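The run of cfs_rq lines above can be replayed with a few lines of userspace C to confirm that sum_w_vruntime is just the accumulated zero_vruntime movement times sum_weight. This is a sketch under stated assumptions: replay_sum() is a hypothetical helper, sum_weight stays constant, and zero_vruntime only moves forward by less than 2^63 per step.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Replay a sequence of zero_vruntime values and return the resulting
 * sum_w_vruntime, starting from 0 at zero[0]. Each step subtracts
 * (zero[i] - zero[i - 1]) * sum_weight; the u64 subtraction wraps, so
 * the first step from the initial (u64)-1048576 works out as well.
 */
static int64_t replay_sum(const uint64_t *zero, size_t n, int64_t sum_weight)
{
	int64_t sum = 0;
	size_t i;

	for (i = 1; i < n; i++)
		sum -= (int64_t)(zero[i] - zero[i - 1]) * sum_weight;

	return sum;
}
```

Replaying the eight zero_vruntime values from the trace (the initial one plus the seven ticks) with sum_weight = 1048576 ends at -14680032542720, the value on the last line.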
I still haven't found how we end up in a situation where avg_vruntime
is close to -56685711381726LL at the time of the crash :-(
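For anyone matching that number against the dump: it is the u64 avg_vruntime from the "cfs_rq after avg_vruntime()" line read as a signed value. A minimal sketch of the conversion, with as_s64() being a hypothetical helper rather than anything from the kernel:

```c
#include <stdint.h>

/*
 * Reinterpret a u64 as a two's complement s64 without relying on an
 * implementation-defined out-of-range cast.
 */
static int64_t as_s64(uint64_t v)
{
	if (v > (uint64_t)INT64_MAX)
		return -(int64_t)(~v) - 1;
	return (int64_t)v;
}
```

as_s64(18446687387998169890ULL) gives -56685711381726.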
Will update if I find something.
--
Thanks and Regards,
Prateek