Re: [PATCH RESEND] sched/fair: Fix overflow in vruntime_eligible()

From: David Laight

Date: Fri May 01 2026 - 09:06:14 EST

On Fri, 1 May 2026 12:40:06 +0200
Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> On Wed, Apr 29, 2026 at 10:33:48AM +0530, K Prateek Nayak wrote:
>
> > > Anyway... let me construct a worst case.
> > >
> > > So if you have this cgroup crap on, then you can have an entity of
> > > weight 2, and vlag should then be bounded by: (slice+TICK_NSEC) *
> > > NICE_0_LOAD, which is around 44 bits as per the comment on entity_key().
> > >
> > > The other end is 100*NICE_0_LOAD, so lets wake that, then you get:
> > >
> > > {key, weight}[] := {
> > > puny: { (slice + TICK_NSEC) * NICE_0_LOAD, 2 },
> > > max: { 0, 100*NICE_0_LOAD },
> > > }
> >
> > So MAX_SHARES is (1UL << 18) and we then scale it up so the weight can
> > can go up to (1 << 28) I suppose in UP setting.
> >
> > So let us assume there is a single task with the largest custom slice on
> > each cgroup making the min_slice equal to 100ms. Things start off in
> > such a way that puny having a vruntime of say -1 is picked over max which
> > had a vruntime of 0 for the sake of simplicity.
> >
> > Deadline then becomes:
> >
> > puny->deadline = -1 + __calc_delta(100ms, NICE_0_LOAD, 2)
> >
> > which is 52428800000000 - 1.
> >
> > At 10HZ tick + RUN_TO_PARITY, we get 10ms error so that puts the
> > entity_key() at 57671680000000 (46 bits)
> >
> > >
> > > The avg_vruntime() would end up being very close to 0 (which is
> > > zero_vruntime), so no real help making that more accurate.
> > >
> > > vruntime_eligible(puny) ends up with:
> > >
> > > avg = 2 * puny.key (+ 0)
> > > weight = 2 + 100 * NICE_0_LOAD
> > >
> > > avg >= puny.key * weight
> > >
> > > And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100
> > > > and yes, I suppose that can exceed 64bit :-(
> >
> > Then the worst case right side would be:
> >
> > 57671680000000 * load /* load = ((1 << 28) + 2) */
> >
> > and yes that goes beyond 64-bits at 15481123719086080000000.
> >
> > I'm running with this configurations:
> >
> > echo 2 > /sys/fs/cgroup/cg0/cpu.weight
> > echo 10000 > /sys/fs/cgroup/cg1/cpu.weight
> >
> > Weight of 2 actually gets scaled up to 20480 for UP scenario so I go try
> > to make it even smaller with SMP and wait until it gets picked again.
>
> Make sure all of the actual activity of that cgroup happens 'elsewhere'
> and you can get it down to an actual 2 per calc_group_shares().
>
> > without any splat so I'm not sure if there is something that prevents
> > a possible crash since a weight of 104857600 should have definitely
> > made that entity_key() overflow.
>
> Right.
>
> Anyway, I had a poke around with godbolt, and the below seems to
> generate the best code for things like x86_64 and arm64.
>
> Specifically, the __builtin_mul_overflow() already has to compute the
> 128 bit product anyway for most architectures, so using that directly
> then leads to saner asm and easier to understand code.

Older versions of gcc (maybe before 10?) generate a call to the full
128 bit multiply (_muldid3 ?) on mips64 - rather than just using the instruction
that generates the high part of the product.
The library function is now included but is slightly more complex because
it does the high*low multiples as well.

The same happens on sparc64 and the asm multiply code looks strange.

>
> AFAICT HPPA64 is the only 64bit architecture that doesn't implement
> __int128 and will thus be demoted to doing what we do on 32bit.
>
> Now all I need to do is write a coherent Changelog to go with it ...
> *sigh*
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 728965851842..214fe9d99834 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -904,7 +904,7 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
> load += weight;
> }
>
> - return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
> + return avg >= (sched_double_long_t)vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
> }
>
> int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9f63b15d309d..f15f4468efe5 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -146,7 +146,7 @@ extern struct list_head asym_cap_list;
> * Really only required when CONFIG_FAIR_GROUP_SCHED=y is also set, but to
> * increase coverage and consistency always enable it on 64-bit platforms.
> */
> -#ifdef CONFIG_64BIT
> +#if defined(CONFIG_64BIT) && defined(__SIZEOF_INT128__)
> # define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
> # define scale_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT)
> # define scale_load_down(w) \
> @@ -157,10 +157,12 @@ extern struct list_head asym_cap_list;
> __w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT); \
> __w; \
> })
> +typedef __int128 sched_double_long_t;
> #else
> # define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT)
> # define scale_load(w) (w)
> # define scale_load_down(w) (w)
> +typedef s64 sched_double_long_t;

Seems strange that this is signed but the 64bit one is unsigned.

-- David

> #endif
>
> /*
>