Re: [RFC][PATCH 8/8] sched/eevdf: Move to a single runqueue

From: K Prateek Nayak

Date: Wed Mar 18 2026 - 05:49:59 EST


Hello Peter,

On 3/18/2026 2:32 PM, Peter Zijlstra wrote:
> On Tue, Mar 17, 2026 at 11:16:52PM +0530, K Prateek Nayak wrote:
>
>>> +	/*
>>> +	 * XXX comment on the curr thing
>>> +	 */
>>> +	curr = (cfs_rq->curr == se);
>>> +	if (curr)
>>> +		place_entity(cfs_rq, se, flags);
>>>
>>> +	if (se->on_rq && se->sched_delayed)
>>> +		requeue_delayed_entity(cfs_rq, se);
>>>
>>> +	weight = enqueue_hierarchy(p, flags);
>>
>> Here is a question I had when I first saw this on sched/flat and I've
>> only looked at the series briefly:
>>
>> enqueue_hierarchy() would end up updating the averages and reweighting
>> the hierarchical load of the entities in the new task's hierarchy ...
>>
>>>
>>> +	if (!curr) {
>>> +		reweight_eevdf(cfs_rq, se, weight, false);
>>> +		place_entity(cfs_rq, se, flags | ENQUEUE_QUEUED);
>>
>> ... and the hierarchical weight of the newly enqueued task would be
>> based on this updated hierarchical proportion.
>>
>> However, the tasks that are already queued have their deadlines
>> calculated based on the old hierarchical proportions at the time they
>> were enqueued / during the last task_tick_fair() for an entity that
>> was put back.
>>
>> Consider two tasks of equal weight on cgroups with equal weights:
>>
>>            root (weight: 1024)
>>           /    \
>>         CG0    CG1    (weight(CG0,CG1) = 512)
>>          |      |
>>         T0     T1     (h_weight(T0,T1) = 256)
>>
>>
>> and a third task of equal weight arrives (for the sake of simplicity
>> also consider both cgroups have saturated their respective global
>> shares on this CPU - similar to UP mode):
>>
>>
>>                     root (weight: 1024)
>>                    /    \
>>    (weight: 512) CG0    CG1 (weight: 512)
>>                   |    /   \
>>                  T0   T1    T2
>>
>>    (h_weight(T0) = 256, h_weight(T1) = 256, h_weight(T2) = 128)
>>
>>
>> Logically, once T2 arrives, T1 should also be reweighted, its
>> hierarchical proportions adjusted, and its vruntime and deadline
>> adjusted accordingly based on its lag, but that doesn't happen.
>
> You are absolutely right.
>
>> Instead, we continue with an approximation of h_load as seen
>> sometime during the past. Is that alright with EEVDF or am I missing
>> something?
>
> Strictly speaking it is dodgy as heck ;-) I was hoping that on average
> it would all work out. Esp. since PELT is a fairly slow and smooth
> function, the reweights will mostly be minor adjustments.

For a stable system, that is correct, but with a bunch of migrations
in the mix, even the averages tend to move quite rapidly, which is
why we already ratelimit the tg->shares calculation to once per
millisecond :-)

>
>> Can it happen that on SMP, future enqueues and load conditions
>> always lead to a larger h_load for the newly enqueued tasks, and as
>> a result the older tasks become less favorable for the pick, leading
>> to starvation? (Am I being paranoid?)
>
> So typically the most recent enqueue will always have the smaller
> fraction of the group weight. This would lead to a slight favour to the
> older enqueue. So I think this would lead to a FIFO like bias.
>
> But there is definitely some fun to be had here.
>
> One definite fix is setting cgroup_mode to 'up' :-)

A definite fix indeed, but I'm pretty sure people will start complaining
about more preemptions, etc., now that their cgroups have lost the
nice -20 equivalent privilege on these large systems, and they have to
go and change those shares to painfully small values for performance.

>
>>> +		__enqueue_entity(cfs_rq, se);
>>> +	}
>>>
>>> if (!rq_h_nr_queued && rq->cfs.h_nr_queued)
>>
>> Anyhow, me goes and sees if any of this makes a difference to the
>> benchmarks - I'll throw the biggest one at it first and see how
>> that goes.
>
> Thanks, fingers crossed. :-)

Last time I ran sched/flat, it held up surprisingly well actually.
Performance was not terribly bad (and even better in some cases), but
that was before we fleshed out the whole increased weight bits for
the EEVDF calculations, so maybe it is all better now.

We'll know soon enough ;-)

--
Thanks and Regards,
Prateek