Re: [RFC][PATCH 8/8] sched/eevdf: Move to a single runqueue

From: K Prateek Nayak

Date: Tue Mar 17 2026 - 13:47:36 EST


Hello Peter,

On 3/17/2026 3:21 PM, Peter Zijlstra wrote:
> Change fair/cgroup to a single runqueue.

Looks like Christmas arrived early :-)

[..snip..]

> + /*
> + * XXX comment on the curr thing
> + */
> + curr = (cfs_rq->curr == se);
> + if (curr)
> + place_entity(cfs_rq, se, flags);
>
> - se->slice = slice;
> - if (se != cfs_rq->curr)
> - min_vruntime_cb_propagate(&se->run_node, NULL);
> - slice = cfs_rq_min_slice(cfs_rq);
> + if (se->on_rq && se->sched_delayed)
> + requeue_delayed_entity(cfs_rq, se);
>
> - cfs_rq->h_nr_runnable += h_nr_runnable;
> - cfs_rq->h_nr_queued++;
> - cfs_rq->h_nr_idle += h_nr_idle;
> + weight = enqueue_hierarchy(p, flags);

Here is a question I had when I first saw this on sched/flat; note
that I've only looked at the series briefly so far:

enqueue_hierarchy() would end up updating the averages and reweighting
the hierarchical load of the entities in the new task's hierarchy ...

>
> - if (cfs_rq_is_idle(cfs_rq))
> - h_nr_idle = 1;
> + if (!curr) {
> + reweight_eevdf(cfs_rq, se, weight, false);
> + place_entity(cfs_rq, se, flags | ENQUEUE_QUEUED);

... and the hierarchical weight of the newly enqueued task would be
based on this updated hierarchical proportion.

However, the tasks that are already queued have their deadlines
calculated based on the old hierarchical proportions at the time they
were enqueued / during the last task_tick_fair() for an entity that
was put back.

Consider two tasks of equal weight on cgroups with equal weights:

            root    (weight: 1024)
           /    \
        CG0      CG1    (weight(CG0, CG1) = 512)
         |        |
        T0       T1     (h_weight(T0, T1) = 256)


and a third task of equal weight arrives (for the sake of simplicity,
also consider that both cgroups have saturated their respective global
shares on this CPU - similar to UP mode):


                          root    (weight: 1024)
                         /    \
     (weight: 512)    CG0      CG1    (weight: 512)
                       |      /   \
(h_weight(T0) = 256)  T0    T1     T2    (h_weight(T2) = 128)

                     (h_weight(T1) = 256)


Logically, once T2 arrives, T1 should also be reweighted, its
hierarchical proportion adjusted, and its vruntime and deadline
adjusted accordingly based on the lag, but that doesn't happen.

Instead, we continue with an approximation of h_load as seen at some
point in the past. Is that alright with EEVDF, or am I missing
something?

Can it happen that on SMP, future enqueues and SMP conditions always
lead to a larger h_load for the newly enqueued tasks, so that the
older tasks become less favorable for the pick, leading to
starvation? (Am I being paranoid?)

> + __enqueue_entity(cfs_rq, se);
> }
>
> if (!rq_h_nr_queued && rq->cfs.h_nr_queued)

Anyhow, off I go to see if any of this makes a difference to the
benchmarks - I'll throw the biggest one at it first and see how
that goes.

--
Thanks and Regards,
Prateek