Re: [PATCH 6/6 v2] sched/eevdf: Speedup short slice task scheduling

From: Peter Zijlstra

Date: Tue Jun 16 2026 - 07:03:52 EST

On Mon, Jun 15, 2026 at 06:24:20PM +0200, Vincent Guittot wrote:
> When a task with a shorter slice is enqueued, we protect the running
> task which has a longer slice until it becomes ineligible instead of a
> full slice in order to speedup the switch to other tasks until the task
> with the shortest slice is scheduled. This helps to the task to not wait
> too many full slices before running.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> ---
> kernel/sched/fair.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 601c67cff185..994fcf3ea702 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1091,7 +1091,10 @@ static inline void set_protect_slice(struct cfs_rq *cfs_rq, struct sched_entity
> slice = cfs_rq_min_slice(cfs_rq);
>
> slice = min(slice, se->slice);
> - if (vruntime != se->vruntime || slice != se->slice)
> +
> + if (sched_feat(PREEMPT_SHORT) && slice < se->slice)
> + vprot = avg_vruntime(cfs_rq);
> + else if ((vruntime != se->vruntime) || (slice != se->slice))
> vprot = min_vruntime(vprot, vruntime + calc_delta_fair(slice, se));
>
> se->vprot = vprot;

I am not entirely sure I understand this one.

avg_vruntime() could be ahead of se->deadline, especially for very short
slices. This would then extend protection beyond the one slice.

Aside from that, there are but two protect_slice() callers that matter:

- pick_eevdf(): this already has a hard limit on avg_vruntime()

- update_curr(): this will trigger preemption when reaching either
->deadline or ->vprot.

Also, the purpose of vprot is similar to the old min_gran: ensure any
task gets *some* time and avoid the degenerate case of endlessly
scheduling without 'any' real progress.

For EEVDF this happens when tasks get arbitrarily close to
avg_vruntime(). E.g., you have the two tasks A,B with A a virtual ns
before avg (and per necessity the other 1 ns after). You run A until it's
just past B, find it's no longer eligible, switch to B and do the same.
This then results in max frequency context switches and minimal actual
progress.
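
To make that ping-pong concrete, here is a toy user-space model, not
kernel code. It assumes two equal-weight tasks, 1 ns ticks, and that
"eligible" simply means vruntime <= average (which for two equal-weight
tasks reduces to vruntime <= the other task's vruntime). The names
toy_eligible(), count_switches() and prot_ns are made up for this sketch;
prot_ns stands in for whatever minimum protection vprot would provide.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy eligibility test for two equal-weight tasks:
 * v <= (v + other) / 2 is equivalent to v <= other.
 */
int toy_eligible(int64_t v, int64_t other)
{
	return v <= other;
}

/*
 * Advance virtual time by total_ns in 1 ns ticks. The running task is
 * protected for prot_ns after it is picked; once that has elapsed it is
 * switched out as soon as it is no longer eligible.
 * Returns the number of context switches (hypothetical model only).
 */
long count_switches(int64_t total_ns, int64_t prot_ns)
{
	int64_t v[2] = { -1, 1 };	/* A is 1 ns before avg, B 1 ns after */
	int64_t ran = 0;		/* ns run since the last switch */
	long switches = 0;
	int cur = 0;			/* start by running A */

	assert(total_ns > 0 && prot_ns >= 0);

	for (int64_t t = 0; t < total_ns; t++) {
		v[cur]++;
		ran++;
		if (ran >= prot_ns && !toy_eligible(v[cur], v[1 - cur])) {
			cur = 1 - cur;
			ran = 0;
			switches++;
		}
	}
	return switches;
}
```

In this model count_switches(1000, 0) switches about every 2 ns, while
count_switches(1000, 100) switches once per 100 ns.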

The thing that was supposed to stop this is vprot, but if you
consistently set vprot at avg_vruntime, this is effectively disabling
vprot. No?

Now, the conditions for this are such that this only happens for all
tasks not of the minimal slice length in the tree. So in other words,
you get spikes of high frequency scheduling just to burn vtime in order
to achieve eligibility for the earliest min_slice task, right?

So what you really want is not avg_vruntime() but the actual
se->vruntime of this earliest min_slice entity. Then we can simply run
whatever task and not get hit with high frequency scheduling, and still
achieve minimal latency for the waiting task.

Now, we don't actually have a convenient way to get this specific task,
but would something like so work?

	if (sched_feat(PREEMPT_SHORT) && slice != se->slice)
		vprot = min_vruntime(vprot, __pick_root_entity(cfs_rq)->vruntime);

That is, we protect until the next earliest task becomes eligible.
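
As a rough illustration of the difference in protection window, here is
a toy calculation, again not kernel code. It assumes plain signed 64-bit
virtual times with no wraparound (the kernel's min_vruntime() handles
wraparound, this sketch does not), and all names (toy_min(),
vprot_at_avg(), vprot_at_waiter(), protect_window()) are hypothetical.
waiter_vruntime stands in for the vruntime of whichever waiting entity is
chosen as the target.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for min_vruntime(): plain minimum, no wraparound handling. */
int64_t toy_min(int64_t a, int64_t b)
{
	return a < b ? a : b;
}

/* Policy from the v2 patch (toy): protect only up to the average. */
int64_t vprot_at_avg(int64_t avg)
{
	return avg;
}

/*
 * Alternative discussed above (toy): protect up to the waiting entity's
 * vruntime, but never beyond the vprot already computed.
 */
int64_t vprot_at_waiter(int64_t vprot, int64_t waiter_vruntime)
{
	return toy_min(vprot, waiter_vruntime);
}

/* Virtual time the running entity may still consume before vprot. */
int64_t protect_window(int64_t vprot, int64_t vruntime)
{
	assert(vprot >= INT64_MIN / 2 && vruntime <= INT64_MAX / 2);
	return vprot > vruntime ? vprot - vruntime : 0;
}
```

With a running entity at vruntime 999 and the average at 1000,
protect_window(vprot_at_avg(1000), 999) leaves a 1 ns window. If the
waiting entity sits at 1500 and the existing vprot is further out,
protect_window(vprot_at_waiter(3000999, 1500), 999) leaves 501 ns.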

Or did I go off the rails somewhere?