Re: [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue

From: Tobias Huschle
Date: Mon Apr 22 2024 - 09:14:17 EST


On Fri, Apr 05, 2024 at 12:28:02PM +0200, Peter Zijlstra wrote:
> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
> noting that lag is fundamentally a temporal measure. It should not be
> carried around indefinitely.
>
> OTOH it should also not be instantly discarded, doing so will allow a
> task to game the system by purposefully (micro) sleeping at the end of
> its time quantum.
>
> Since lag is intimately tied to the virtual time base, a wall-time
> based decay is also insufficient, notably competition is required for
> any of this to make sense.
>
> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
> competing until they are eligible.
>
> Strictly speaking, we only care about keeping them until the 0-lag
> point, but that is a difficult proposition, instead carry them around
> until they get picked again, and dequeue them at that point.
>
> Since we should have dequeued them at the 0-lag point, truncate lag
> (eg. don't let them earn positive lag).
>

I stumbled upon a subtle change between CFS and EEVDF which kinda relates
to the question how long lag should be carried around, but on the other
side of the 0-lag value.

Assume that we have two tasks taking turns on a single CPU.
Task 1 does something and wakes up Task 2.
Task 2 does something and goes to sleep.
And we're just repeating that.
Task 1 and task 2 only run for very short amounts of time, i.e. much
shorter than a regular time slice.

Let's further assume, that task 1 runs longer than task 2.
In CFS, this means, that vruntime of task 1 starts to outrun the vruntime
of task 2. This means that vruntime(task2) < vruntime(task1). Hence, task 2
always gets picked on wake up because it has the smaller vruntime.
In EEVDF, this would translate to a permanent positive lag, which also
causes task 2 to get consistently scheduled on wake up.

Let's now assume, that ocassionally, task 2 runs a little bit longer than
task 1. In CFS, this means, that task 2 can close the vruntime gap by a
bit, but, it can easily remain below the value of task 1. Task 2 would
still get picked on wake up.
With EEVDF, in its current form, task 2 will now get a negative lag, which
in turn, will cause it not being picked on the next wake up.

So, it seems we have a change in the level of how far the both variants look
into the past. CFS being willing to take more history into account, whereas
EEVDF does not (with update_entity_lag setting the lag value from scratch,
and place_entity not taking the original vruntime into account).

All of this can be seen as correct by design, a task consumes more time
than the others, so it has to give way to others. The big difference
is now, that CFS allowed a task to collect some bonus by constantly using
less CPU time than others and trading that time against ocassionally taking
more CPU time. EEVDF could do the same thing, by allowing the accumulation
of (positive) lag, which can then be traded against the one time the task
would get negative lag.

This might of course clash with other EEVDF assumptions but blends with the
question on how long lag should be preserved.

I stumbled upon a performance degredation caused by this nuance. EEVDF is
not necessarily at fault here. The problem rather lies in a component
that relies on CFS allowing the above mentioned bonus aggregation while
knowing that the corresponding task 2 has to run.
See: https://lore.kernel.org/netdev/ZWbapeL34Z8AMR5f@DESKTOP-2CCOB1S./T/
With the reasoning above, the degradation can be explained perfectly, and
even be fixed by allowing the accumulation of lag by applying the patch
below.

I was just wondering if my assumptions here are correct and whether this
would be an implicit change that one should be aware about.


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03be0d1330a6..b83a72311d2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -701,7 +701,7 @@ static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
s64 lag, limit;

SCHED_WARN_ON(!se->on_rq);
- lag = avg_vruntime(cfs_rq) - se->vruntime;
+ lag = se->vlag + avg_vruntime(cfs_rq) - se->vruntime;

limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
se->vlag = clamp(lag, -limit, limit);