Re: CFS flat runqueue proposal fixes/update

From: Dietmar Eggemann
Date: Fri Aug 07 2020 - 10:14:35 EST


Hi Rik,

On 31/07/2020 09:42, Rik van Riel wrote:
> Hello,
>
> last year at Linux Plumbers conference, I presented on my work
> of turning the hierarchical CFS runqueue into a flat runqueue,
> and Paul Turner pointed out some corner cases that could not
> work with my design as it was last year.
>
> Paul pointed out two corner cases, and I have come up with a
> third one myself, but I believe they all revolve around the
> same issue, admission control, and can all be solved with the
> same solution.

[...]

> Possible solution
>
> A possible solution to the problems above consists of
> three parts:
> - Instead of accounting all used wall clock CPU time into
> vruntime immediately, at the task's current hierarchical
> priority, the vruntime can be accounted for piecemeal, for
> example in amounts corresponding to "one timeslice at full
> priority".
> - If there is only one runnable task on the runqueue,
> all the runtime can be accounted into vruntime in one go.
> - Tasks that cannot account all of their used CPU time
> into vruntime at once can be removed from the root runqueue,
> and placed into the cgroup runqueue. A heap of cgroup
> runqueues with "overloaded" tasks can be attached to the
> main runqueue, where the left-most task from that heap of
> heaps gets some vruntime accounted every time we go into
> pick_next_task.
> - The difference between the vruntime of each task in that
> heap and the vruntime of the root runqueue can help determine
> how much vruntime can be accounted to that task at once.
> - If the task, or its runqueue, is no longer the left-most
> in the heap after getting vruntime accounted, that runqueue
> and the queue of runqueues can be resorted.
> - Once a task has accounted all of its outstanding delta
> exec runtime into vruntime, it can be moved back to the
> main runqueue.
> - This should solve the unequal task weight scenario Paul
> Turner pointed out last year, because after task t1 goes
> to sleep and only t2 and t3 remain on the CPU, t2 will
> get its delta exec runtime converted into vruntime at
> its new priority (equal to t3).
> - By only accounting delta exec runtime to vruntime for
> the left-most task in the "overloaded" heap at one time,
> we guarantee only one task at a time will be added back
> into the root runqueue.
> - Every time a task is added to the root runqueue, that
> slows down the rate at which vruntime advances, which
> in turn reduces the rate at which tasks get added back
> into the runqueue, and makes it more likely that a currently
> running task with low hierarchical priority gets booted
> off into the "overloaded" heap.
>
> To tackle the thundering herd at task wakeup time, another
> strategy may be needed. One thing we may be able to do
> there is place tasks into the "overloaded" heap immediately
> on wakeup, if the hierarchical priority of the task is so
> low that if the task were to run a minimal timeslice length,
> it would be unable to account that time into its vruntime
> in one go, AND the CPU already has a larger number of tasks
> on it.
>
> Because the common case on most systems is having just
> 0, 1, or 2 runnable tasks on a CPU, this fancy scheme
> should rarely end up being used, and even when it is the
> overhead should be reasonable because most of the
> overloaded tasks will just sit there until pick_next_task
> gets around to them.
>
> Does this seem like something worth trying out?
>
> Did I overlook any other corner cases that would make this
> approach unworkable?
>
> Did I forget to explain anything that is needed to help
> understand the problem and the proposed solution better?

I imagine that I can see what you want to achieve here ;-)

But it's hard since your v5 RFC
https://lkml.kernel.org/r/20190906191237.27006-1-riel@xxxxxxxxxxx is
pretty old by now.

Do you have a version of the patch-set against tip/sched/core? Quite a
lot has changed (runnable load avg replaced by runnable avg, rq->load is
gone, CFS load balance rework).

IIRC, the 'CFS flat runqueue design' has the advantage to reduce the
overhead in taskgroup heavy environments like systemd. And I recall that
v5 doesn't cover CFS bandwidth control yet.

IMHO it would be extremely helpful to have a current patch-set to
discuss how these other problems can be covered by patches on top.