Re: [PATCH] sched: cgroup SCHED_IDLE support

From: Tejun Heo
Date: Wed Jun 16 2021 - 11:48:21 EST


Hello,

On Tue, Jun 08, 2021 at 04:11:32PM -0700, Josh Don wrote:
> This extends SCHED_IDLE to cgroups.
>
> Interface: cgroup/cpu.idle.
> 0: default behavior
> 1: SCHED_IDLE
>
> Extending SCHED_IDLE to cgroups means that we incorporate the existing
> aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its
> descendant threads towards the idle_h_nr_running count of all of its
> ancestor cgroups. Thus, sched_idle_rq() will work properly.
> Additionally, SCHED_IDLE cgroups are configured with minimum weight.
>
> There are two key differences between the per-task and per-cgroup
> SCHED_IDLE interface:
>
> - The cgroup interface allows tasks within a SCHED_IDLE hierarchy to
> maintain their relative weights. The entity that is "idle" is the
> cgroup, not the tasks themselves.
>
> - Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption
> decision is not made by comparing the current task with the woken task,
> but rather by comparing their matching sched_entity.
>
> A typical use-case for this is a user that creates an idle and a
> non-idle subtree. The non-idle subtree will dominate competition vs
> the idle subtree, but the idle subtree will still be high priority
> vs other users on the system. The latter is accomplished via comparing
> matching sched_entity in the wakeup preemption path (this could also be
> improved by making the sched_idle_rq() decision dependent on the
> perspective of a specific task).
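
For reference, the knob described above would presumably be exercised along
these lines (only the cpu.idle file name and its 0/1 values come from the
patch description; the cgroup paths and subtree names are illustrative):

```shell
# Sketch, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup and
# the proposed per-cgroup cpu.idle knob; paths are made up for the example.

# Create an idle and a non-idle subtree under a user's slice.
mkdir -p /sys/fs/cgroup/user.slice/batch
mkdir -p /sys/fs/cgroup/user.slice/interactive

# Mark the batch subtree SCHED_IDLE (1); interactive keeps the default
# behavior (0), so it dominates competition within the slice while the
# slice as a whole retains its priority vs other users.
echo 1 > /sys/fs/cgroup/user.slice/batch/cpu.idle

cat /sys/fs/cgroup/user.slice/batch/cpu.idle
cat /sys/fs/cgroup/user.slice/interactive/cpu.idle
```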

A high-level problem that I see with the proposal is that this would bake
the current recursive implementation into the interface. The semantics of
the currently exposed interface, at least the weight-based part, are
abstract and don't necessarily dictate how the scheduling is actually
performed. Adding this would mean that we're now codifying the current
behavior of fully nested scheduling into the interface.

There are several practical challenges with the current implementation
caused by the full nesting - e.g. nesting levels are expensive for
context-switch-heavy applications, often costing over 1% per level, and
heuristics which assume a global queue may behave unexpectedly - i.e. we
can create conditions where things like idle-wakeup boost behave very
differently depending on whether tasks are inside a cgroup or not, even
when the eventual relative weights and past usages are similar.

Can you please give more details on why this is beneficial? Is the benefit
mostly around making configuration easy or are there actual scheduling
behaviors that you can't achieve otherwise?

Thanks.

--
tejun