Re: [PATCH] sched: cgroup SCHED_IDLE support

From: Josh Don
Date: Tue Jun 15 2021 - 19:31:00 EST


On Tue, Jun 15, 2021 at 3:07 AM Dietmar Eggemann
<dietmar.eggemann@xxxxxxx> wrote:
>
> On 12/06/2021 01:34, Josh Don wrote:
> > On Fri, Jun 11, 2021 at 9:43 AM Dietmar Eggemann
> > <dietmar.eggemann@xxxxxxx> wrote:
> >>
> >> On 10/06/2021 21:14, Josh Don wrote:

[snip]

> > Setting cpu.idle carries additional properties in addition to just the
> > weight. Currently, it primarily includes (a) special wakeup preemption
> > handling, and (b) contribution to idle_h_nr_running for the purpose of
> > marking a cpu as a sched_idle_cpu(). Essentially, the current
> > SCHED_IDLE mechanics. I've also discussed with Peter a potential
> > extension to SCHED_IDLE to manipulate vruntime.
>
> Right, I forgot about (b).
>
> But IMHO, (a) could be handled with this special tg->shares value for
> SCHED_IDLE.
>
> If there would be a way to open up `cpu.weight`, `cpu.weight.nice` (and
> `cpu,shares` for v1) to take a special value for SCHED_IDLE, then you
> won't need cpu.idle.
> And you could handle the functionality from sched_group_set_idle()
> directly in sched_group_set_shares().
> In this case sched_group_set_shares() wouldn't have to be rejected on an
> idle tg.
> A tg would just become !idle by writing a different cpu.weight value.
> Currently, if you !idle a tg it gets the default NICE_0_LOAD.
>
> I guess cpu.weight [1, 10000] would be easy, 0 could be taken for that
> and mapped into weight = WEIGHT_IDLEPRIO (3, 3072) to call
> sched_group_set_shares(..., scale_load(weight).
> cpu.weight = 1 maps to (10, 10240)
>
> cpu.weight.nice [-20, 19] would be already more complicated, 20?
>
> And for cpu.shares [2, 2 << 18] 0 could be used. The issue here is that
> WEIGHT_IDLEPRIO (3, 3072) is a valid value already for shares.
>
> > We set the cgroup weight here, since by definition SCHED_IDLE entities
> > have the least scheduling weight. From the perspective of your
> > question, the analogous statement for tasks would be that we set task
> > weight to the min when doing setsched(SCHED_IDLE), even though we
> > already have a renice mechanism.
>
> I agree. `cpu.idle = 1` is like setting the task policy to SCHED_IDLE.
> And there is even the `cpu.weight.nice` to support the `task - tg`
> analogy on nice values.
>
> I'm just wondering if integrating this into `cpu.weight` and friends
> would be better to make the code behind this easier to grasp.

Might be confusing that we manipulate cgroup SCHED_IDLE via special
weight/weight.nice values, whereas task SCHED_IDLE is manipulated via
setsched (and we would allow nice 20 for cgroups but not for tasks,
for example).

More importantly, I think it would be better to avoid unintentional
side effects caused by overloading the weight interfaces. Thoughts on
that?

Another (less compelling) reason for the separate interface is that it
allows further extension. For example, negative values in cpu.idle
could indicate increasingly more latency-sensitive cgroups, which
could benefit from better wakeup preemption, entity placement, etc.