Re: [PATCH] sched: cgroup SCHED_IDLE support

From: Dietmar Eggemann
Date: Tue Jun 15 2021 - 06:07:50 EST


On 12/06/2021 01:34, Josh Don wrote:
> On Fri, Jun 11, 2021 at 9:43 AM Dietmar Eggemann
> <dietmar.eggemann@xxxxxxx> wrote:
>>
>> On 10/06/2021 21:14, Josh Don wrote:
>>> Hey Dietmar,
>>>
>>> On Thu, Jun 10, 2021 at 5:53 AM Dietmar Eggemann
>>> <dietmar.eggemann@xxxxxxx> wrote:
>>>>
>>>> Any reason why this should only work on cgroup-v2?
>>>
>>> My (perhaps incorrect) assumption that new development should not
>>> extend v1. I'd actually prefer making this work on v1 as well; I'll
>>> add that support.
>>>
>>>> struct cftype cpu_legacy_files[] vs. cpu_files[]
>>>>
>>>> [...]
>>>>
>>>>> @@ -11340,10 +11408,14 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
>>>>>
>>>>> static DEFINE_MUTEX(shares_mutex);
>>>>>
>>>>> -int sched_group_set_shares(struct task_group *tg, unsigned long shares)
>>>>> +#define IDLE_WEIGHT sched_prio_to_weight[ARRAY_SIZE(sched_prio_to_weight) - 1]
>>>>
>>>> Why not 3 ? Like for tasks (WEIGHT_IDLEPRIO)?
>>>>
>>>> [...]
>>>
>>> Went back and forth on this; on second look, I do think it makes sense
>>> to use the IDLEPRIO weight of 3 here. This gets converted to a 0,
>>> rather than a 1 for display of cpu.weight, which is also actually a
>>> nice property.
>>
>> I'm struggling to see the benefit here.
>>
>> For a taskgroup A: Why setting A/cpu.idle=1 to force a minimum A->shares
>> when you can set it directly via A/cpu.weight (to 1 (minimum))?
>>
>>   WEIGHT   cpu.weight   tg->shares
>>
>>   3        0            3072
>>   15       1            15360
>>            1            10240
>>
>> `A/cpu.weight` follows cgroup-v2's `weights` `resource distribution
>> model`* but I can only see `A/cpu.idle` as a layer on top of it forcing
>> `A/cpu.weight` to get its minimum value?
>>
>> *Documentation/admin-guide/cgroup-v2.rst
>
> Setting cpu.idle carries additional properties in addition to just the
> weight. Currently, it primarily includes (a) special wakeup preemption
> handling, and (b) contribution to idle_h_nr_running for the purpose of
> marking a cpu as a sched_idle_cpu(). Essentially, the current
> SCHED_IDLE mechanics. I've also discussed with Peter a potential
> extension to SCHED_IDLE to manipulate vruntime.

Right, I forgot about (b).

But IMHO, (a) could be handled with this special tg->shares value for
SCHED_IDLE.

If there were a way to open up `cpu.weight`, `cpu.weight.nice` (and
`cpu.shares` for v1) to take a special value for SCHED_IDLE, then you
wouldn't need cpu.idle.
And you could handle the functionality from sched_group_set_idle()
directly in sched_group_set_shares().
In this case sched_group_set_shares() wouldn't have to be rejected on an
idle tg.
A tg would just become !idle by writing a different cpu.weight value.
Currently, if you !idle a tg it gets the default NICE_0_LOAD.
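
To make the idea concrete, here is a rough user-space model of what I
mean, not kernel code. `struct tg_model`, `tg_model_set_shares()` and the
`TG_SHARES_IDLE` sentinel are made-up names for illustration only; the
clamp bounds assume scale_load(MIN_SHARES) and scale_load(MAX_SHARES)
with a 10-bit shift, as on 64-bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values: shares are stored scaled by 10 bits, as on 64-bit. */
#define NICE_0_SHARES	(1024UL << 10)
#define IDLE_SHARES	(3UL << 10)		/* WEIGHT_IDLEPRIO, scaled */
#define MIN_SHARES_SC	((1UL << 1) << 10)	/* scale_load(MIN_SHARES) */
#define MAX_SHARES_SC	((1UL << 18) << 10)	/* scale_load(MAX_SHARES) */

/* Hypothetical sentinel the interface files would pass down for "idle". */
#define TG_SHARES_IDLE	0UL

/* Toy stand-in for the two task_group fields that matter here. */
struct tg_model {
	unsigned long shares;
	bool idle;
};

/*
 * Single setter: the sentinel makes the group idle, any other value
 * makes it !idle again and is clamped like a regular shares write.
 */
static void tg_model_set_shares(struct tg_model *tg, unsigned long shares)
{
	if (shares == TG_SHARES_IDLE) {
		tg->idle = true;
		tg->shares = IDLE_SHARES;
		return;
	}

	if (shares < MIN_SHARES_SC)
		shares = MIN_SHARES_SC;
	if (shares > MAX_SHARES_SC)
		shares = MAX_SHARES_SC;

	tg->idle = false;
	tg->shares = shares;
}
```

Note that writing 3072 explicitly leaves the group !idle in this model,
which is why the idle state has to come from a sentinel and cannot be
derived from the shares value alone.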


I guess cpu.weight [1, 10000] would be easy: 0 could be taken for that
and mapped to weight = WEIGHT_IDLEPRIO (3, 3072) to call
sched_group_set_shares(..., scale_load(weight)).
For comparison, cpu.weight = 1 maps to (10, 10240).
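
The arithmetic behind these numbers can be sketched in user space as
below. This is only a model: it assumes CGROUP_WEIGHT_DFL = 100, a 10-bit
scale_load() shift (64-bit) and round-to-closest division, and
`CPU_WEIGHT_IDLE` plus the helper names are hypothetical.

```c
#include <assert.h>

/* Assumed constants, modelled on the kernel for a 64-bit build. */
#define CGROUP_WEIGHT_DFL	100UL
#define CGROUP_WEIGHT_MAX	10000UL
#define WEIGHT_IDLEPRIO		3UL
#define LOAD_SHIFT		10	/* scale_load() / scale_load_down() */

/* Hypothetical: cpu.weight == 0 would request SCHED_IDLE behaviour. */
#define CPU_WEIGHT_IDLE		0UL

/* Unsigned division rounded to the closest integer. */
static unsigned long div_round_closest(unsigned long x, unsigned long d)
{
	return (x + d / 2) / d;
}

/* Write direction: cpu.weight -> tg->shares. */
static unsigned long cpu_weight_to_shares(unsigned long cpu_weight)
{
	unsigned long weight;

	assert(cpu_weight <= CGROUP_WEIGHT_MAX);

	if (cpu_weight == CPU_WEIGHT_IDLE)
		weight = WEIGHT_IDLEPRIO;
	else
		weight = div_round_closest(cpu_weight * 1024,
					   CGROUP_WEIGHT_DFL);

	return weight << LOAD_SHIFT;
}

/* Read direction: tg->shares -> displayed cpu.weight. */
static unsigned long shares_to_cpu_weight(unsigned long shares)
{
	return div_round_closest((shares >> LOAD_SHIFT) * CGROUP_WEIGHT_DFL,
				 1024);
}
```

With these assumptions, 0 gives shares 3072, which reads back as 0, while
1 gives shares 10240, which reads back as 1. So the special value would
round-trip without colliding with the regular minimum.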

cpu.weight.nice [-20, 19] would already be more complicated; maybe 20?
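
As a sketch of the write direction only, again in user space: the table
is a copy of sched_prio_to_weight[], `NICE_IDLE` = 20 is the hypothetical
extra value, and `nice_to_shares()` is a made-up helper. Mapping shares
back to a displayed nice value is not covered here.

```c
#include <assert.h>

#define MIN_NICE		(-20)
#define NICE_IDLE		20	/* hypothetical value, one past nice 19 */
#define WEIGHT_IDLEPRIO		3UL
#define LOAD_SHIFT		10	/* assumed scale_load() shift (64-bit) */

/* Copy of the kernel's nice-to-weight table, nice -20 .. 19. */
static const unsigned long prio_to_weight[40] = {
	88761, 71755, 56483, 46273, 36291,
	29154, 23254, 18705, 14949, 11916,
	 9548,  7620,  6100,  4904,  3906,
	 3121,  2501,  1991,  1586,  1277,
	 1024,   820,   655,   526,   423,
	  335,   272,   215,   172,   137,
	  110,    87,    70,    56,    45,
	   36,    29,    23,    18,    15,
};

/* cpu.weight.nice -> tg->shares, with NICE_IDLE selecting the idle weight. */
static unsigned long nice_to_shares(int nice)
{
	unsigned long weight;

	assert(nice >= MIN_NICE && nice <= NICE_IDLE);

	if (nice == NICE_IDLE)
		weight = WEIGHT_IDLEPRIO;
	else
		weight = prio_to_weight[nice - MIN_NICE];

	return weight << LOAD_SHIFT;
}
```

Nice 19 yields 15360 and the extra value 20 yields 3072, so the idle case
sits just below the existing range.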

And for cpu.shares [2, 1 << 18], 0 could be used. The issue here is that
WEIGHT_IDLEPRIO (3, 3072) is already a valid value for shares.

> We set the cgroup weight here, since by definition SCHED_IDLE entities
> have the least scheduling weight. From the perspective of your
> question, the analogous statement for tasks would be that we set task
> weight to the min when doing setsched(SCHED_IDLE), even though we
> already have a renice mechanism.

I agree. `cpu.idle = 1` is like setting the task policy to SCHED_IDLE.
And there is even the `cpu.weight.nice` to support the `task - tg`
analogy on nice values.

I'm just wondering if integrating this into `cpu.weight` and friends
would be better to make the code behind this easier to grasp.