Re: [RFC PATCH v2 3/7] sched/uclamp: Introduce root_cfs_util_uclamp for rq

From: Dietmar Eggemann
Date: Wed Mar 20 2024 - 11:28:22 EST


On 19/03/2024 12:50, Hongyan Xia wrote:
> On 18/03/2024 18:21, Dietmar Eggemann wrote:
>> On 01/02/2024 14:11, Hongyan Xia wrote:
>>
>> [...]
>>
>>>       /*
>>>        * The code below (indirectly) updates schedutil which looks at
>>> @@ -6769,6 +6770,10 @@ enqueue_task_fair(struct rq *rq, struct
>>> task_struct *p, int flags)
>>>   #ifdef CONFIG_UCLAMP_TASK
>>>       util_uclamp_enqueue(&rq->cfs.avg, p);
>>>       update_util_uclamp(0, 0, 0, &rq->cfs.avg, p);
>>> +    if (migrated)
>>
>> IMHO, you don't need 'bool __maybe_unused migrated'. You can use:
>>
>>    if (flags & ENQUEUE_MIGRATED)
>
> I'm not sure if they are entirely equivalent. Both
> task_on_rq_migrating() and !task_on_rq_migrating() can have
> last_update_time == 0 but ENQUEUE_MIGRATED flag is only for the former.
> Maybe I'm missing something?

That's true. There are 2:

(!task_on_rq_migrating() && !p->se.avg.last_update_time)

cases:

(1) wakeup migration: ENQUEUE_MIGRATED (0x40) set in ttwu_do_wakeup()

(2) new task: wake_up_new_task() -> activate_task(), ENQUEUE_MIGRATED is
not set.

I assume you want to add the task's util_avg_uclamp to
rq->root_cfs_util_uclamp in (2) too? So:

if (flags & ENQUEUE_MIGRATED || task_new)

[...]

>>>   /* avg must belong to the queue this se is on. */
>>> -void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
>>> +void update_util_uclamp(u64 now,
>>> +            u64 last_update_time,
>>> +            u32 period_contrib,
>>> +            struct sched_avg *avg,
>>> +            struct task_struct *p)
>>>   {
>>
>> I was wondering why you use such a long parameter list for this
>> function.
>>
>> IMHO
>>
>>    update_util_uclamp(u64 now, struct cfs_rq *cfs_rq, struct
>> sched_entity *se)
>>
>> would work as well. You could check whether se represents a task inside
>> update_util_uclamp() as well as get last_update_time and period_contrib.
>>
>> The only reason I see is that you want to use this function for the RT
>> class as well later, where you have to deal with 'struct rt_rq' and
>> 'struct sched_rt_entity'.
>>
>> IMHO, it's always better to keep the implementation to the minimum and
>> only introduce changes which are related to the functionality you
>> present. This would make reviewing so much easier.
>
> Those parameters are necessary because of
>
> if (___update_load_sum()) {
>     ___update_load_avg();
>     update_util_uclamp();
> }

So you need ___update_load_avg() happening for the `normal uclamp path`
and `last_uptade_time` and `period_contrib` for the `decay path` (1) of
update_util_uclamp()?

This is pretty hard to grasp. Isn't there a cleaner solution for this?

Why do you need the:

if (!avg)
return;

thing in update_util_uclamp()? __update_load_avg_blocked_se() calls
update_util_uclamp(..., avg = NULL, ...) but this should follow (1)?

> We have to cache last_update_time and period_contrib, because after
> ___update_load_sum() is done, both parameters in sched_avg have already
> been updated and overwritten and we lose the timestamp when the
> sched_avg was previously updated. update_util_uclamp() needs to know
> when sched_avg was previously updated.
>
>>
>>>       unsigned int util, uclamp_min, uclamp_max;
>>>       int delta;
>>>   -    if (!p->se.on_rq)
>>> +    if (!p->se.on_rq) {
>>> +        ___update_util_uclamp_towards(now,
>>> +                          last_update_time,
>>> +                          period_contrib,
>>> +                          &p->se.avg.util_avg_uclamp,
>>> +                          0);
>>>           return;
>>> +    }
>>
>> You decay 'p->se.avg.util_avg_uclamp' which is not really related to
>> root_cfs_util_uclamp (patch header). IMHO, this would belong to 2/7.
>
> I would say this still belongs to 3/7, because 2/7 only implements
> utilization for on_rq tasks. This patch implements utilization for both
> on_rq and !on_rq. For rq, we have rq->cfs.avg.util_avg_uclamp (for
> on_rq) and rq->root_cfs_util_uclamp (for on_rq plus !on_rq).

Looks like you maintain `rq->cfs.avg.util_avg_uclamp` (2) (consider all
runnable tasks) to be able to:

(a) set rq->root_cfs_util_uclamp (3) to max((3), (2))

(b) check that if 'rq->cfs.h_nr_running == 0' that (2) = 0 too.

uclamp is based on runnable tasks (so enqueue/dequeue) but you uclamp
around util_avg which has contributions from blocked tasks. And that's
why you have to maintain (3). And (3) only decays within
__update_load_avg_cfs_rq().
Can this be the reason why th implementation is so convoluted? It's
definitely more complicated than util_est with its easy layout:

enqueue_task_fair()
util_est_enqueue()

dequeue_task_fair()
util_est_dequeue()
util_est_update()

> For tasks, we also have two utilization numbers, one is on_rq and the
> other is on_rq plus !on_rq. However, we know they do not have to be
> stored in separate variables and util_avg_uclamp can capture both.

Here you lost me. Which value does 'p->se.avg.util_avg_uclamp' store?
'runnable' or 'runnable + blocking'? I would say it's the latter one but
like in PELT we don't update the task signal when it's sleeping.

> Moving this to 2/7 might be fine, although then this would be the only
> !on_rq utilization in 2/7. I can add comments to clarify the situation.
>
>> This is the util_avg_uclamp handling for a se (task):
>>
>> enqueue_task_fair()
>>
>>    util_uclamp_enqueue()
>>
>>    update_util_uclamp()                 (1)
>>
>>      if (!p->se.on_rq)                  (x)
>>        ___update_util_uclamp_towards()  (2)
>>
>> dequeue_task_fair()
>>
>>    util_uclamp_dequeue()
>>
>> __update_load_avg_blocked_se()
>>
>>    update_util_uclamp()
>>
>>      (x)
>>
>> __update_load_avg_se()
>>
>>    update_util_uclamp()
>>
>>      (x)
>>
>> Why is it so unbalanced? Why do you need (1) and (2)?
>>
>> Isn't this just an indication that the se util_avg_uclamp
>> is done at the wrong places?
>>
>> Is there an other way to provide a task/rq signal as the base
>> for uclamp sum aggregation?
>
> (2) won't happen, because at that point p->se.on_rq must be 1.

I see.

> The sequence during enqueue_task_fair() is:
>
> enqueue_task_fair(p)
>     enqueue_entity(se)
>         update_load_avg(se)
>             update_util_uclamp(p) (decay path)
>         p->se.on_rq = 1;
>     util_uclamp_enqueue(p)
>     update_util_uclamp(p) (update path)
>
> The only reason why we want to issue update_util_uclamp() after seeing
> on_rq == 1 is that now it goes down the normal uclamp path and not the
> decay path. Otherwise, uclamp won't immediately engage during enqueue
> and has to wait a timer tick.

OK, I see, the task contribution should be visible immediately after the
enqueue.

> Ideally, we should:
>
> enqueue_task_fair(p)
>     enqueue_entity(se)
>         update_load_avg(se)   
>             util_uclamp_enqueue(p)
>             update_util_uclamp(p) (force update path)
>         p->se.on_rq = 1;
>
> This requires us to invent a new flag to update_util_uclamp() to force
> the update path even when p->se.on_rq is 0.

I guess you have to go this way to achieve a cleaner/easier integration
of 'util_avg_uclamp'.

> At the moment I'm treating util_avg_uclamp in the same way as util_est
> after the comments in v1, which is independent and has its own code
> path. We can go back to the old style, where util_avg_uclamp is closer
> to how we treat se rather than a separate thing like util_est.

Except that 'util_est' integration is much easier to understand. And
this is because of 'util_est' is clear runnable based only and is not
linked to any blocked part.

[...]