Re: [RFC PATCH v2 3/7] sched/uclamp: Introduce root_cfs_util_uclamp for rq

From: Hongyan Xia
Date: Wed Mar 20 2024 - 13:23:21 EST


On 20/03/2024 15:27, Dietmar Eggemann wrote:
On 19/03/2024 12:50, Hongyan Xia wrote:
On 18/03/2024 18:21, Dietmar Eggemann wrote:
On 01/02/2024 14:11, Hongyan Xia wrote:

[...]

      /*
       * The code below (indirectly) updates schedutil which looks at
@@ -6769,6 +6770,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
  #ifdef CONFIG_UCLAMP_TASK
      util_uclamp_enqueue(&rq->cfs.avg, p);
      update_util_uclamp(0, 0, 0, &rq->cfs.avg, p);
+    if (migrated)

IMHO, you don't need 'bool __maybe_unused migrated'. You can use:

   if (flags & ENQUEUE_MIGRATED)

I'm not sure if they are entirely equivalent. Both
task_on_rq_migrating() and !task_on_rq_migrating() can have
last_update_time == 0 but ENQUEUE_MIGRATED flag is only for the former.
Maybe I'm missing something?

That's true. There are 2 cases in which

(!task_on_rq_migrating() && !p->se.avg.last_update_time)

holds:

(1) wakeup migration: ENQUEUE_MIGRATED (0x40) set in ttwu_do_activate()

(2) new task: wake_up_new_task() -> activate_task(), ENQUEUE_MIGRATED is
not set.

I assume you want to add the task's util_avg_uclamp to
rq->root_cfs_util_uclamp in (2) too? So:

if (flags & ENQUEUE_MIGRATED || task_new)

I see. Maybe we don't need to check last_update_time. I'll take a look.

Although, if we want to integrate sum aggregation with the se (as in your comments below) rather than making it fully independent like util_est, we'll probably just do that in

update_load_avg()
if (!se->avg.last_update_time && (flags & DO_ATTACH)) { ... }

and not bother doing it outside the loops in enqueue_task_fair() anyway.
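(For illustration, a rough sketch of that option, assuming the se-integrated variant; attach_entity_util_uclamp() is a hypothetical helper name, not something from this series:)

static inline void update_load_avg(struct cfs_rq *cfs_rq,
                                   struct sched_entity *se, int flags)
{
    /* ... existing PELT update of se and cfs_rq ... */

    if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
        attach_entity_load_avg(cfs_rq, se);
        update_tg_load_avg(cfs_rq);

        /*
         * Hypothetical: add the freshly attached task's
         * util_avg_uclamp to rq->root_cfs_util_uclamp here, which
         * covers both wakeup migration and new tasks without any
         * extra check in enqueue_task_fair().
         */
        if (entity_is_task(se))
            attach_entity_util_uclamp(cfs_rq, se);
    }
}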

[...]

  /* avg must belong to the queue this se is on. */
-void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
+void update_util_uclamp(u64 now,
+            u64 last_update_time,
+            u32 period_contrib,
+            struct sched_avg *avg,
+            struct task_struct *p)
  {

I was wondering why you use such a long parameter list for this
function.

IMHO

   update_util_uclamp(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)

would work as well. You could check whether se represents a task inside
update_util_uclamp() as well as get last_update_time and period_contrib.

The only reason I see is that you want to use this function for the RT
class as well later, where you have to deal with 'struct rt_rq' and
'struct sched_rt_entity'.

IMHO, it's always better to keep the implementation to the minimum and
only introduce changes which are related to the functionality you
present. This would make reviewing so much easier.

Those parameters are necessary because of

if (___update_load_sum()) {
    ___update_load_avg();
    update_util_uclamp();
}

So you need ___update_load_avg() happening for the `normal uclamp path`
and `last_update_time` and `period_contrib` for the `decay path` (1) of
update_util_uclamp()?

This is pretty hard to grasp. Isn't there a cleaner solution for this?

You are correct. Not sure if there's a better way.

Why do you need the:

if (!avg)
return;

thing in update_util_uclamp()? __update_load_avg_blocked_se() calls
update_util_uclamp(..., avg = NULL, ...) but this should follow (1)?

I added it as a guard to rule out edge cases, but I think when you do __update_load_avg_blocked_se(), it has to be !on_rq so it will never enter this path. I think it doesn't hurt but I can remove it.

We have to cache last_update_time and period_contrib, because after
___update_load_sum() is done, both parameters in sched_avg have already
been updated and overwritten and we lose the timestamp when the
sched_avg was previously updated. update_util_uclamp() needs to know
when sched_avg was previously updated.
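(Roughly, on the caller side; a simplified sketch rather than the exact patch, and passing &cfs_rq->avg to update_util_uclamp() is assumed here based on the "avg must belong to the queue this se is on" rule:)

int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    /* Cache the pre-update timestamps before ___update_load_sum() overwrites them. */
    u64 last_update_time = se->avg.last_update_time;
    u32 period_contrib = se->avg.period_contrib;

    if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
                           cfs_rq->curr == se)) {
        ___update_load_avg(&se->avg, se_weight(se));
        if (entity_is_task(se))
            update_util_uclamp(now, last_update_time, period_contrib,
                               &cfs_rq->avg, task_of(se));
        return 1;
    }

    return 0;
}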


      unsigned int util, uclamp_min, uclamp_max;
      int delta;
-    if (!p->se.on_rq)
+    if (!p->se.on_rq) {
+        ___update_util_uclamp_towards(now,
+                          last_update_time,
+                          period_contrib,
+                          &p->se.avg.util_avg_uclamp,
+                          0);
          return;
+    }

You decay 'p->se.avg.util_avg_uclamp' which is not really related to
root_cfs_util_uclamp (patch header). IMHO, this would belong to 2/7.

I would say this still belongs to 3/7, because 2/7 only implements
utilization for on_rq tasks. This patch implements utilization for both
on_rq and !on_rq. For rq, we have rq->cfs.avg.util_avg_uclamp (for
on_rq) and rq->root_cfs_util_uclamp (for on_rq plus !on_rq).

Looks like you maintain `rq->cfs.avg.util_avg_uclamp` (2) (consider all
runnable tasks) to be able to:

(a) set rq->root_cfs_util_uclamp (3) to max((3), (2))

(b) check that if 'rq->cfs.h_nr_running == 0' then (2) == 0 too.
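(As an illustration of (a) and (b); update_root_cfs_util_uclamp() is just a hypothetical name for wherever this is done:)

static void update_root_cfs_util_uclamp(struct rq *rq)
{
    unsigned int runnable_sum = READ_ONCE(rq->cfs.avg.util_avg_uclamp);

    /* (a) the root (runnable + blocked) signal never drops below (2) */
    rq->root_cfs_util_uclamp = max(rq->root_cfs_util_uclamp, runnable_sum);

    /* (b) with no runnable CFS tasks, the runnable sum must be zero */
    SCHED_WARN_ON(!rq->cfs.h_nr_running && runnable_sum);
}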

uclamp is based on runnable tasks (so enqueue/dequeue) but you uclamp
around util_avg which has contributions from blocked tasks. And that's
why you have to maintain (3). And (3) only decays within
__update_load_avg_cfs_rq().
Can this be the reason why the implementation is so convoluted? It's
definitely more complicated than util_est with its easy layout:

enqueue_task_fair()
   util_est_enqueue()

dequeue_task_fair()
   util_est_dequeue()
   util_est_update()

There's only one rule here, which is that the root value must be the sum of all task util_avg_uclamp values. If PELT slowly decays each util_avg, then the sum will also decay in a similar fashion. The code here is just doing that.

This 'convoluted' property is from PELT and not from sum aggregation. Actually this is what PELT used to do a while back, when the root CFS util was the sum of all tasks instead of being tracked separately, and we do the same decay here as we did back then.

You are right that this feels convoluted, but sum aggregation doesn't attempt to change how PELT works so the decaying property is there due to PELT. I can keep exploring new ways to make the logic easier to follow.
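(The decay argument in one line: if every task signal u_i decays by the same PELT factor y per 1024us period, the sum decays identically, so the root can be decayed in place instead of being re-summed on every update:

   root' = sum_i (y * u_i) = y * sum_i u_i = y * root)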

For tasks, we also have two utilization numbers, one for on_rq only and
the other for on_rq plus !on_rq. However, they do not have to be stored
in separate variables; util_avg_uclamp can capture both.

Here you lost me. Which value does 'p->se.avg.util_avg_uclamp' store?
'runnable' or 'runnable + blocking'? I would say it's the latter one but
like in PELT we don't update the task signal when it's sleeping.

The latter. We don't update the task signal when it's sleeping but we do when we need to use it, and that's enough. This is also the case for all blocked util_avg.
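(This is the same catch-up-on-use pattern mainline PELT already relies on for blocked entities, e.g. before migrating a sleeping task; sketched below, assuming the series makes __update_load_avg_blocked_se() also decay util_avg_uclamp, as discussed above:)

static void sync_entity_load_avg(struct sched_entity *se)
{
    struct cfs_rq *cfs_rq = cfs_rq_of(se);
    u64 last_update_time = cfs_rq_last_update_time(cfs_rq);

    /* Catch the sleeping task's signal up only when it is actually needed. */
    __update_load_avg_blocked_se(last_update_time, se);
}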

Moving this to 2/7 might be fine, although then this would be the only
!on_rq utilization in 2/7. I can add comments to clarify the situation.

This is the util_avg_uclamp handling for a se (task):

enqueue_task_fair()

   util_uclamp_enqueue()

   update_util_uclamp()                 (1)

     if (!p->se.on_rq)                  (x)
       ___update_util_uclamp_towards()  (2)

dequeue_task_fair()

   util_uclamp_dequeue()

__update_load_avg_blocked_se()

   update_util_uclamp()

     (x)

__update_load_avg_se()

   update_util_uclamp()

     (x)

Why is it so unbalanced? Why do you need (1) and (2)?

Isn't this just an indication that the se util_avg_uclamp handling
is done in the wrong places?

Is there another way to provide a task/rq signal as the base
for uclamp sum aggregation?

(2) won't happen, because at that point p->se.on_rq must be 1.

I see.

The sequence during enqueue_task_fair() is:

enqueue_task_fair(p)
    enqueue_entity(se)
        update_load_avg(se)
            update_util_uclamp(p) (decay path)
        p->se.on_rq = 1;
    util_uclamp_enqueue(p)
    update_util_uclamp(p) (update path)

The only reason why we want to issue update_util_uclamp() after seeing
on_rq == 1 is that now it goes down the normal uclamp path and not the
decay path. Otherwise, uclamp won't immediately engage during enqueue
and would have to wait for a timer tick.

OK, I see, the task contribution should be visible immediately after the
enqueue.

Ideally, we should:

enqueue_task_fair(p)
    enqueue_entity(se)
        update_load_avg(se)
            util_uclamp_enqueue(p)
            update_util_uclamp(p) (force update path)
        p->se.on_rq = 1;

This requires us to invent a new flag to update_util_uclamp() to force
the update path even when p->se.on_rq is 0.
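(A hypothetical sketch of such a flag; the name and the extra parameter are made up for illustration, not part of this series:)

#define UTIL_UCLAMP_FORCE    0x1    /* hypothetical: take the update path even if !on_rq */

void update_util_uclamp(u64 now, u64 last_update_time, u32 period_contrib,
                        struct sched_avg *avg, struct task_struct *p, int flags)
{
    if (!p->se.on_rq && !(flags & UTIL_UCLAMP_FORCE)) {
        /* decay path, as today */
        ___update_util_uclamp_towards(now, last_update_time, period_contrib,
                                      &p->se.avg.util_avg_uclamp, 0);
        return;
    }

    /*
     * Normal uclamp path (clamping util_avg into [uclamp_min, uclamp_max]),
     * now also reachable from enqueue_task_fair() before p->se.on_rq is set.
     */
}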

I guess you have to go this way to achieve a cleaner/easier integration
of 'util_avg_uclamp'.

If we don't want to keep util_avg_uclamp separately like util_est but move it closer to se, then we can explore this option.

At the moment I'm treating util_avg_uclamp in the same way as util_est
after the comments in v1, which is independent and has its own code
path. We can go back to the old style, where util_avg_uclamp is closer
to how we treat se rather than a separate thing like util_est.

Except that 'util_est' integration is much easier to understand. And
this is because 'util_est' is clearly runnable-based only and is not
linked to any blocked part.

True.

Well, I can go back to the style in RFC v1. One big advantage of v2 is that we can compile out the support of uclamp very easily because it's treated more or less like an independent thing.

[...]