Re: [RFC PATCH v2 3/7] sched/uclamp: Introduce root_cfs_util_uclamp for rq

From: Hongyan Xia
Date: Tue Mar 19 2024 - 07:50:29 EST


On 18/03/2024 18:21, Dietmar Eggemann wrote:
On 01/02/2024 14:11, Hongyan Xia wrote:

[...]

/*
* The code below (indirectly) updates schedutil which looks at
@@ -6769,6 +6770,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
#ifdef CONFIG_UCLAMP_TASK
util_uclamp_enqueue(&rq->cfs.avg, p);
update_util_uclamp(0, 0, 0, &rq->cfs.avg, p);
+ if (migrated)

IMHO, you don't need 'bool __maybe_unused migrated'. You can use:

if (flags & ENQUEUE_MIGRATED)

I'm not sure if they are entirely equivalent. Both task_on_rq_migrating() and !task_on_rq_migrating() can have last_update_time == 0 but ENQUEUE_MIGRATED flag is only for the former. Maybe I'm missing something?

+ rq->root_cfs_util_uclamp += p->se.avg.util_avg_uclamp;
+ rq->root_cfs_util_uclamp = max(rq->root_cfs_util_uclamp,
+ rq->cfs.avg.util_avg_uclamp);
/* TODO: Better skip the frequency update in the for loop above. */
cpufreq_update_util(rq, 0);
#endif
@@ -8252,6 +8257,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
migrate_se_pelt_lag(se);
}
+ remove_root_cfs_util_uclamp(p);

You can't always do this here. In the '!task_on_rq_migrating()' case we
don't hold the 'old' rq->lock.

Have a look into remove_entity_load_avg() for what we do for PELT in
this case.

And:

144d8487bc6e ("sched/fair: Implement synchonous PELT detach on load-balance migrate")
e1f078f50478 ("sched/fair: Combine detach into dequeue when migrating task")

@@ -3081,6 +3081,8 @@ static inline void remove_root_cfs_util_uclamp(struct task_struct *p)
unsigned int root_util = READ_ONCE(rq->root_cfs_util_uclamp);
unsigned int p_util = READ_ONCE(p->se.avg.util_avg_uclamp), new_util;
+ lockdep_assert_rq_held(task_rq(p));
+
new_util = (root_util > p_util) ? root_util - p_util : 0;
new_util = max(new_util, READ_ONCE(rq->cfs.avg.util_avg_uclamp));
WRITE_ONCE(rq->root_cfs_util_uclamp, new_util);

Ack. I saw the removed_* functions. I will change into that style.

[...]

/* avg must belong to the queue this se is on. */
-void update_util_uclamp(struct sched_avg *avg, struct task_struct *p)
+void update_util_uclamp(u64 now,
+ u64 last_update_time,
+ u32 period_contrib,
+ struct sched_avg *avg,
+ struct task_struct *p)
{

I was wondering why you use such a long parameter list for this
function.

IMHO

update_util_uclamp(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)

would work as well. You could check whether se represents a task inside
update_util_uclamp() as well as get last_update_time and period_contrib.

The only reason I see is that you want to use this function for the RT
class as well later, where you have to deal with 'struct rt_rq' and
'struct sched_rt_entity'.

IMHO, it's always better to keep the implementation to the minimum and
only introduce changes which are related to the functionality you
present. This would make reviewing so much easier.

Those parameters are necessary because of

if (___update_load_sum()) {
___update_load_avg();
update_util_uclamp();
}

We have to cache last_update_time and period_contrib, because after ___update_load_sum() is done, both parameters in sched_avg have already been updated and overwritten and we lose the timestamp when the sched_avg was previously updated. update_util_uclamp() needs to know when sched_avg was previously updated.


unsigned int util, uclamp_min, uclamp_max;
int delta;
- if (!p->se.on_rq)
+ if (!p->se.on_rq) {
+ ___update_util_uclamp_towards(now,
+ last_update_time,
+ period_contrib,
+ &p->se.avg.util_avg_uclamp,
+ 0);
return;
+ }

You decay 'p->se.avg.util_avg_uclamp' which is not really related to
root_cfs_util_uclamp (patch header). IMHO, this would belong to 2/7.

I would say this still belongs to 3/7, because 2/7 only implements utilization for on_rq tasks. This patch implements utilization for both on_rq and !on_rq. For rq, we have rq->cfs.avg.util_avg_uclamp (for on_rq) and rq->root_cfs_util_uclamp (for on_rq plus !on_rq).

For tasks, we also have two utilization numbers, one is on_rq and the other is on_rq plus !on_rq. However, we know they do not have to be stored in separate variables and util_avg_uclamp can capture both.

Moving this to 2/7 might be fine, although then this would be the only !on_rq utilization in 2/7. I can add comments to clarify the situation.

This is the util_avg_uclamp handling for a se (task):

enqueue_task_fair()

util_uclamp_enqueue()

update_util_uclamp() (1)

if (!p->se.on_rq) (x)
___update_util_uclamp_towards() (2)

dequeue_task_fair()

util_uclamp_dequeue()

__update_load_avg_blocked_se()

update_util_uclamp()

(x)

__update_load_avg_se()

update_util_uclamp()

(x)

Why is it so unbalanced? Why do you need (1) and (2)?

Isn't this just an indication that the se util_avg_uclamp
is done at the wrong places?

Is there an other way to provide a task/rq signal as the base
for uclamp sum aggregation?

(2) won't happen, because at that point p->se.on_rq must be 1.

The sequence during enqueue_task_fair() is:

enqueue_task_fair(p)
enqueue_entity(se)
update_load_avg(se)
update_util_uclamp(p) (decay path)
p->se.on_rq = 1;
util_uclamp_enqueue(p)
update_util_uclamp(p) (update path)

The only reason why we want to issue update_util_uclamp() after seeing on_rq == 1 is that now it goes down the normal uclamp path and not the decay path. Otherwise, uclamp won't immediately engage during enqueue and has to wait a timer tick.

Ideally, we should:

enqueue_task_fair(p)
enqueue_entity(se)
update_load_avg(se)
util_uclamp_enqueue(p)
update_util_uclamp(p) (force update path)
p->se.on_rq = 1;

This requires us to invent a new flag to update_util_uclamp() to force the update path even when p->se.on_rq is 0.

At the moment I'm treating util_avg_uclamp in the same way as util_est after the comments in v1, which is independent and has its own code path. We can go back to the old style, where util_avg_uclamp is closer to how we treat se rather than a separate thing like util_est.

[...]