Re: [PATCH v2] sched/uclamp: Align uclamp and util_est and call before freq update

From: K Prateek Nayak
Date: Wed Mar 26 2025 - 00:37:41 EST


Hello Xuewen,

On 3/26/2025 8:27 AM, Xuewen Yan wrote:
Hi Prateek,

On Wed, Mar 26, 2025 at 12:54 AM K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:

Hello Xuewen,

On 3/25/2025 7:17 AM, Xuewen Yan wrote:
When task's uclamp is set, we hope that the CPU frequency
can increase as quickly as possible when the task is enqueued.
Because the cpu frequency updating happens during the enqueue_task(),
so the rq's uclamp needs to be updated before the task is enqueued,
just like util_est.

I thought the frequency ramp up / ramp down was a problem with
delayed tasks being requeued.

So, aline the uclamp and util_est and call before freq update.

For sched-delayed tasks, the rq uclamp/util_est should only be updated
when they are enqueued upon being awakened.
So simply the logic of util_est's enqueue/dequeue check.

Signed-off-by: Xuewen Yan <xuewen.yan@xxxxxxxxxx>

[..snip..]

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c798d2795243..c92fee07fb7b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6930,7 +6930,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
* Let's add the task's estimated utilization to the cfs_rq's
* estimated utilization, before we update schedutil.
*/
- if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & ENQUEUE_RESTORE))))
+ if (!p->se.sched_delayed || (flags & ENQUEUE_DELAYED))
util_est_enqueue(&rq->cfs, p);

Wouldn't this do a util_est_{dequeue,enqueue}() for a save restore
operation too of a non-delayed task? Is that desired?

For delayed-task, its util_est should dequeue/enqueue only for its
sleeping and waking up,
For the save restore operation, there is no need to enqueue it,
because it is not woken up.
So the condition of enqueue actually is:
if (!p->se.sched_delayed || (p->se.sched_delayed && (flags & ENQUEUE_DELAYED)))
And, this is equal to :
if (!p->se.sched_delayed || (flags & ENQUEUE_DELAYED))

More details here:
https://lore.kernel.org/all/84441660bef0a5e67fd09dc3787178d0276dad31.1740664400.git.hongyan.xia2@xxxxxxx/T/#ma2505e90489316eb354390b42dee9d053f6fd1e9


Ah! Correct! I got my "&&"s and "||"s confused. Sorry about that.


On a larger note ...

An enqueue of a delayed task will call requeue_delayed_entity() which
will only enqueue p->se on its cfs_rq and do an update_load_avg() for
that cfs_rq alone.

With cgroups enabled, this cfs_rq might not be the root cfs_rq and
cfs_rq_util_change() will not call cpufreq_update_util() leaving the
CPU running at the older frequency despite the updated uclamp
constraints.

If think cfs_rq_util_change() should be called for the root cfs_rq
when a task is delayed or when it is re-enqueued to re-evaluate
the uclamp constraints.

I think you're referring to a different issue with the delayed-task's
util_ets/uclamp.
This issue is unrelated to util-est and uclamp, because even without
these two features, the problem you're mentioning still exists.
Specifically, if the delayed-task is not the root CFS task, the CPU
frequency might not be updated in time when the delayed-task is
enqueued.
Maybe we could add the update_load_avg() in clear_delayed to solve the issue?

I thought something like:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a0c4cd26ee07..007b0bb91529 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5473,6 +5473,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (sched_feat(DELAY_DEQUEUE) && delay &&
!entity_eligible(cfs_rq, se)) {
update_load_avg(cfs_rq, se, 0);
+ /* Reevaluate frequency since uclamp may have changed */
+ if (cfs_rq != rq->cfs)
+ cfs_rq_util_change(rq->cfs, 0);
set_delayed(se);
return false;
}
@@ -6916,6 +6919,9 @@ requeue_delayed_entity(struct sched_entity *se)
}
update_load_avg(cfs_rq, se, 0);
+ /* Reevaluate frequency since uclamp may have changed */
+ if (cfs_rq != rq->cfs)
+ cfs_rq_util_change(rq->cfs, 0);
clear_delayed(se);
}
---

to ensure that schedutil knows about any changes in the uclamp
constraints at the first dequeue, at reenqueue.


-->8--

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a0c4cd26ee07..c75d50dab86b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5435,6 +5435,7 @@ static void clear_delayed(struct sched_entity *se)
for_each_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);

+ update_load_avg(cfs_rq, se, UPDATE_TG);

For finish_delayed_dequeue_entity() calling into clear_delayed(),
UPDATE_TG would be done already in dequeue_entity().

For requeue, I believe the motivation to skip UPDATE_TG was for
the entity to compete with its original weight to be picked off
later.

cfs_rq->h_nr_runnable++;
if (cfs_rq_throttled(cfs_rq))
break;

---

BR
xuewen

--
Thanks and Regards,
Prateek