Re: [RFC PATCH 1/6] sched/uclamp: Track uclamped util_avg in sched_avg

From: Hongyan Xia
Date: Thu Nov 09 2023 - 11:05:37 EST


On 31/10/2023 15:52, Dietmar Eggemann wrote:
> On 04/10/2023 11:04, Hongyan Xia wrote:
>> From: Hongyan Xia <hongyan.xia2@xxxxxxx>
>
> [...]
>
>> @@ -6445,6 +6450,21 @@ static int sched_idle_cpu(int cpu)
>>  }
>>  #endif
>> +void ___update_util_avg_uclamp(struct sched_avg *avg, struct sched_entity *se);
>
> IMHO, `struct sched_avg *avg` can only be that of a cfs_rq, so passing
> a cfs_rq would eliminate the question of whether this can come from
> another se.

At the moment, yes, the avg can only come from a cfs_rq. The reason I kept sched_avg is that once we have sum aggregation for RT tasks, it's very likely we will end up calling this function on rt_rq->avg, so taking just a sched_avg here will also work for RT in the future.
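
As a rough sketch of that future direction (note: rt_rq does not carry a
sched_avg in this series, so rt_rq->avg below is purely a hypothetical
field):

	/*
	 * Hypothetical RT-side caller, only to illustrate why the helper
	 * takes a plain sched_avg rather than a cfs_rq. Every task_struct
	 * embeds a sched_entity, so &p->se is valid even for an RT task.
	 */
	static void update_rt_util_avg_uclamp(struct rq *rq,
					      struct task_struct *p)
	{
		/* rt_rq->avg is an assumption, not part of this series */
		___update_util_avg_uclamp(&rq->rt.avg, &p->se);
	}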

>> +static void update_se_chain(struct task_struct *p)
>> +{
>> +#ifdef CONFIG_UCLAMP_TASK
>> +	struct sched_entity *se = &p->se;
>> +	struct rq *rq = task_rq(p);
>> +
>> +	for_each_sched_entity(se) {
>> +		struct cfs_rq *cfs_rq = cfs_rq_of(se);
>> +
>> +		___update_util_avg_uclamp(&cfs_rq->avg, se);
>> +	}
>> +#endif
>> +}
>>  /*
>>   * The enqueue_task method is called before nr_running is
>>   * increased. Here we update the fair scheduling stats and
>> @@ -6511,6 +6531,16 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>  		goto enqueue_throttle;
>>  	}
>> +	/*
>> +	 * Re-evaluate the se hierarchy now that on_rq is true. This is
>> +	 * important to enforce uclamp the moment a task with uclamp is
>> +	 * enqueued, rather than waiting a timer tick for uclamp to kick in.
>> +	 *
>> +	 * XXX: This duplicates some of the work already done in the above for
>> +	 * loops.
>> +	 */
>> +	update_se_chain(p);

> I try to figure out why this is necessary here:
>
> enqueue_task_fair()
>
>   for_each_sched_entity()
>     enqueue_entity()
>       update_load_avg()
>         __update_load_avg_se()
>           ___update_util_avg_uclamp()         <-- if se->on_rq, update
>                                                   p (se) if we cross
>                                                   PELT period
>                                                   boundaries (1)
>           ___decay_util_avg_uclamp_towards()  <-- decay p (se) (2)
>
>       enqueue_util_avg_uclamp()               <-- enqueue p into cfs_rq (3)
>
>       se->on_rq = 1                           <-- set p (se) on_rq (4)
>
>   for_each_sched_entity()
>     update_load_avg()                         <-- update all on_rq se's
>                                                   in the hierarchy (5)
>
>   update_se_chain()                           <-- update p (se) and its
>                                                   se hierarchy (6)
>
> (1) Skip p since it is !se->on_rq.
>
> (2) Decay p->se->avg.util_avg_uclamp to 0 since it was sleeping.
>
> (3) Attach p to its cfs_rq.
>
> ...
>
> (6) Update all se's and cfs_rq's involved in p's task_group
>     hierarchy (including propagation from se (level=x+1) to cfs_rq
>     (level=x)).
>
> The question for me is why you can't integrate the util_avg_uclamp
> signals for se's and cfs_rq's/rq's much closer into the existing PELT
> functions?

So the problem is that when we enqueue the task (say with a UCLAMP_MIN of 200), at that exact moment se->on_rq is false, so we only decay the signal and do not enforce UCLAMP_MIN. Further up in the hierarchy we do call update_load_avg(), but the se of the task has already been processed by then, so UCLAMP_MIN has not taken any effect. To make sure UCLAMP_MIN is effective immediately, I just re-evaluate the whole hierarchy from bottom to top.
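
To illustrate the ordering (a heavily simplified sketch of
enqueue_entity() in fair.c, not the actual code):

	static void enqueue_entity_sketch(struct cfs_rq *cfs_rq,
					  struct sched_entity *se)
	{
		/*
		 * ___update_util_avg_uclamp() runs underneath this, but
		 * with se->on_rq still 0 the task's util_avg_uclamp only
		 * decays; nothing clamps it up to UCLAMP_MIN here.
		 */
		update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);

		/* ... */

		se->on_rq = 1;	/* clamping can only apply from now on */
	}

The second for_each_sched_entity() loop in enqueue_task_fair() then only
sees the ancestor levels, so without the extra bottom-up pass the boost
would not show up until the next update.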

I probably didn't quite catch what you said here. Could you elaborate a bit on what 'much closer' means?

Hongyan