On 04/10/2023 11:04, Hongyan Xia wrote:
From: Hongyan Xia <hongyan.xia2@xxxxxxx>
Because util_avg_uclamp is not directly managed by PELT, it lacks the
nice property of slowly decaying to a lower value, resulting in
performance degradation due to premature frequency drops.
Add functions to decay root cfs utilization and tasks that are not on
the rq. This way, we get the benefits of PELT while still maintaining
uclamp. The rules are simple:
1. When a task is se->on_rq, enforce its util_avg_uclamp to be within its
uclamp range.
2. When a task is !se->on_rq, let PELT decay its util_avg_uclamp.
3. When the root CFS util drops, decay towards the lower target frequency
via PELT instead of dropping to it immediately.
TODO: Can we somehow integrate this uclamp sum aggregation directly into
util_avg, so that we don't need to introduce a new util_avg_uclamp
signal and don't need to simulate PELT decay?
That's a good question. I'm wondering why you were not able to integrate
the maintenance of the util_avg_uclamp values inside the existing PELT
update functionality in fair.c (__update_load_avg_xxx(),
propagate_entity_load_avg() -> update_tg_cfs_util(), etc.).
Why do you need extra functions like ___decay_util_avg_uclamp_towards()
and ___update_util_avg_uclamp() for this?
Signed-off-by: Hongyan Xia <hongyan.xia2@xxxxxxx>
---
kernel/sched/fair.c | 20 +++++++++
kernel/sched/pelt.c | 103 ++++++++++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 2 +
3 files changed, 119 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33e5a6e751c0..420af57d01ee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4311,17 +4311,22 @@ static inline int
update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
{
 	unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
+	unsigned int removed_root_util = 0;

-	unsigned int removed_root_util = 0;
+	unsigned int __maybe_unused removed_root_util = 0;

Otherwise you get `warning: unused variable ‘removed_root_util’` w/ !CONFIG_UCLAMP_TASK.
[...]
#ifdef CONFIG_UCLAMP_TASK
+static void ___decay_util_avg_uclamp_towards(u64 now,
+					     u64 last_update_time,
+					     u32 period_contrib,
+					     unsigned int *old,
+					     unsigned int new_val)
+{
+	unsigned int old_val = READ_ONCE(*old);
+	u64 delta, periods;
+
+	if (old_val <= new_val) {
+		WRITE_ONCE(*old, new_val);
+		return;
+	}
Why is the function called `decay`? In case `new >= old` you set old =
new and bail out. So it's also more like an `update` function?
+	if (!last_update_time)
+		return;
+	delta = now - last_update_time;
+	if ((s64)delta < 0)
+		return;
+	delta >>= 10;
+	if (!delta)
+		return;
+
+	delta += period_contrib;
+	periods = delta / 1024;
+	if (periods) {
+		u64 diff = old_val - new_val;
+
+		/*
+		 * Let's assume 3 tasks, A, B and C. A is still on rq but B and
+		 * C have just been dequeued. The cfs.avg.util_avg_uclamp has
+		 * become A but root_cfs_util_uclamp just starts to decay and is
+		 * now still A + B + C.
+		 *
+		 * After p periods with y being the decay factor, the new
+		 * root_cfs_util_uclamp should become
+		 *
+		 *   A + B * y^p + C * y^p == A + (A + B + C - A) * y^p
+		 *     == cfs.avg.util_avg_uclamp +
+		 *        (root_cfs_util_uclamp_at_the_start - cfs.avg.util_avg_uclamp) * y^p
+		 *     == cfs.avg.util_avg_uclamp + diff * y^p
+		 *
+		 * So, instead of summing up each individual decayed value, we
+		 * can just decay the diff and not bother with the summation
+		 * at all. This is why we decay the diff here.
+		 */
+		diff = decay_load(diff, periods);
+		WRITE_ONCE(*old, new_val + diff);
+	}
+}
Looks like ___decay_util_avg_uclamp_towards() is used for:
(1) tasks with !se->on_rq to decay before enqueue
(2) rq->root_cfs_util_uclamp to align with
rq_of(cfs_rq)->cfs.avg.util_avg_uclamp
All the cfs_rq's and the taskgroup se's seem to be updated only in
___update_util_avg_uclamp() (which also handles the propagation towards
the root taskgroup).
[...]