Re: [PATCH v2 0/8] uclamp sum aggregation

From: Hongyan Xia
Date: Thu Mar 06 2025 - 06:40:45 EST


Hi Xuewen,

On 06/03/2025 11:12, Xuewen Yan wrote:
Hi Hongyan,

On Tue, Mar 4, 2025 at 10:26 PM Hongyan Xia <hongyan.xia2@xxxxxxx> wrote:

This series gives an alternative implementation that addresses some of
the problems in uclamp max aggregation. Sum aggregation mostly gives:

1. Simplicity. Sum aggregation implements uclamp with less than half the
code of max aggregation.
2. Effectiveness. Sum aggregation shows better uclamp effectiveness,
either in benchmark scores or sometimes in energy efficiency.

The key idea of sum aggregation is fairly simple. Each task has a
util_avg_bias, which is obtained by:

util_avg_bias = clamp(util_avg, uclamp_min, uclamp_max) - util_avg;

If a CPU has N tasks, p1, p2, p3... pN, then we sum the biases up and
obtain a rq total bias:

rq_bias = util_avg_bias1 + util_avg_bias2... + util_avg_biasN;

Then we use the biased rq utilization rq_util + rq_bias to select OPP
and to schedule tasks.
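
As a rough illustration of the two formulas above, here is a minimal
user-space sketch. It is not the code from the series: struct task_sample
and the helper names are hypothetical, and only the arithmetic
(clamp, subtract, sum) is taken from the description.

```c
#include <stddef.h>

/* Hypothetical per-task snapshot; not the kernel's task_struct. */
struct task_sample {
    long util_avg;
    long uclamp_min;
    long uclamp_max;
};

/* Plain clamp of v into [lo, hi]; assumes lo <= hi. */
long clamp_long(long v, long lo, long hi)
{
    if (v < lo)
        return lo;
    if (v > hi)
        return hi;
    return v;
}

/* util_avg_bias = clamp(util_avg, uclamp_min, uclamp_max) - util_avg */
long util_avg_bias(const struct task_sample *p)
{
    return clamp_long(p->util_avg, p->uclamp_min, p->uclamp_max) - p->util_avg;
}

/* rq_bias = util_avg_bias1 + util_avg_bias2 ... + util_avg_biasN */
long rq_bias(const struct task_sample *tasks, size_t n)
{
    long sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += util_avg_bias(&tasks[i]);
    return sum;
}

/* Biased rq utilization: rq_util + rq_bias. */
long rq_biased_util(long rq_util, const struct task_sample *tasks, size_t n)
{
    return rq_util + rq_bias(tasks, n);
}
```

For example, a task with util_avg 200 capped at uclamp_max 110 contributes
a bias of -90, while a task with util_avg 50 boosted by uclamp_min 100
contributes +50, so the rq total bias for the two is -40.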

PATCH BREAKDOWN:

Patch 1 reverts a patch that accommodates uclamp_max tasks under max
aggregation. This patch is not needed and creates other problems for sum
aggregation. It has been discussed elsewhere that this patch will be
improved, so the revert may not be needed in the future.

Patch 2, 3 and 4 implement sum aggregation.

Patch 5 and 6 remove max aggregation.

Patch 7 applies PELT decay on negative util_avg_bias. This improves
energy efficiency and task placement, but is not strictly necessary.

Patch 8 addresses the under-utilization problem of sum aggregation.

TESTING:

Two notebooks are shared at

https://nbviewer.org/github/honxia02/notebooks/blob/aac12d9becae2b2fe4690cbb672439fd884ede30/whitebox/max.ipynb
https://nbviewer.org/github/honxia02/notebooks/blob/aac12d9becae2b2fe4690cbb672439fd884ede30/whitebox/sum-offset.ipynb

The experiments in the notebooks were run on an Arm Juno r2 board. CPU0-3
are little cores with a capacity of 383. CPU4-5 are big cores. The rt-app
profiles used for these experiments are included in the notebooks.

Scenario 1: Scheduling 4 tasks with UCLAMP_MAX at 110.

The scheduling decisions are plotted in Out[11]. Both max and sum
aggregation understand the UCLAMP_MAX hint and schedule all 4 tasks on
the little cluster. Max aggregation sometimes schedules 2 tasks on 1 CPU,
and this is the reason why sum aggregation reverts the 1st commit.

Scenario 2: Scheduling 4 tasks with UCLAMP_MIN and UCLAMP_MAX at a value
slightly above the capacity of the little CPU.

Results are in Out[17]. The purpose is to use UCLAMP_MIN to place tasks
on the big core. Both max and sum aggregation handle this correctly.

Scenario 3: Task A is a task with a small utilization pinned to CPU4.
Task B is an always-running task pinned to CPU5, but UCLAMP_MAX capped
at 300. After a while, task A is then pinned to CPU5, joining B.

Results are in Out[23]. Max aggregation sees a frequency spike at
873.64s. When zoomed in, one can see square-wave-like utilization values
because of A periodically going to sleep. When A wakes up, its default
UCLAMP_MAX of 1024 will uncap B and reach the highest CPU frequency.
When A sleeps, B's UCLAMP_MAX will be in effect and will reduce rq
utilization. This happens repeatedly, hence the square wave. In
contrast, sum aggregation sees a normal increase in utilization when A
joins B, without any square-wave behavior.
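
The contrast in Scenario 3 can be sketched with a toy model. This is an
illustration under loud assumptions, not the kernel logic: it ignores PELT
decay and blocked utilization, the helper names are hypothetical, and the
utilization numbers (50 for A, 900 for B) are made up. Max aggregation is
modelled as clamping the summed utilization with the largest uclamp values
among runnable tasks; sum aggregation as adding each runnable task's bias.

```c
#include <stddef.h>

/* Hypothetical task snapshot for the toy model. */
struct task_sample {
    long util_avg;
    long uclamp_min;
    long uclamp_max;
    int runnable;
};

/* Toy max aggregation: clamp rq util by the max of runnable tasks' clamps. */
long max_aggr_util(const struct task_sample *t, size_t n)
{
    long util = 0, rq_min = 0, rq_max = 0;
    int any = 0;

    for (size_t i = 0; i < n; i++) {
        if (!t[i].runnable)
            continue;
        util += t[i].util_avg;
        if (t[i].uclamp_min > rq_min)
            rq_min = t[i].uclamp_min;
        if (t[i].uclamp_max > rq_max)
            rq_max = t[i].uclamp_max;
        any = 1;
    }
    if (!any)
        return 0;
    if (util > rq_max)
        util = rq_max;
    if (util < rq_min)
        util = rq_min;
    return util;
}

/* Toy sum aggregation: rq util plus the sum of per-task biases. */
long sum_aggr_util(const struct task_sample *t, size_t n)
{
    long util = 0, bias = 0;

    for (size_t i = 0; i < n; i++) {
        long clamped;

        if (!t[i].runnable)
            continue;
        clamped = t[i].util_avg;
        if (clamped > t[i].uclamp_max)
            clamped = t[i].uclamp_max;
        if (clamped < t[i].uclamp_min)
            clamped = t[i].uclamp_min;
        util += t[i].util_avg;
        bias += clamped - t[i].util_avg;
    }
    return util + bias;
}
```

With A = {50, 0, 1024} and B = {900, 0, 300}, the toy max aggregation
swings between 950 (A awake, B uncapped) and 300 (A asleep), which is the
square wave. The toy sum aggregation only moves between 350 and 300.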

Scenario 4: 4 always-running tasks with UCLAMP_MAX of 110 pinned to the
little PD (CPU0-3). 4 same tasks pinned to the big PD (CPU4-5).
After a while, remove the CPU pinning of the 4 tasks on the big PD.

Results are in Out[29]. After unpinning, max aggregation moves all 8
tasks to the little cluster, but schedules 5 tasks on CPU0 and 1 each on
CPU1-3. In contrast, sum aggregation schedules 2 on each little CPU
after unpinning, which is the desired balanced task placement.

EVALUATION:

We backport patches to GKI kernel v6.1 on Pixel 9 and run Android
benchmarks.

Speedometer:

We run Speedometer 2.1 on Chrome v131 to test ADPF/uclamp effectiveness.
Because sum aggregation does not circumvent the 25% OPP margin, we
reduce uclamp values to 80% to be fair.

| aggregation | score | score % | CPU power % |
| max         | 192.4 |         |             |
| sum_0.8     | 230.8 | +19.96  | +31.54      |
| sum_tuned   | 201.8 | +4.89   | -0.41       |

We see a consistently higher score and higher average power consumption.
Note that a higher score also means a shorter run-time, so the total
energy increase for sum_0.8 is only 9.65%.
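
The 9.65% figure can be reproduced with a small calculation, assuming
(this is an assumption, not something stated in the results) that run-time
is inversely proportional to the Speedometer score, so that energy is
average power multiplied by 1/score. The function name below is
hypothetical.

```c
/*
 * Relative energy change in percent, given the score and average power
 * changes in percent. Assumes run-time scales as 1 / score.
 */
double energy_change_pct(double score_gain_pct, double power_gain_pct)
{
    double time_ratio = 1.0 / (1.0 + score_gain_pct / 100.0);
    double power_ratio = 1.0 + power_gain_pct / 100.0;

    /* energy = power * time, so the ratios multiply */
    return (power_ratio * time_ratio - 1.0) * 100.0;
}
```

Calling energy_change_pct(19.96, 31.54) gives roughly 9.65, matching the
sum_0.8 row.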

We then reduce uclamp values so that power consumption is roughly
the same. If we do so, then sum aggregation achieves slightly better
scores, shown in the sum_tuned row.

UIBench:

| aggregation | jank percentage | jank % | CPU power (mW) | power % |
| max         | 0.115%          |        | 158.1          |         |
| sum_0.8     | 0.129%          | +11.96 | 154.9          | -4.19   |

UIBench on Pixel 9 already has a low enough jank percentage by default.
Moving to sum aggregation gives a slightly higher jank percentage and
lower power consumption.

---
Changed in v2:
- Completely separate uclamp component from PELT and util_est.
- Separate util_est_uclamp into an individual patch.
- Address the under-utilization problem.
- Update Python notebooks to reflect the latest sched/tip.

Hongyan Xia (8):
Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is
0"
sched/uclamp: Track a new util_avg_bias signal
sched/uclamp: Add util_est_uclamp
sched/fair: Use util biases for utilization and frequency
sched/uclamp: Remove all uclamp bucket logic

I’ve recently been looking into the issue with uclamp and
delayed-dequeue, and I found that uclamp_rq_inc should be placed
before enqueue_task, which led to a patch.
Before sending the patch, I came across your series of patches. I
haven’t fully understood your patch yet, but it seems like
uclamp_rq_inc is no longer needed.
Do you think the patch below is still necessary?


I posted a fix for the issue you mentioned a few days ago here:

https://lore.kernel.org/lkml/84441660bef0a5e67fd09dc3787178d0276dad31.1740664400.git.hongyan.xia2@xxxxxxx/

I think we found the same issue, but our approaches are different. I fear that as more complexity goes into each sched_class like delayed dequeue, it's better to just let the sched_class handle how uclamp is enqueued and dequeued within itself rather than leaking into core.c. Would be nice if you could take a look at my fix.

Your patch is definitely necessary. The thing with this uclamp sum aggregation series is that there are still debates around it and it might take a while before everything is settled. So, at the moment we should view this series and the uclamp enqueue fix as separate things.

--->

Subject: [PATCH] sched/uclamp: Update the rq's uclamp before enqueue task

When a task's uclamp is set, we want the CPU frequency to increase
as quickly as possible when the task is enqueued.
Because the CPU frequency update happens during enqueue_task(),
the rq's uclamp needs to be updated before the task is enqueued.
For sched-delayed tasks, the rq uclamp should only be updated
when they are enqueued upon being woken up.

Signed-off-by: Xuewen Yan <xuewen.yan@xxxxxxxxxx>
---
kernel/sched/core.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67189907214d..b07e78910221 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1747,7 +1747,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
}
}

-static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
+static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p, int flags)
{
enum uclamp_id clamp_id;

@@ -1763,7 +1763,8 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
if (unlikely(!p->sched_class->uclamp_enabled))
return;

- if (p->se.sched_delayed)
+ /* Only inc the delayed task which is being woken up. */
+ if (p->se.sched_delayed && !(flags & ENQUEUE_DELAYED))
return;

for_each_clamp_id(clamp_id)
@@ -2031,7 +2032,7 @@ static void __init init_uclamp(void)
}

#else /* !CONFIG_UCLAMP_TASK */
-static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
+static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p, int flags) { }
static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
static inline void uclamp_fork(struct task_struct *p) { }
static inline void uclamp_post_fork(struct task_struct *p) { }
@@ -2067,12 +2068,9 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & ENQUEUE_NOCLOCK))
update_rq_clock(rq);

+ uclamp_rq_inc(rq, p, flags);
+
p->sched_class->enqueue_task(rq, p, flags);
- /*
- * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
- * ->sched_delayed.
- */
- uclamp_rq_inc(rq, p);

psi_enqueue(p, flags);

--

Thanks!

BR
---

sched/uclamp: Simplify uclamp_eff_value()
sched/uclamp: Propagate negative bias
sched/uclamp: Solve under-utilization problem

include/linux/sched.h | 8 +-
init/Kconfig | 32 ---
kernel/sched/core.c | 308 ++--------------------
kernel/sched/cpufreq_schedutil.c | 6 +-
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 430 ++++++++++++++++---------------
kernel/sched/pelt.c | 62 +++++
kernel/sched/rt.c | 4 -
kernel/sched/sched.h | 132 +++-------
kernel/sched/syscalls.c | 2 +
10 files changed, 341 insertions(+), 645 deletions(-)

--
2.34.1