Re: [RFC PATCH v2 1/7] Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"

From: Hongyan Xia
Date: Tue Mar 19 2024 - 13:05:48 EST


On 19/03/2024 15:34, Dietmar Eggemann wrote:
On 01/02/2024 14:11, Hongyan Xia wrote:
From: Hongyan Xia <Hongyan.Xia2@xxxxxxx>

That commit creates further problems because 0 spare capacity can
either be a real indication that the CPU is maxed out, or a sign that
the CPU is UCLAMP_MAX throttled, but we end up giving all such CPUs a
chance, which can result in bogus energy calculations. It also tends
to schedule tasks on the same CPU and requires load balancing patches.
Sum aggregation solves these problems and this patch is not needed.

This reverts commit 6b00a40147653c8ea748e8f4396510f252763364.

I assume you did this revert especially for the 'Scenario 5: 8 tasks
with UCLAMP_MAX of 120' testcase?

More or less. Actually you can already see the problem in Scenario 1. Ideally the 4 uclamp_max tasks should be evenly distributed on the 4 little CPUs, but from time to time task placement puts more than one such task on the same CPU, leaving some little CPUs unoccupied.

IMHO, the issue is especially visible in compute_energy()'s busy_time
computation with a valid destination CPU (dst_cpu >= 0). I.e. when we
have to add performance domain (pd) and task busy time.

find_energy_efficient_cpu() (feec())

    for each pd
        for each cpu in pd

            set {prev_,max}_spare_cap

        bail if prev_ and max_spare_cap < 0 (was == 0 before)

        {base_,prev_,cur_}energy = compute_energy()
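
In code, the check in question looks roughly like this (a simplified
sketch of the logic around the reverted commit, not verbatim kernel
code; the signed type is what allows -1 to mean "no candidate"):

    /* Before the reverted commit: a CPU whose spare capacity is
     * exactly 0 was never recorded as a candidate.
     */
    if (max_spare_cap_cpu < 0 && prev_spare_cap == 0)
            continue;

    /* After it: 0 spare capacity still yields a candidate, so a
     * saturated or UCLAMP_MAX-throttled CPU gets its energy
     * computed, too.
     */
    if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
            continue;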

So with the patch we potentially compute energy for a saturated PD
according:

compute_energy()

    if (dst_cpu >= 0)
        busy_time = min(eenv->pd_cap, eenv->busy_time + eenv->task_busy_time)
                        <----(a)--->  <--------------(b)------------------->

    energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap)

If (b) > (a) then we're saturated and 'energy' is bogus.
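
Expressed as a condition (a minimal sketch using the eenv fields
above):

    /*
     * The min() clips (b) to (a): once the summed busy time exceeds
     * the PD capacity, adding more tasks no longer changes busy_time.
     */
    if (eenv->busy_time + eenv->task_busy_time > eenv->pd_cap) {
            /* saturated: the computed 'energy' is bogus */
    }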

Yeah, I think what's happening is that placing more tasks on the same CPU won't increase the computed energy, so in the end task placement thinks it's the better decision. The root issue is that once you have uclamp_max, you can theoretically fit an infinite number of such tasks on the same CPU.
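
A toy example with made-up numbers illustrates this (standalone
sketch mirroring the min() above; all values are hypothetical):

    #include <stdio.h>

    static unsigned long min_ul(unsigned long a, unsigned long b)
    {
            return a < b ? a : b;
    }

    int main(void)
    {
            unsigned long pd_cap = 446;        /* one little CPU's capacity */
            unsigned long busy_time = 446;     /* CPU already saturated */
            unsigned long task_busy_time = 100;

            /* The busy time fed to em_cpu_energy() is clipped to
             * pd_cap, so the extra task looks free to the energy model.
             */
            printf("without task: %lu\n", min_ul(pd_cap, busy_time));
            printf("with task:    %lu\n",
                   min_ul(pd_cap, busy_time + task_busy_time));
            return 0;
    }

Both lines print 446, i.e. feec() sees zero energy cost for stacking
yet another uclamp_max task on the saturated CPU.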

The way to fix this is up for discussion:

(1) feec() returning prev_cpu
(2) feec() returning -1 (forcing wakeup into sis() -> sic())
(3) using uclamped values for task and rq utilization (see the sketch after this list)
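
A rough idea of what (3) could look like (a sketch only, not a tested
patch; it assumes the existing uclamp_rq_util_with() helper, which
clamps a utilization value by the rq's aggregated uclamp_min/max):

    /* raw rq + task utilization, as used for busy_time today */
    unsigned long util = cpu_util(cpu, p, -1, 0);

    /* Clamp it before it feeds the energy estimate, so a
     * UCLAMP_MAX-throttled CPU no longer masquerades as busy
     * beyond its clamp.
     */
    util = uclamp_rq_util_with(cpu_rq(cpu), util, p);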

None of those immediately gives the desired task placement on
mainline (2 tasks on each of the 4 little CPUs and no tasks on the 2
big CPUs on my [l B B l l l] machine w/ CPU capacities = [446 1024
1024 446 446 446]), which you can achieve with uclamp sum aggregation.

Personally, from the results I've seen, I definitely prefer (3), although (3) has other problems. One thing is that sum aggregation pushes up utilization under uclamp_min, but the task's energy consumption definitely won't be that high; the real energy lies somewhere between its util_avg and its util_avg_uclamp.

I haven't seen this as a real problem, but maybe we can see even better task placement if this is accounted for.
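
If one wanted to account for it, the two signals at least bound the
answer (a sketch reusing the em_cpu_energy() call shown earlier;
busy_time_raw/busy_time_uclamp are hypothetical names for busy time
derived from util_avg and util_avg_uclamp respectively):

    /* real consumption lies somewhere in [energy_lo, energy_hi] */
    energy_lo = em_cpu_energy(pd->em_pd, max_util, busy_time_raw,
                              eenv->cpu_cap);
    energy_hi = em_cpu_energy(pd->em_pd, max_util, busy_time_uclamp,
                              eenv->cpu_cap);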

[...]