Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0

From: Lukasz Luba
Date: Mon May 22 2023 - 04:31:26 EST


Hi Qais,

I have a question regarding the 'soft cpu affinity'.

On 2/11/23 17:50, Qais Yousef wrote:
On 02/09/23 19:02, Dietmar Eggemann wrote:
On 07/02/2023 10:45, Vincent Guittot wrote:
On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:

When uclamp_max is being used, the util of the task could be higher than
the spare capacity of the CPU, but due to uclamp_max value we force fit
it there.

The way the condition for checking for max_spare_cap in
find_energy_efficient_cpu() was constructed; it ignored any CPU that has
its spare_cap less than or _equal_ to max_spare_cap. Since we initialize
max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and
hence ending up never performing compute_energy() for this cluster and
missing an opportunity for a better energy efficient placement to honour
uclamp_max setting.

max_spare_cap = 0;
cpu_cap = capacity_of(cpu) - task_util(p); // 0 if task_util(p) is high

...

util_fits_cpu(...); // will return true if uclamp_max forces it to fit

s/true/1/ ?


...

// this logic will fail to update max_spare_cap_cpu if cpu_cap is 0
if (cpu_cap > max_spare_cap) {
max_spare_cap = cpu_cap;
max_spare_cap_cpu = cpu;
}

prev_spare_cap suffers from a similar problem.

Fix the logic by converting the variables into long and treating -1
value as 'not populated' instead of 0 which is a viable and correct
spare capacity value.

The issue I see here is in case we don't have any spare capacity left,
the energy calculation (in terms CPU utilization) isn't correct anymore.

Due to `effective_cpu_util()` returning `max=arch_scale_cpu_capacity()`
you never know how big the `busy_time` for the PD really is in this moment.

eenv_pd_busy_time()

for_each_cpu(cpu, pd_cpus)
busy_time += effective_cpu_util(..., ENERGY_UTIL, ...)
^^^^^^^^^

with:

sum_util = min(busy_time + task_busy_time, pd_cap)
^^^^^^^^^

freq = (1.25 * max_util / max) * max_freq

energy = (perf_state(freq)->cost / max) * sum_util


energy is not related to CPU utilization anymore (since there is no idle
time/spare capacity) left.

Am I right that what you're saying is that the energy calculation for the PD
will be capped to a certain value and this is why you think the energy is
incorrect?

What should be the correct energy (in theory at least)?


So EAS keeps packing on the cheaper PD/clamped OPP.

Which is the desired behavior for uclamp_max?

The only issue I see is that we want to distribute within a pd. Which is
something I was going to work on and send after later - but can lump it in this
series if it helps.


E.g. Juno-r0 [446 1024 1024 446 446 446] with 6 8ms/16ms uclamp_max=440
tasks all running on little PD, cpumask=0x39. The big PD is idle but
never beats prev_cpu in terms of energy consumption for the wakee.

IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max
is two folds:

1. Prevent tasks from consuming energy.
2. Keep them away from expensive CPUs.

2 is actually very important for 2 reasons:

a. Because of max aggregation - any uncapped tasks that wakes up will
cause a frequency spike on this 'expensive' cpu. We don't have
a mechanism to downmigrate it - which is another thing I'm working
on.
b. It is desired to keep these bigger cpu idle ready for more important
work.

For 2, generally we don't want these tasks to steal bandwidth from these CPUs
that we'd like to preserve for other type of work.

I'm a bit afraid about such 'strong force'. That means the task would
not go via EAS if we set uclamp_max e.g. 90, while the little capacity
is 125. Or am I missing something?

This might effectively use more energy for those tasks which can run on
any CPU and EAS would figure a good energy placement. I'm worried
about this, since we have L3+littles in one DVFS domain and the L3
would be only bigger in future.

IMO to keep the big cpus more in idle, we should give them big energy
wake up cost. That's my 3rd feature to the EM presented in OSPM2023.


Of course userspace has control by selecting the right uclamp_max value. They
can increase it to allow a spill to next pd - or keep it low to steer them more
strongly on a specific pd.

This would we be interesting to see in practice. I think we need such
experiment, for such changes.

Regards,
Lukasz