Re: [PATCH V2 1/2] sched/fair: Prevent cpu_busy_time from exceeding actual_cpu_capacity

From: Pierre Gondois
Date: Thu Jul 04 2024 - 13:01:48 EST


Hello,

On 6/27/24 18:15, Vincent Guittot wrote:
On Thu, 27 Jun 2024 at 04:02, Xuewen Yan <xuewen.yan94@xxxxxxxxx> wrote:

On Tue, Jun 25, 2024 at 9:05 PM Vincent Guittot
<vincent.guittot@xxxxxxxxxx> wrote:

On Mon, 24 Jun 2024 at 10:22, Xuewen Yan <xuewen.yan@xxxxxxxxxx> wrote:

Commit 3e8c6c9aac42 ("sched/fair: Remove task_util from effective utilization in feec()")
changed the capping of the PD's util from per-CPU to per-PD. But
because effective_cpu_util() may return a util bigger than the
actual_cpu_capacity, this can cause pd_busy_time calculation errors.

I'm still not convinced that this is an error. Your example used for v1 is :

The pd cpus are 4-7, and the arch_scale_cpu_capacity() is 1024, and because
of cpufreq-limit, the cpu_actual_cap = 512.

Then the eenv->cpu_cap = 512, the eenv->pd_cap = 2048;
effective_cpu_util(4) = 1024;
effective_cpu_util(5) = 1024;
effective_cpu_util(6) = 256;
effective_cpu_util(7) = 0;

so env->pd_busy_time = 2304

IIUC, with this configuration:
Before patch:
- env->pd_busy_time = 2304
After patch:
- env->pd_busy_time = 512 + 512 + 256 + 0 = 1280

But when computing the energy for the task to place:
compute_energy()
{
	if (dst_cpu >= 0)
		busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
}

the contribution of the task might be truncated.
E.g.:
Trying to place a task on CPU7 with task_busy_time=300.
Then in compute_energy():

Before patch:
busy_time = min(2048, 2304 + 300)
busy_time = 2048
-> busy_time delta = 2048 - 2304 = -256

After patch:
busy_time = min(2048, 1280 + 300)
busy_time = 1580
-> busy_time delta = 1580 - 1280 = 300
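
As a throwaway user-space sketch of that clamping arithmetic (the
busy-time and capacity values are hard-coded from the example above;
none of this is kernel code):

#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

int main(void)
{
	/* values taken from the example above */
	long pd_cap = 2048, task_busy_time = 300;
	long busy_before = 2304;	/* per-CPU utils summed, uncapped */
	long busy_after = 1280;		/* per-CPU utils capped at 512 each */

	/* compute_energy(): busy_time = min(pd_cap, busy_time + task_busy_time) */
	long with_task_before = MIN(pd_cap, busy_before + task_busy_time);
	long with_task_after = MIN(pd_cap, busy_after + task_busy_time);

	printf("before: delta = %ld\n", with_task_before - busy_before);	/* -256 */
	printf("after:  delta = %ld\n", with_task_after - busy_after);		/*  300 */
	return 0;
}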


Even if effective_cpu_util(4) = 1024 is above the current max compute
capacity of 512, this also means that the activity of cpu4 will run
twice as long. If you cap effective_cpu_util(4) to 512 you miss the
information that it will run twice as long at the selected OPP. The
extreme case being:
effective_cpu_util(4) = 1024;
effective_cpu_util(5) = 1024;
effective_cpu_util(6) = 1024;
effective_cpu_util(7) = 1024;

in this case env->pd_busy_time = 4096

If we cap, we can't tell the difference between the two cases.

Do you have more details about the problem you are facing?

Because of the cpufreq limit, the OPP is also limited, and when computing the energy:

energy = ps->cost * sum_util = ps->cost * eenv->pd_busy_time;

Because of the cpufreq limit, ps->cost is the cost of the limited
frequency's OPP instead of the max frequency's cost.
So the energy is determined by pd_busy_time.

Still the example above:

The pd cpus are 4-7, and the arch_scale_cpu_capacity() is 1024, and because
of cpufreq-limit, the cpu_actual_cap = 512.

Then the eenv->cpu_cap = 512, the eenv->pd_cap = 2048;
effective_cpu_util(4) = 1024;
effective_cpu_util(5) = 1024;
effective_cpu_util(6) = 256;
effective_cpu_util(7) = 0;

Before the patch:
env->pd_busy_time = min(1024+1024+256, eenv->pd_cap) = 2048.
However, because effective_cpu_util(7) = 0, only three CPUs are
actually busy, so the 2048 is bigger than the capacity those CPUs can
actually provide (3 * 512 = 1536).

After the patch:
cpu_util(4) = min(1024, eenv->cpu_cap) = 512;
cpu_util(5) = min(1024, eenv->cpu_cap) = 512;
cpu_util(6) = min(256, eenv->cpu_cap) = 256;
cpu_util(7) = 0;
env->pd_busy_time = min(512+512+256, eenv->pd_cap) = 1280.
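
To make the energy impact concrete, assume a hypothetical cost for the
capped OPP, say cost_lim (its actual value does not matter here):

  before: energy ~= cost_lim * pd_busy_time = cost_lim * 2048
  after:  energy ~= cost_lim * pd_busy_time = cost_lim * 1280

i.e. the pre-patch estimate is ~60% higher for the very same PD state.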

If we take a similar example, on a pd with 2 CPUs and:
- effective_cpu_util(0) = 1024
- effective_cpu_util(1) = 0
and:
- eenv->cpu_cap = 100
- eenv->pd_cap = 100 + 100 = 200

Before the patch:
- env->pd_busy_time = min(1024 + 0, eenv->pd_cap) = 200

After the patch:
- cpu_util(0) = min(1024, eenv->cpu_cap) = 100
- cpu_util(1) = min(0, eenv->cpu_cap) = 0
and:
- env->pd_busy_time = min(100 + 0, eenv->pd_cap) = 100

-------------

Same example while adding additional empty CPUs:
- effective_cpu_util(0) = 1024
- effective_cpu_util(1) = 0
- effective_cpu_util(2) = 0
- effective_cpu_util(3) = 0
and:
- eenv->cpu_cap = 100
- eenv->pd_cap = 100 * 4 = 400

Before the patch:
- env->pd_busy_time = min(1024 + 0 + 0 + 0, eenv->pd_cap) = 400

After the patch:
- cpu_util(0) = min(1024, eenv->cpu_cap) = 100
- cpu_util(1) = min(0, eenv->cpu_cap) = 0
- ...
and:
- env->pd_busy_time = min(100 + 0 + 0 + 0, eenv->pd_cap) = 100

-------------

What seems strange is that the more empty CPUs we add to a pd, the
more energy the first computation reports, which indeed seems
incorrect (unless I missed something).
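
A throwaway sketch of the two capping schemes for this "one busy CPU
plus idle CPUs" case (plain user-space C; all names are local
stand-ins, not the kernel structures):

#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

int main(void)
{
	unsigned long cpu_cap = 100;
	unsigned long util0 = 1024;	/* only CPU0 has any utilization */

	for (int nr_cpus = 2; nr_cpus <= 4; nr_cpus++) {
		unsigned long pd_cap = cpu_cap * nr_cpus;

		/* before the patch: cap the summed utilization at pd_cap */
		unsigned long before = MIN(pd_cap, util0);

		/* after the patch: cap each CPU's utilization at cpu_cap */
		unsigned long after = MIN(cpu_cap, util0);

		printf("%d CPUs: before=%lu after=%lu\n", nr_cpus, before, after);
	}
	/* prints before=200/300/400 but after=100/100/100: the pre-patch
	 * estimate grows with the number of idle CPUs in the PD */
	return 0;
}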


As a result, without this patch, the computed energy is bigger than the actual energy.

And even if cpu4 would run twice as long, the energy may not be equal,
because:

 *             ps->power * cpu_max_freq
 *   cpu_nrg = ------------------------ * cpu_util		(3)
 *               ps->freq * scale_cpu

the ps->power = C * f * V^2 (dynamic power), and then:

 *              C * V^2 * cpu_max_freq
 *   cpu_nrg = ------------------------ * cpu_util		(3)
 *                     scale_cpu

because the limited frequency's voltage is not equal to the max
frequency's voltage.
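
A rough back-of-the-envelope for that point, assuming purely dynamic
power (P = C * f * V^2) and, just for illustration, a cap at half the
max frequency:

  at f_max:            E_max = C * f_max * V_max^2 * t
  at f_lim = f_max/2:  E_lim = C * (f_max/2) * V_lim^2 * 2t
                             = C * f_max * V_lim^2 * t

  => E_lim / E_max = (V_lim / V_max)^2

So even though the capped CPU runs twice as long for the same work,
the energy only matches the max-OPP case if V_lim = V_max, which is
normally not the case.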

I'm still struggling to understand why it's wrong. If the frequency is
capped, we will never go above this limited frequency and its
associated voltage, so there is no reason to consider the max-freq's
voltage. If there are more things to do than the actual capacity can
handle per unit of time, then we will run for more than one unit of time.

nrg of PD = /Sum(cpu) ps->power * cpu-running-time

ps->power is fixed because of the limited frequency constraint

we estimate cpu-running-time = utilization / ps->performance
with
- utilization = util_avg
- ps->performance = ps->freq / cpu_max_freq * arch_scale_cpu_capacity()

Up to now we were assuming that utilization was always lower than
performance, otherwise the system was overutilized and we fall back to
performance mode. But when the frequency of a CPU is limited by
userspace or thermal mitigation, the utilization can become higher
than the limited capacity, which translates into the CPU running
longer.
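
A minimal user-space sketch of that estimate (all values are made-up
stand-ins, not the kernel structures):

#include <stdio.h>

int main(void)
{
	/* PD capped at half its max frequency: ps->performance = 512 out of 1024 */
	double ps_power = 200.0;	/* power at the capped OPP, arbitrary units */
	double ps_performance = 512.0;
	double util[4] = { 1024.0, 1024.0, 256.0, 0.0 };	/* CPUs 4-7 */
	double nrg = 0.0;

	for (int i = 0; i < 4; i++) {
		/* cpu-running-time = utilization / ps->performance: util above
		 * the capped capacity just means the CPU runs longer */
		double running_time = util[i] / ps_performance;
		nrg += ps_power * running_time;
	}
	printf("estimated PD energy: %.1f\n", nrg);	/* 200 * (2 + 2 + 0.5 + 0) = 900 */
	return 0;
}

Capping each util at 512 before summing would give the same
200 * (1 + 1 + 0.5 + 0) = 500 whether the first two CPUs are at 1024
or at exactly 512, which is the information loss being described.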

I thought EAS was comparing instantaneous power and not energy,
i.e. looking at how the energy computation is done:

 *             ps->power * cpu_max_freq
 *   cpu_nrg = ------------------------ * cpu_util		(3)
 *               ps->freq * scale_cpu

cpu_nrg should have the same dimension as ps->power (i.e. energy/s).
From this PoV, the energy computation should not take into account how
much time a task is expected to run. But this might be a side discussion.

Regards,
Pierre


So clamp the cpu_busy_time to eenv->cpu_cap, which is the
actual_cpu_capacity.

Fixes: 3e8c6c9aac42 ("sched/fair: Remove task_util from effective utilization in feec()")
Signed-off-by: Xuewen Yan <xuewen.yan@xxxxxxxxxx>
Tested-by: Christian Loehle <christian.loehle@xxxxxxx>
---
V2:
- change commit message.
- remove the eenv->pd_cap capping in eenv_pd_busy_time(). (Dietmar)
- add Tested-by.
---
kernel/sched/fair.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a5b1ae0aa55..5ca6396ef0b7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7864,16 +7864,17 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
 					     struct cpumask *pd_cpus,
 					     struct task_struct *p)
 {
-	unsigned long busy_time = 0;
 	int cpu;
 
+	eenv->pd_busy_time = 0;
+
 	for_each_cpu(cpu, pd_cpus) {
 		unsigned long util = cpu_util(cpu, p, -1, 0);
 
-		busy_time += effective_cpu_util(cpu, util, NULL, NULL);
+		util = effective_cpu_util(cpu, util, NULL, NULL);
+		util = min(eenv->cpu_cap, util);
+		eenv->pd_busy_time += util;
 	}
-
-	eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
 }

/*
--
2.25.1