Re: [RFC PATCH 3/3] sched/fair: Traverse cpufreq policies to detect capacity inversion

From: Dietmar Eggemann
Date: Tue Dec 13 2022 - 12:38:29 EST


On 12/12/2022 19:43, Qais Yousef wrote:
> On 12/09/22 17:47, Vincent Guittot wrote:

[...]

> HMP systems for 1k servers just don't make any sense. A desktop with 128 or
> even 256 HMP cores is a big stretch; and if that exists I don't think there's
> an overhead to worry about here; and I *did* consider this. I measured the
> impact if we have 128 and it was a mere 1 or 2 us extra. And that's on an
> underpowered Pinebook Pro. If such a system exists it'd probably be more
> performant.
>
>> uclamp_min must not set a CPU overutilized because the CPU is not overutilized
>> in this case. It's only the task that is misfit. You mostly try to bias some
>> behavior to fit your use case.
>
> Maybe we are talking about different things over here. As long as we agree it's
> a misfit task then we are aligned.

IMHO, utilization is about the running task, whereas the rq uclamp values
are maintained taking the runnable tasks into consideration as well. Maybe
that's the source of the different views here?

> As far as I know misfit required overutilized to re-enable load balance. But
> maybe there's a detail that's creating this confusion.

I think that Vincent is suggesting to let MF load balance happen even
in !OverUtilized (OU). We already gather the necessary load-balance
statistics in !OU today, so it is easy to do; roughly sketched below.
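
I.e. in find_busiest_group() the EAS/!OU bail-out would just move after
the misfit check, roughly like this (which is what Vincent's diff below
ends up doing):

	/* Misfit tasks should be dealt with regardless of the avg load,
	 * i.e. also in !OU ... */
	if (busiest->group_type == group_misfit_task)
		goto force_balance;

	/* ... only the remaining lb work is skipped in !OU under EAS. */
	if (sched_energy_enabled()) {
		struct root_domain *rd = env->dst_rq->rd;

		if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
			goto out_balanced;
	}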

>>>> doesn't mean that the cpu is overutilized and uclamp_max should be used
>>>
>>> It is a misfit task, which requires overutilized to be set to re-enable load
>>> balance for uclamp_min to upgrade it. For uclamp_max we should not set
>>> overutilized, agreed, and that's what we should be doing.
>>
>> That's probably the root of your problem here. The load balance is still
>> periodically called even when EAS is enabled but the latter prevents task
>> migration unless overutilized in order to not break the energy aware task
>
> Okay. For me this means load_balance is disabled since it's effectively doing
> nothing. So maybe it's a terminology problem with what I meant by load balance
> being disabled.
>
>> placement. But if a task is misplaced and a cpu finds it can help,
>> we should let it pull the task without disabling EAS. This will not enable the
>> performance spread behavior and we can expect the other small tasks to be
>> packed by EAS on the best cpu at next wakeup.
>>
>> So instead of trying to detect a very specific capacity inversion use case
>> during this periodic load balance, it's better to let a CPU that can
>> fix the misfit situation pull the task.
>
> I can't see the relation here.

To me it looks like there are 2 stories here:

(A) Why do we need `CPU capacity inversion` (CapInv)? What do we gain
    by changing `cap = capacity_orig - arch_scale_thermal_pressure()`
    (i) to `cap = capacity_orig - thermal_load_avg()` (ii)? Do we want
    to avoid a CPU as a possible target in cases where the pressure is
    already back to 0 and we still have a thermal PELT signal > 0
    because of its historic information? But this then only considers
    the decaying side of the thermal PELT signal. (A small sketch of
    (i) vs (ii) follows after (B).)

    Vincent's patch uses (ii) unconditionally, so CapInv is not needed
    to switch from (i) to (ii).

(B) Don't use util_fits_cpu() in cpu_overutilized() and replace this by
MF load balance in !OU.
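
For reference, (i) vs (ii) in util_fits_cpu() terms (taken from the
first hunk of the diff below; (i) is the instantaneous pressure, (ii)
the PELT-averaged one which also has the decaying tail):

	capacity_orig = capacity_orig_of(cpu);

	/* (i): instantaneous pressure, 0 again as soon as the cap is lifted */
	capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);

	/* (ii): PELT-averaged pressure, stays > 0 for a while after the cap
	 *       is lifted (the decaying side mentioned above) */
	capacity_orig_thermal = capacity_orig - thermal_load_avg(cpu_rq(cpu));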

[...]

>> I have reverted patches:
>> Revert "sched/fair: Detect capacity inversion"
>> Revert "sched/fair: Consider capacity inversion in util_fits_cpu()"
>> Revert "sched/uclamp: Make cpu_overutilized() use util_fits_cpu()"
>>
>> ---
>> kernel/sched/fair.c | 32 +++++++++++++++++---------------
>> 1 file changed, 17 insertions(+), 15 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4423681baf15..6e54afc0a6ec 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4539,7 +4539,7 @@ static inline int util_fits_cpu(unsigned long util,
>> * the time.
>> */
>> capacity_orig = capacity_orig_of(cpu);
>> - capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
>> + capacity_orig_thermal = capacity_orig - thermal_load_avg(cpu_rq(cpu));
>>
>> /*
>> * We want to force a task to fit a cpu as implied by uclamp_max.
>> @@ -4578,8 +4578,7 @@ static inline int util_fits_cpu(unsigned long util,
>> * 2. The system is being saturated when we're operating near
>> * max capacity, it doesn't make sense to block overutilized.
>> */
>> - uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
>> - uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
>> + uclamp_max_fits = (uclamp_max <= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE);
>> fits = fits || uclamp_max_fits;
>>
>> /*
>> @@ -4614,7 +4613,7 @@ static inline int util_fits_cpu(unsigned long util,
>> * handle the case uclamp_min > uclamp_max.
>> */
>> uclamp_min = min(uclamp_min, uclamp_max);
>> - if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
>> + if (util < uclamp_min)
>> fits = fits && (uclamp_min <= capacity_orig_thermal);
>>
>> return fits;
>> @@ -6064,7 +6063,10 @@ static inline void hrtick_update(struct rq *rq)
>> #ifdef CONFIG_SMP
>> static inline bool cpu_overutilized(int cpu)
>> {
>> - return !fits_capacity(cpu_util_cfs(cpu), capacity_of(cpu));
>> + unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
>> +
>> + return !(fits_capacity(cpu_util_cfs(cpu), capacity_of(cpu)) ||
>> + rq_util_max <= capacity_orig_of(cpu));
>> }
>>
>> static inline void update_overutilized_status(struct rq *rq)
>> @@ -10164,24 +10166,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>> */
>> update_sd_lb_stats(env, &sds);
>>
>> - if (sched_energy_enabled()) {
>> - struct root_domain *rd = env->dst_rq->rd;
>> -
>> - if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
>> - goto out_balanced;
>> - }
>> -
>> - local = &sds.local_stat;
>> - busiest = &sds.busiest_stat;
>> -
>> /* There is no busy sibling group to pull tasks from */
>> if (!sds.busiest)
>> goto out_balanced;
>>
>> + busiest = &sds.busiest_stat;
>> +
>> /* Misfit tasks should be dealt with regardless of the avg load */
>> if (busiest->group_type == group_misfit_task)
>> goto force_balance;
>>
>> + if (sched_energy_enabled()) {
>> + struct root_domain *rd = env->dst_rq->rd;
>> +
>> + if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
>> + goto out_balanced;
>> + }
>> +
>> /* ASYM feature bypasses nice load balance check */
>> if (busiest->group_type == group_asym_packing)
>> goto force_balance;
>> @@ -10194,6 +10195,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>> if (busiest->group_type == group_imbalanced)
>> goto force_balance;
>>
>> + local = &sds.local_stat;
>> /*
>> * If the local group is busier than the selected busiest group
>> * don't try and pull any tasks.
>> --
>> 2.17.1

Specific questions regarding this patch proposal so I can fully
understand the intention:

I see 4 major changes here:

(1) Remove `CPU capacity inversion` (CapInv)

(2) Don't use util_fits_cpu() in cpu_overutilized()

(3) Use `capacity = capacity_orig - thermal_load_avg()` (ii) (and not
    `capacity = capacity_orig - arch_scale_thermal_pressure()` (i)) in
    util_fits_cpu(), i.e. in sis(), feec(), MF handling (update + load
    balance (lb))

(4) Allow MF lb in !OverUtilized (OU)

(1) CapInv wouldn't be needed in case only (ii) is used, i.e. we don't
    have to switch from (i) to (ii) anymore (because of (3))?

(2) is there since we don't have to raise OU anymore when `rq_util_min >
capacity_orig_thermal`. This is replaced by MF handling in !OU lb (4)?
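
To make sure I understand (2), the two cpu_overutilized() variants side
by side (the _a/_b suffixes are only for this mail):

/* (a) via util_fits_cpu() (the reverted patch, IIUC): a uclamp_min
 * misfit can raise OU */
static inline bool cpu_overutilized_a(int cpu)
{
	unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
	unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);

	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
}

/* (b) Vincent's version: only util vs capacity, plus uclamp_max can
 * hold OU off */
static inline bool cpu_overutilized_b(int cpu)
{
	unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);

	return !(fits_capacity(cpu_util_cfs(cpu), capacity_of(cpu)) ||
		 rq_util_max <= capacity_orig_of(cpu));
}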

[...]