Re: [PATCH v2 8/9] sched/fair: Detect capacity inversion
From: Qais Yousef
Date: Sat Nov 12 2022 - 14:35:35 EST
On 11/09/22 11:42, Dietmar Eggemann wrote:
[...]
> > + /*
> > + * Detect if the performance domain is in capacity inversion state.
> > + *
> > + * Capacity inversion happens when another perf domain with equal or
> > + * lower capacity_orig_of() ends up having higher capacity than this
> > + * domain after subtracting thermal pressure.
> > + *
> > + * We only take into account thermal pressure in this detection as it's
> > + * the only metric that actually results in *real* reduction of
> > + * capacity due to performance points (OPPs) being dropped/become
> > + * unreachable due to thermal throttling.
> > + *
> > + * We assume:
> > + * * That all cpus in a perf domain have the same capacity_orig
> > + * (same uArch).
> > + * * Thermal pressure will impact all cpus in this perf domain
> > + * equally.
> > + */
> > + if (static_branch_unlikely(&sched_asym_cpucapacity)) {
>
> This should be sched_energy_enabled(). Performance Domains (PDs) are an
> EAS thing.
Bummer. I had a version that used cpumasks only, but I thought using pds was
cleaner and would save unnecessary extra traversal. I missed that pds are
only available when sched_energy_enabled() is true.
This is not good news for CAS.
>
> > + unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
>
> rcu_read_lock()
>
> > + struct perf_domain *pd = rcu_dereference(rq->rd->pd);
>
> rcu_read_unlock()
Shouldn't we continue to hold it while traversing the pd too?
>
> It's called from build_sched_domains() too. I assume
> static_branch_unlikely(&sched_asym_cpucapacity) hides this issue so far.
>
> > +
> > + rq->cpu_capacity_inverted = 0;
> > +
> > + for (; pd; pd = pd->next) {
> > + struct cpumask *pd_span = perf_domain_span(pd);
> > + unsigned long pd_cap_orig, pd_cap;
> > +
> > + cpu = cpumask_any(pd_span);
> > + pd_cap_orig = arch_scale_cpu_capacity(cpu);
> > +
> > + if (capacity_orig < pd_cap_orig)
> > + continue;
> > +
> > + /*
> > + * handle the case of multiple perf domains have the
> > + * same capacity_orig but one of them is under higher
>
> Like I said above, I'm not aware of such an EAS system.
I did argue against that. But Vincent's PoV was that we shouldn't make
assumptions and should handle the case where each big core is in its own domain.
>
> > + * thermal pressure. We record it as capacity
> > + * inversion.
> > + */
> > + if (capacity_orig == pd_cap_orig) {
> > + pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
> > +
> > + if (pd_cap > inv_cap) {
> > + rq->cpu_capacity_inverted = inv_cap;
> > + break;
> > + }
>
> In case `capacity_orig == pd_cap_orig` and cpumask_test_cpu(cpu_of(rq),
> pd_span) the code can set rq->cpu_capacity_inverted = inv_cap
> erroneously since thermal_load_avg(rq) can return different values for
> inv_cap and pd_cap.
Good catch!
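To make the failure mode concrete, here is a rough userspace model of the
equal-capacity branch quoted above. It is only a sketch: struct model_cpu,
detect_inversion() and the skip_own_pd flag are hypothetical names made up for
illustration, not kernel code. Every CPU is treated as a possible result of
cpumask_any(pd_span), and the 'thermal' field stands in for
thermal_load_avg(cpu_rq(cpu)).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_cpu {
	int pd;                 /* id of the perf domain this CPU belongs to */
	unsigned long cap_orig; /* stands in for capacity_orig_of(cpu) */
	unsigned long thermal;  /* stands in for thermal_load_avg(cpu_rq(cpu)) */
};

/*
 * Return the value the quoted branch would store in
 * rq->cpu_capacity_inverted for CPU 'self': 0 when no inversion is
 * detected, inv_cap otherwise.
 *
 * Only the capacity_orig == pd_cap_orig branch is modelled.
 * skip_own_pd == 0 models the v2 behaviour, skip_own_pd != 0 models the
 * "can't be inverted against our own pd" check.
 * Assumes thermal <= cap_orig for every CPU (no unsigned underflow).
 */
static unsigned long detect_inversion(const struct model_cpu *cpus, size_t n,
				      size_t self, int skip_own_pd)
{
	unsigned long inv_cap;
	size_t i;

	assert(self < n);
	inv_cap = cpus[self].cap_orig - cpus[self].thermal;

	for (i = 0; i < n; i++) {
		unsigned long pd_cap;

		/* Proposed fix: never compare against our own perf domain. */
		if (skip_own_pd && cpus[i].pd == cpus[self].pd)
			continue;

		/* Higher-capacity domains cannot invert us. */
		if (cpus[self].cap_orig < cpus[i].cap_orig)
			continue;

		if (cpus[self].cap_orig == cpus[i].cap_orig) {
			pd_cap = cpus[i].cap_orig - cpus[i].thermal;
			if (pd_cap > inv_cap)
				return inv_cap;
		}
	}

	return 0;
}
```

With two littles (pd 0, capacity 512, no thermal load) and two bigs in the same
pd 1 (capacity 1024, thermal load 100 and 50), calling detect_inversion() for
the first big CPU returns 924 when skip_own_pd is 0, i.e. a false inversion
against its own sibling, and 0 when skip_own_pd is 1.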
>
> So even on a classical big little system, this condition can set
> rq->cpu_capacity_inverted for a CPU in the little or big cluster.
>
> thermal_load_avg(rq) would have to stay constant for all CPUs within the
> PD to avoid this.
>
> This is one example of the `thermal pressure` is per PD (or Frequency
> Domain) in Thermal but per-CPU in the task scheduler.
Only compile tested so far. Does the patch below address all your points? I
should get hardware soon to run some tests before sending the patch. I might
rewrite it to avoid using pds; it seems cleaner this way, but we'd miss CAS
support.
Thoughts?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 89dadaafc1ec..b01854984994 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8856,16 +8856,24 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
* * Thermal pressure will impact all cpus in this perf domain
* equally.
*/
- if (static_branch_unlikely(&sched_asym_cpucapacity)) {
- unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
- struct perf_domain *pd = rcu_dereference(rq->rd->pd);
+ if (sched_energy_enabled()) {
+ struct perf_domain *pd;
+ unsigned long inv_cap;
+
+ rcu_read_lock();
+ inv_cap = capacity_orig - thermal_load_avg(rq);
+ pd = rcu_dereference(rq->rd->pd);
rq->cpu_capacity_inverted = 0;
for (; pd; pd = pd->next) {
struct cpumask *pd_span = perf_domain_span(pd);
unsigned long pd_cap_orig, pd_cap;
+ /* We can't be inverted against our own pd */
+ if (cpumask_test_cpu(cpu_of(rq), pd_span))
+ continue;
+
cpu = cpumask_any(pd_span);
pd_cap_orig = arch_scale_cpu_capacity(cpu);
@@ -8890,6 +8898,8 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
break;
}
}
+
+ rcu_read_unlock();
}
Thanks!
--
Qais Yousef