Re: [RFC PATCH 3/3] sched/fair: Traverse cpufreq policies to detect capacity inversion

From: Lukasz Luba
Date: Tue Dec 13 2022 - 12:42:53 EST


Hi Qais,

I thought I could help with this issue.

On 12/12/22 18:43, Qais Yousef wrote:
> On 12/09/22 17:47, Vincent Guittot wrote:
> 
> [...]
> 
>>>> This patch loops on all cpufreq policy in sched softirq, how can this
>>>> be sane ? and not only in eas mode but also in the default asymmetric
>>>
>>> Hmm I'm still puzzled. Why it's not sane to do it here but it's okay to do it
>>> in the wake up path in feec()?
>>
>> feec() should be considered as an exception not as the default rule.
>> Thing like above which loops for_each on external subsystem should be
>> prevented and the fact that feec loops all PDs doesn't means that we
>> can put that everywhere else
> 
> Fair enough. But really understanding the root cause behind this limitation
> will be very helpful. I don't have the same appreciation of why this is
> a problem, and shedding more light will help me to think more about it in the
> future.
> 
>> Take the example of 1k cores with per cpu policy. Do you really think a
>> for_each_cpufreq_policy would be reasonable ?
> 
> Hmm I don't think such an HMP system makes sense to ever exist.
> 
> That system has to be a multi-socket system and I doubt inversion detection is
> something of value.
> 
> Point taken anyway. Let's find another way to do this.


Another way might be to use the 'update' code path for thermal
pressure, which sets this information source. That code path isn't as
hot as the one in the task scheduler. Furthermore, we would also have
time there to properly handle CPU hotplug callbacks.

So something like this is what I have in mind:

------------------------------8<-----------------------------
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index e7d6e6657ffa..7f372a93e21b 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -16,6 +16,7 @@
 #include <linux/sched/topology.h>
 #include <linux/cpuset.h>
 #include <linux/cpumask.h>
+#include <linux/mutex.h>
 #include <linux/init.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
@@ -153,6 +154,33 @@ void topology_set_freq_scale(const struct cpumask *cpus, unsigned long cur_freq,
 DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
 EXPORT_PER_CPU_SYMBOL_GPL(cpu_scale);
 
+static struct cpumask highest_capacity_mask;
+static unsigned int max_possible_capacity;
+static DEFINE_MUTEX(max_capacity_lock);
+
+static void max_capacity_update(const struct cpumask *cpus,
+				unsigned long capacity)
+{
+	mutex_lock(&max_capacity_lock);
+
+	if (max_possible_capacity < capacity) {
+		max_possible_capacity = capacity;
+
+		cpumask_clear(&highest_capacity_mask);
+
+		cpumask_or(&highest_capacity_mask,
+			   &highest_capacity_mask, cpus);
+	}
+
+	mutex_unlock(&max_capacity_lock);
+}
+
+bool topology_test_max_cpu_capacity(unsigned int cpu)
+{
+	return cpumask_test_cpu(cpu, &highest_capacity_mask);
+}
+EXPORT_SYMBOL_GPL(topology_test_max_cpu_capacity);
+
 void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
 {
 	per_cpu(cpu_scale, cpu) = capacity;
@@ -203,6 +231,8 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,
 
 	for_each_cpu(cpu, cpus)
 		WRITE_ONCE(per_cpu(thermal_pressure, cpu), th_pressure);
+
+	max_capacity_update(cpus, capacity);
 }
 EXPORT_SYMBOL_GPL(topology_update_thermal_pressure);


--------------------------->8--------------------------------
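
For illustration only (not part of the diff above), the consumer side in
the scheduler could then test the mask instead of walking cpufreq
policies; the wrapper name below is just a placeholder:

/*
 * Illustration only: a scheduler-side helper could replace the
 * for_each cpufreq policy walk with a plain mask test. The wrapper
 * name is a placeholder, not existing code.
 */
static inline bool cpu_is_highest_capacity(int cpu)
{
	/* Lockless read of the mask maintained by arch_topology. */
	return topology_test_max_cpu_capacity(cpu);
}

How the inversion condition itself would consume that is a separate
question; the point is only that the hot path no longer needs to
iterate cpufreq policies.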

We could use RCU if there is a potential to read racy data
while the updater modifies the mask in the meantime. The mutex is there
to serialize the thermal writers, which might be kicked for two
policies at the same time.
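
To make the RCU option concrete, here is a rough, untested sketch of
what the updater/reader pair could look like. It double-buffers the
mask and publishes the new one with rcu_assign_pointer(); all names
besides topology_test_max_cpu_capacity() are placeholders, so treat it
as a sketch of the idea rather than a proposal:

/* Rough sketch of the RCU alternative (placeholder names, untested). */
static struct cpumask capacity_masks[2];
static struct cpumask __rcu *highest_capacity_mask = &capacity_masks[0];
static unsigned int max_possible_capacity;
static DEFINE_MUTEX(max_capacity_lock);

static void max_capacity_update(const struct cpumask *cpus,
				unsigned long capacity)
{
	struct cpumask *old, *new;

	mutex_lock(&max_capacity_lock);

	if (max_possible_capacity < capacity) {
		max_possible_capacity = capacity;

		old = rcu_dereference_protected(highest_capacity_mask,
				lockdep_is_held(&max_capacity_lock));
		new = (old == &capacity_masks[0]) ? &capacity_masks[1]
						  : &capacity_masks[0];
		cpumask_copy(new, cpus);

		/* Publish the new mask, then wait out the old readers. */
		rcu_assign_pointer(highest_capacity_mask, new);
		synchronize_rcu();
	}

	mutex_unlock(&max_capacity_lock);
}

bool topology_test_max_cpu_capacity(unsigned int cpu)
{
	bool ret;

	rcu_read_lock();
	ret = cpumask_test_cpu(cpu, rcu_dereference(highest_capacity_mask));
	rcu_read_unlock();

	return ret;
}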

If you like, I can develop and test such code in arch_topology.c.

Regards,
Lukasz