Hi Abel,
On Fri, May 6, 2022 at 1:21 AM Abel Wu <wuyun.abel@xxxxxxxxxxxxx> wrote:
Try to improve searching efficiency of SIS by filtering out the
overloaded cpus, and as a result the more overloaded the system
is, the less cpus will be searched.

My understanding is that this patch aims to address the following
issue: what kind of CPUs should the SIS scan? And we also have
another patch[1] that approaches it from another angle: how many
CPUs should the SIS scan?
I assume the two directions could both help speed up the SIS process,
so I'm curious what the result would be with both patches applied, and
I plan to run your patch on my system too. Do you have any script that
I can leverage to launch the test?

The overloaded cpus are tracked through LLC shared domain. To
regulate accesses to the shared data, the update happens mainly
at the tick. But in order to make it more accurate, we also take
the task migrations into consideration during load balancing which
can be quite frequent due to short running workload causing newly-
idle. Since an overloaded runqueue requires at least 2 non-idle
tasks runnable, we could have more faith on the "frequent newly-
idle" case.
Benchmark
=========
Tests are done in an Intel(R) Xeon(R) Platinum 8260 CPU@2.40GHz
machine with 2 NUMA nodes each of which has 24 cores with SMT2
enabled, so 96 CPUs in total.
All of the benchmarks are done inside a normal cpu cgroup in a
clean environment with cpu turbo disabled.

I would recommend applying the following patch (queued for 5.19) if
the intel_pstate driver is loaded, because it seems that there is a
utilization calculation issue when turbo is disabled:
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 846bb3a78788..2216b24b6f84 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1322,6 +1322,7 @@ static ssize_t store_no_turbo(struct kobject *a, struct kobj_attribute *b,
mutex_unlock(&intel_pstate_limits_lock);
intel_pstate_update_policies();
+ arch_set_max_freq_ratio(global.no_turbo);
mutex_unlock(&intel_pstate_driver_lock);
--
2.25.1
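By the way, store_no_turbo() above is what runs when turbo is toggled
through /sys/devices/system/cpu/intel_pstate/no_turbo, so I assume that
is the knob used to disable turbo for the benchmarks; my reading of the
fix is that the extra arch_set_max_freq_ratio() call simply lets the
frequency-invariance code pick up the new maximum frequency right away
(please double check).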
[cut]
v3:
- removed sched-idle balance feature and focus on SIS
- take non-CFS tasks into consideration
- several fixes/improvement suggested by Josh Don
v2:
- several optimizations on sched-idle balancing
- ignore asym topos in can_migrate_task
- add more benchmarks including SIS efficiency
- re-organize patch as suggested by Mel
v1: https://lore.kernel.org/lkml/20220217154403.6497-5-wuyun.abel@xxxxxxxxxxxxx/
v2: https://lore.kernel.org/lkml/20220409135104.3733193-1-wuyun.abel@xxxxxxxxxxxxx/
Signed-off-by: Abel Wu <wuyun.abel@xxxxxxxxxxxxx>
---
include/linux/sched/topology.h | 12 ++++++++++
kernel/sched/core.c | 38 ++++++++++++++++++++++++++++++
kernel/sched/fair.c | 43 +++++++++++++++++++++++++++-------
kernel/sched/idle.c | 1 +
kernel/sched/sched.h | 4 ++++
kernel/sched/topology.c | 4 +++-
6 files changed, 92 insertions(+), 10 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 56cffe42abbc..95c7ad1e05b5 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -81,8 +81,20 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+
+ /*
+ * Tracking of the overloaded cpus can be heavy, so start
+ * a new cacheline to avoid false sharing.
+ */
Although we put the following items into a different cache line
compared to the above ones, is it possible that there is still cache
false sharing if CPU1 is reading nr_overloaded_cpus while CPU2 is
updating overloaded_cpus?
+ atomic_t nr_overloaded_cpus ____cacheline_aligned;

____cacheline_aligned seems to put nr_overloaded_cpus into the data
section, which seems to be unnecessary. Would
____cacheline_internodealigned_in_smp be more lightweight?
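For reference, include/linux/cache.h defines these attributes roughly as
below (quoting from memory, so the exact form may differ by config):

#define ____cacheline_aligned __attribute__((__aligned__(SMP_CACHE_BYTES)))

#define __cacheline_aligned \
	__attribute__((__aligned__(SMP_CACHE_BYTES), \
		       __section__(".data..cacheline_aligned")))

#ifdef CONFIG_SMP
#define ____cacheline_internodealigned_in_smp \
	__attribute__((__aligned__(1 << (INTERNODE_CACHE_SHIFT))))
#else
#define ____cacheline_internodealigned_in_smp
#endif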
+ unsigned long overloaded_cpus[]; /* Must be last */
};

[cut]
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d4bd299d67ab..79b4ff24faee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6323,7 +6323,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
- int i, cpu, idle_cpu = -1, nr = INT_MAX;
+ struct sched_domain_shared *sds = sd->shared;
+ int nr, nro, weight = sd->span_weight;
+ int i, cpu, idle_cpu = -1;
struct rq *this_rq = this_rq();
int this = smp_processor_id();
struct sched_domain *this_sd;
@@ -6333,7 +6335,23 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
if (!this_sd)
return -1;
+ nro = atomic_read(&sds->nr_overloaded_cpus);
+ if (nro == weight)
+ goto out;
+
+ nr = min_t(int, weight, p->nr_cpus_allowed);
+
+ /*
+ * It's unlikely to find an idle cpu if the system is under
+ * heavy pressure, so skip searching to save a few cycles
+ * and relieve cache traffic.
+ */
+ if (weight - nro < (nr >> 4) && !has_idle_core)
+ return -1;
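To put rough numbers on this for the test box above: assuming the LLC
spans a whole node (48 CPUs) and the task's affinity mask is
unrestricted, nr >> 4 is 3, so the search is skipped entirely once fewer
than 3 CPUs in the LLC are not overloaded (unless an idle core is known
to exist).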
In [1] we used util_avg to check if the domain is overloaded and quit
earlier, since util_avg would be more stable and contains historic
data. But I think nr_running in your patch could be used as a
complementary metric and added to update_idle_cpu_scan() in [1] IMO.
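To make that concrete, a toy combination could look like the sketch
below; nothing here is code from [1] or from this patch, and both the
name and the formula are made up:

/*
 * Made-up illustration: derive a scan depth from the LLC's summed
 * util_avg (in the spirit of [1]), then cap it by the number of
 * not-overloaded cpus tracked by this patch.
 */
static unsigned int sis_scan_depth_sketch(unsigned int llc_weight,
					  unsigned long sum_util,
					  unsigned int nr_overloaded)
{
	/* roughly how many cpus' worth of capacity is already in use */
	unsigned int busy = sum_util / SCHED_CAPACITY_SCALE;
	unsigned int nr_scan = llc_weight > busy ? llc_weight - busy : 0;

	/* cpus already running >= 2 non-idle tasks are not worth visiting */
	return min(nr_scan, llc_weight - nr_overloaded);
}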
+
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+ if (nro > 1)
+ cpumask_andnot(cpus, cpus, sdo_mask(sds));

If I understand correctly, this is the core of the optimization: SIS
filters out the busy cores. I wonder if it is possible to save historic
h_nr_running/idle_h_nr_running and use the average value (like the
calculation of avg_scan_cost)?
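For the averaging part, I was thinking of something in the spirit of
the sketch below, reusing the 1/8 decay of update_avg() in fair.c;
rq->avg_h_nr_running is a hypothetical field, not something this patch
adds:

/*
 * Made-up sketch: keep a decaying average of the runqueue's
 * h_nr_running so "overloaded" could be judged on recent history
 * rather than on the instantaneous value.
 */
static void update_avg_h_nr_running(struct rq *rq)
{
	/* shift for some fixed-point precision, h_nr_running is small */
	u64 sample = (u64)rq->cfs.h_nr_running << 10;
	s64 diff = sample - rq->avg_h_nr_running;

	rq->avg_h_nr_running += diff / 8;	/* same decay as update_avg() */
}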
[cut]
if (sched_feat(SIS_PROP) && !has_idle_core) {
u64 avg_cost, avg_idle, span_avg;
@@ -6354,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
avg_idle = this_rq->wake_avg_idle;
avg_cost = this_sd->avg_scan_cost + 1;
- span_avg = sd->span_weight * avg_idle;
+ span_avg = weight * avg_idle;
if (span_avg > 4*avg_cost)
nr = div_u64(span_avg, avg_cost);
else
[1] https://lore.kernel.org/lkml/20220428182442.659294-1-yu.c.chen@xxxxxxxxx/