Re: [PATCH v2] sched/fair: Age the average idle time
From: Mel Gorman
Date: Thu Jun 17 2021 - 05:40:44 EST
On Thu, Jun 17, 2021 at 10:30:09AM +0200, Vincent Guittot wrote:
> > > > I'm tempted to give it a go.. anyone object?
> > >
> > > Just finished running some tests on my large arm64 system.
> > > Tbench tests are a mixed between small gain and loss
> > >
> >
> > Same for tbench on three x86 machines I reran tests for
> >
> > <SNIP>
> >
> > For your arm machine, how many logical CPUs are online, what is the level
> > of SMT if any and is the machine NUMA?
>
> It's a SMT4 x 28 cores x 2 NUMA nodes = 224 CPUs
>
Ok, SMT4 is what I was interested in. I suspected this was the case but
was not sure. I wondered about the possibility that SMT4 should be
accounted for in the scan depth.
> >
> > Fundamentally though, as the changelog notes "due to the nature of the
> > patch, this is a regression magnet". There are going to be examples
> > where a deep search is better even if a machine is fully busy or
> > overloaded and examples where cutting off the search is better. I think
> > it's better to have an idle estimate that gets updated if CPUs are fully
> > busy even if it's not a universal win.
>
> Although I agree that using a stall average idle time value of local
> is not good, I'm not sure this proposal is better. The main problem is
> that we use the avg_idle of the local CPU to estimate how many times
> we should loop and try to find another idle CPU. But there is no
> direct relation between both.
This is true. The idle time of the local CPU is used to estimate the
idle time of the domain which is inevitably going to be inaccurate but
tracking idle time for the domain will be cache write intensive and
potentially very expensive. I think this was discussed before but maybe
it is my imaginaction.
> Typically, a short average idle time on
> the local CPU doesn't mean that there are less idle CPUs and that's
> why we have a mix a gain and loss
>
Can you evaluate if scanning proportional to cores helps if applied on
top? The patch below is a bit of pick&mix and has only seen a basic build
test with a distro config. While I will queue this, I don't expect it to
have an impact on SMT2.
--8--
sched/fair: Make select_idle_cpu() proportional to cores
From: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Instead of calculating how many (logical) CPUs to scan, compute how
many cores to scan.
This changes behaviour for anything !SMT2.
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
---
kernel/sched/core.c | 17 ++++++++++++-----
kernel/sched/fair.c | 11 +++++++++--
kernel/sched/sched.h | 2 ++
3 files changed, 23 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6a3fdb9f4380..1773e0707a5d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7846,11 +7846,18 @@ int sched_cpu_activate(unsigned int cpu)
balance_push_set(cpu, false);
#ifdef CONFIG_SCHED_SMT
- /*
- * When going up, increment the number of cores with SMT present.
- */
- if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
- static_branch_inc_cpuslocked(&sched_smt_present);
+ do {
+ int weight = cpumask_weight(cpu_smt_mask(cpu));
+
+ if (weight > sched_smt_weight)
+ sched_smt_weight = weight;
+
+ /*
+ * When going up, increment the number of cores with SMT present.
+ */
+ if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+ static_branch_inc_cpuslocked(&sched_smt_present);
+ } while (0);
#endif
set_cpu_active(cpu, true);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cc7d1144a356..4fc4e1f2eaae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6037,6 +6037,8 @@ static inline int __select_idle_cpu(int cpu)
DEFINE_STATIC_KEY_FALSE(sched_smt_present);
EXPORT_SYMBOL_GPL(sched_smt_present);
+int __read_mostly sched_smt_weight = 1;
+
static inline void set_idle_cores(int cpu, int val)
{
struct sched_domain_shared *sds;
@@ -6151,6 +6153,8 @@ static inline bool test_idle_cores(int cpu, bool def)
return def;
}
+#define sched_smt_weight 1
+
static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
{
return __select_idle_cpu(core);
@@ -6163,6 +6167,8 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
#endif /* CONFIG_SCHED_SMT */
+#define sis_min_cores 2
+
/*
* Scan the LLC domain for idle CPUs; this is dynamically regulated by
* comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6203,11 +6209,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
avg_cost = this_sd->avg_scan_cost + 1;
span_avg = sd->span_weight * avg_idle;
- if (span_avg > 4*avg_cost)
+ if (span_avg > sis_min_cores * avg_cost)
nr = div_u64(span_avg, avg_cost);
else
- nr = 4;
+ nr = sis_min_cores;
+ nr *= sched_smt_weight;
time = cpu_clock(this);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7bc20e5541cf..440a2bbc19d5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1119,6 +1119,8 @@ static inline bool is_migration_disabled(struct task_struct *p)
#ifdef CONFIG_SCHED_SMT
extern void __update_idle_core(struct rq *rq);
+extern int sched_smt_weight;
+
static inline void update_idle_core(struct rq *rq)
{
if (static_branch_unlikely(&sched_smt_present))