Re: [PATCH v2] sched/fair: Age the average idle time

From: Vincent Guittot
Date: Thu Jun 17 2021 - 06:03:14 EST


On Thu, 17 Jun 2021 at 11:40, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Thu, Jun 17, 2021 at 10:30:09AM +0200, Vincent Guittot wrote:
> > > > > I'm tempted to give it a go.. anyone object?
> > > >
> > > > Just finished running some tests on my large arm64 system.
> > > > Tbench tests are a mixed between small gain and loss
> > > >
> > >
> > > Same for tbench on three x86 machines I reran tests for
> > >
> > > <SNIP>
> > >
> > > For your arm machine, how many logical CPUs are online, what is the level
> > > of SMT if any and is the machine NUMA?
> >
> > It's a SMT4 x 28 cores x 2 NUMA nodes = 224 CPUs
> >
>
> Ok, SMT4 is what I was interested in. I suspected this was the case but
> was not sure. I wondered about the possibility that SMT4 should be
> accounted for in the scan depth.
>
> > >
> > > Fundamentally though, as the changelog notes "due to the nature of the
> > > patch, this is a regression magnet". There are going to be examples
> > > where a deep search is better even if a machine is fully busy or
> > > overloaded and examples where cutting off the search is better. I think
> > > it's better to have an idle estimate that gets updated if CPUs are fully
> > > busy even if it's not a universal win.
> >
> > Although I agree that using a stall average idle time value of local
> > is not good, I'm not sure this proposal is better. The main problem is
> > that we use the avg_idle of the local CPU to estimate how many times
> > we should loop and try to find another idle CPU. But there is no
> > direct relation between both.
>
> This is true. The idle time of the local CPU is used to estimate the
> idle time of the domain which is inevitably going to be inaccurate but

I'm more and more convinced that using average idle time (of the
local cpu or the full domain) is not the right metric. In
select_idle_cpu(), we looks for an idle CPU but we don't care about
how long it will be idle. Even more, we can scan all CPUs whatever the
avg idle time if there is a chance that there is an idle core.

> tracking idle time for the domain will be cache write intensive and
> potentially very expensive. I think this was discussed before but maybe
> it is my imaginaction.
>
> > Typically, a short average idle time on
> > the local CPU doesn't mean that there are less idle CPUs and that's
> > why we have a mix a gain and loss
> >
>
> Can you evaluate if scanning proportional to cores helps if applied on
> top? The patch below is a bit of pick&mix and has only seen a basic build

I will queue it for some test later today

> test with a distro config. While I will queue this, I don't expect it to
> have an impact on SMT2.
>
> --8--
> sched/fair: Make select_idle_cpu() proportional to cores
>
> From: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
>
> Instead of calculating how many (logical) CPUs to scan, compute how
> many cores to scan.
>
> This changes behaviour for anything !SMT2.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> ---
> kernel/sched/core.c | 17 ++++++++++++-----
> kernel/sched/fair.c | 11 +++++++++--
> kernel/sched/sched.h | 2 ++
> 3 files changed, 23 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6a3fdb9f4380..1773e0707a5d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7846,11 +7846,18 @@ int sched_cpu_activate(unsigned int cpu)
> balance_push_set(cpu, false);
>
> #ifdef CONFIG_SCHED_SMT
> - /*
> - * When going up, increment the number of cores with SMT present.
> - */
> - if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
> - static_branch_inc_cpuslocked(&sched_smt_present);
> + do {
> + int weight = cpumask_weight(cpu_smt_mask(cpu));
> +
> + if (weight > sched_smt_weight)
> + sched_smt_weight = weight;
> +
> + /*
> + * When going up, increment the number of cores with SMT present.
> + */
> + if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
> + static_branch_inc_cpuslocked(&sched_smt_present);
> + } while (0);
> #endif
> set_cpu_active(cpu, true);
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cc7d1144a356..4fc4e1f2eaae 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6037,6 +6037,8 @@ static inline int __select_idle_cpu(int cpu)
> DEFINE_STATIC_KEY_FALSE(sched_smt_present);
> EXPORT_SYMBOL_GPL(sched_smt_present);
>
> +int __read_mostly sched_smt_weight = 1;
> +
> static inline void set_idle_cores(int cpu, int val)
> {
> struct sched_domain_shared *sds;
> @@ -6151,6 +6153,8 @@ static inline bool test_idle_cores(int cpu, bool def)
> return def;
> }
>
> +#define sched_smt_weight 1
> +
> static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
> {
> return __select_idle_cpu(core);
> @@ -6163,6 +6167,8 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
>
> #endif /* CONFIG_SCHED_SMT */
>
> +#define sis_min_cores 2
> +
> /*
> * Scan the LLC domain for idle CPUs; this is dynamically regulated by
> * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
> @@ -6203,11 +6209,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> avg_cost = this_sd->avg_scan_cost + 1;
>
> span_avg = sd->span_weight * avg_idle;
> - if (span_avg > 4*avg_cost)
> + if (span_avg > sis_min_cores * avg_cost)
> nr = div_u64(span_avg, avg_cost);
> else
> - nr = 4;
> + nr = sis_min_cores;
>
> + nr *= sched_smt_weight;
> time = cpu_clock(this);
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 7bc20e5541cf..440a2bbc19d5 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1119,6 +1119,8 @@ static inline bool is_migration_disabled(struct task_struct *p)
> #ifdef CONFIG_SCHED_SMT
> extern void __update_idle_core(struct rq *rq);
>
> +extern int sched_smt_weight;
> +
> static inline void update_idle_core(struct rq *rq)
> {
> if (static_branch_unlikely(&sched_smt_present))