Re: [PATCH v3] sched/fair: bring back select_idle_smt, but differently

From: Vincent Guittot
Date: Tue Apr 06 2021 - 11:10:36 EST


On Fri, 26 Mar 2021 at 20:19, Rik van Riel <riel@xxxxxxxxxxx> wrote:
>
> On Mon, 22 Mar 2021 11:03:06 +0000
> Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
>
> > Second, select_idle_smt() does not use the cpus mask so consider moving
> > the cpus initialisation after select_idle_smt() has been called.
> > Specifically this initialisation
> >
> > cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >
> > Alternatively, clear the bits in the SMT sibling scan to avoid checking
> > the siblings twice. It's a tradeoff because initialising and clearing
> > bits is not free and the cost is wasted if a sibling is free.
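
(For what it's worth, that second option would basically be one extra line
after the select_idle_smt() probe, something like

	cpumask_andnot(cpus, cpus, cpu_smt_mask(prev));

so the main scan does not revisit prev's siblings -- untested, just to
illustrate the idea.)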
>
> I tried a number of different variations on moving the CPU mask
> initialization, and clearing CPUs from the mask, and failed to
> get any clear results from those in testing, even in workloads
> with lots of context switches.
>
> Below is a simple version that seems to perform identically to
> more complicated versions :)
>
> ---8<---
> sched,fair: bring back select_idle_smt, but differently
>
> Mel Gorman did some nice work in 9fe1f127b913
> ("sched/fair: Merge select_idle_core/cpu()"), resulting in the kernel
> being more efficient at finding an idle CPU, and in tasks spending less
> time waiting to be run, both according to the schedstats run_delay
> numbers, and according to measured application latencies. Yay.
>
> The flip side of this is that we see more task migrations (about
> 30% more), higher cache misses, higher memory bandwidth utilization,
> and higher CPU use, for the same number of requests/second.
>
> This is most pronounced on a memcache type workload, which saw
> a consistent 1-3% increase in total CPU use on the system, due
> to those increased task migrations leading to higher L2 cache
> miss numbers, and higher memory utilization. The exclusive L3
> cache on Skylake does us no favors there.
>
> On our web serving workload, that effect is usually negligible.
>
> It appears that the increased number of CPU migrations is generally
> a good thing, since it leads to lower cpu_delay numbers, reflecting
> the fact that tasks get to run faster. However, the reduced locality
> and the corresponding increase in L2 cache misses hurt a little.
>
> The patch below appears to fix the regression, while keeping the
> benefit of the lower cpu_delay numbers, by reintroducing select_idle_smt
> with a twist: when a socket has no idle cores, check to see if the
> sibling of "prev" is idle, before searching all the other CPUs.
>
> This fixes both the occasional 9% regression on the web serving
> workload, and the continuous 2% CPU use regression on the memcache
> type workload.
>
> With Mel's patches and this patch together, task migrations are still
> high, but L2 cache misses, memory bandwidth, and CPU time used are back
> down to what they were before. The p95 and p99 response times for the
> memcache type application improve by about 10% over what they were
> before Mel's patches got merged.
>
> Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
> ---
> kernel/sched/fair.c | 68 ++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 52 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 794c2cb945f8..69680158963f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6098,6 +6098,28 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
> return -1;
> }
>
> +/*
> + * Scan the local SMT mask for idle CPUs.
> + */
> +static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> +	int cpu;
> +
> +	if (!static_branch_likely(&sched_smt_present))
> +		return -1;
> +
> +	for_each_cpu(cpu, cpu_smt_mask(target)) {
> +		if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
> +		    !cpumask_test_cpu(cpu, sched_domain_span(sd)))
> +			continue;
> +		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> +			return cpu;
> +	}
> +
> +	return -1;
> +}
> +
> #else /* CONFIG_SCHED_SMT */
>
> static inline void set_idle_cores(int cpu, int val)
> @@ -6114,6 +6136,11 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
> return __select_idle_cpu(core);
> }
>
> +static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> +	return -1;
> +}
> +
> #endif /* CONFIG_SCHED_SMT */
>
> /*
> @@ -6121,7 +6148,7 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
> * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
> * average idle time for this rq (as found in rq->avg_idle).
> */
> -static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
> +static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int prev, int target)
> {
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> int i, cpu, idle_cpu = -1, nr = INT_MAX;
> @@ -6136,23 +6163,32 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>
> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>
> - if (sched_feat(SIS_PROP) && !smt) {
> - u64 avg_cost, avg_idle, span_avg;
> + if (!smt) {
> + if (cpus_share_cache(prev, target)) {

Have you checked the impact on non-SMT systems? It would be worth a static branch.
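
Something along these lines (untested), reusing the sched_smt_present
key that select_idle_smt() already tests internally, so that !SMT
machines skip the whole probe:

	if (static_branch_likely(&sched_smt_present) &&
	    cpus_share_cache(prev, target)) {
		/* No idle core. Check if prev has an idle sibling. */
		i = select_idle_smt(p, sd, prev);
		if ((unsigned int)i < nr_cpumask_bits)
			return i;
	}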

Also, this doesn't need to be in select_idle_cpu(), which aims to loop
over the sched_domain, because you only compare target and prev. So you
can move this call to select_idle_smt() into select_idle_sibling(), as
part of all the other tests done on prev/target/recent_used_cpu before
calling select_idle_cpu(), and even save a useless sd =
rcu_dereference(per_cpu(sd_llc, target));
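
i.e. something like the below (untested sketch; it assumes
select_idle_smt() can drop the sd argument and the sched_domain_span()
check once prev is known to share its cache with target, and it reuses
test_idle_cores() for the "no idle core" condition):

	/* In select_idle_sibling(), next to the other prev/target checks: */
	if (static_branch_likely(&sched_smt_present) &&
	    !test_idle_cores(target, false) &&
	    cpus_share_cache(prev, target)) {
		i = select_idle_smt(p, prev);
		if ((unsigned int)i < nr_cpumask_bits)
			return i;
	}

	sd = rcu_dereference(per_cpu(sd_llc, target));
	if (!sd)
		return target;

	i = select_idle_cpu(p, sd, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;

That way select_idle_cpu() keeps its current signature and, when the
sibling of prev is idle, we return before even looking up sd_llc.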


> + /* No idle core. Check if prev has an idle sibling. */
> + i = select_idle_smt(p, sd, prev);
> + if ((unsigned int)i < nr_cpumask_bits)
> + return i;
> + }
>
> - /*
> - * Due to large variance we need a large fuzz factor;
> - * hackbench in particularly is sensitive here.
> - */
> - avg_idle = this_rq()->avg_idle / 512;
> - avg_cost = this_sd->avg_scan_cost + 1;
> + if (sched_feat(SIS_PROP)) {
> + u64 avg_cost, avg_idle, span_avg;
>
> - span_avg = sd->span_weight * avg_idle;
> - if (span_avg > 4*avg_cost)
> - nr = div_u64(span_avg, avg_cost);
> - else
> - nr = 4;
> + /*
> + * Due to large variance we need a large fuzz factor;
> + * hackbench in particularly is sensitive here.
> + */
> + avg_idle = this_rq()->avg_idle / 512;
> + avg_cost = this_sd->avg_scan_cost + 1;
>
> - time = cpu_clock(this);
> + span_avg = sd->span_weight * avg_idle;
> + if (span_avg > 4*avg_cost)
> + nr = div_u64(span_avg, avg_cost);
> + else
> + nr = 4;
> +
> + time = cpu_clock(this);
> + }
> }
>
> for_each_cpu_wrap(cpu, cpus, target) {
> @@ -6307,7 +6343,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> if (!sd)
> return target;
>
> - i = select_idle_cpu(p, sd, target);
> + i = select_idle_cpu(p, sd, prev, target);
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
> --
> 2.25.4
>
>