RE: [RFC PATCH v5 3/4] scheduler: scan idle cpu in cluster before scanning the whole llc

From: Song Bao Hua (Barry Song)
Date: Fri Mar 19 2021 - 17:40:20 EST




> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Sent: Friday, March 19, 2021 5:16 PM
> To: tim.c.chen@xxxxxxxxxxxxxxx; catalin.marinas@xxxxxxx; will@xxxxxxxxxx;
> rjw@xxxxxxxxxxxxx; vincent.guittot@xxxxxxxxxx; bp@xxxxxxxxx;
> tglx@xxxxxxxxxxxxx; mingo@xxxxxxxxxx; lenb@xxxxxxxxxx; peterz@xxxxxxxxxxxxx;
> dietmar.eggemann@xxxxxxx; rostedt@xxxxxxxxxxx; bsegall@xxxxxxxxxx;
> mgorman@xxxxxxx
> Cc: msys.mizuma@xxxxxxxxx; valentin.schneider@xxxxxxx;
> gregkh@xxxxxxxxxxxxxxxxxxx; Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>;
> juri.lelli@xxxxxxxxxx; mark.rutland@xxxxxxx; sudeep.holla@xxxxxxx;
> aubrey.li@xxxxxxxxxxxxxxx; linux-arm-kernel@xxxxxxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; linux-acpi@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx;
> xuwei (O) <xuwei5@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>;
> guodong.xu@xxxxxxxxxx; yangyicong <yangyicong@xxxxxxxxxx>; Liguozhu (Kenneth)
> <liguozhu@xxxxxxxxxxxxx>; linuxarm@xxxxxxxxxxxxx; hpa@xxxxxxxxx; Song Bao Hua
> (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>
> Subject: [RFC PATCH v5 3/4] scheduler: scan idle cpu in cluster before scanning
> the whole llc
>
> On kunpeng920, cpus within one cluster can communicate wit each other
> much faster than cpus across different clusters. A simple hackbench
> can prove that.
> hackbench running on 4 cpus in single one cluster and 4 cpus in
> different clusters shows a large contrast:
> (1) within a cluster:
> root@ubuntu:~# taskset -c 0,1,2,3 hackbench -p -T -l 20000 -g 1
> Running in threaded mode with 1 groups using 40 file descriptors each
> (== 40 tasks)
> Each sender will pass 20000 messages of 100 bytes
> Time: 4.285
>
> (2) across clusters:
> root@ubuntu:~# taskset -c 0,4,8,12 hackbench -p -T -l 20000 -g 1
> Running in threaded mode with 1 groups using 40 file descriptors each
> (== 40 tasks)
> Each sender will pass 20000 messages of 100 bytes
> Time: 5.524
>
> This inspires us to change the wake_affine path to scan cluster before
> scanning the whole LLC to try to gatter related tasks in one cluster,
> which is done by this patch.
>
> To evaluate the performance impact to related tasks talking with each
> other, we run the below hackbench with different -g parameter from 2
> to 14, for each different g, we run the command 10 times and get the
> average time:
> $ numactl -N 0 hackbench -p -T -l 20000 -g $1
>
> hackbench will report the time which is needed to complete a certain number
> of messages transmissions between a certain number of tasks, for example:
> $ numactl -N 0 hackbench -p -T -l 20000 -g 10
> Running in threaded mode with 10 groups using 40 file descriptors each
> (== 400 tasks)
> Each sender will pass 20000 messages of 100 bytes
>
> The below is the result of hackbench w/ and w/o cluster patch:
> g= 2 4 6 8 10 12 14
> w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
> w/ : 1.7881 3.7371 5.3301 6.9747 8.6909 9.9235 11.2608
>
> Obviously some recent commits have improved the hackbench. So the change
> in wake_affine path brings less increase on hackbench compared to what
> we got in RFC v4.
> And obviously it is much more tricky to leverage wake_affine compared to
> leveraging the scatter of tasks in the previous patch as load balance
> might pull tasks which have been compact in a cluster so alternative
> suggestions welcome.
>
> In order to figure out how many times cpu is picked from the cluster and
> how many times cpu is picked out of the cluster, a tracepoint for debug
> purpose is added in this patch. And an userspace bcc script to print the
> histogram of the result of select_idle_cpu():
> #!/usr/bin/python
> #
> # selectidlecpu.py select idle cpu histogram.
> #
> # A Ctrl-C will print the gathered histogram then exit.
> #
> # 18-March-2021 Barry Song Created this.
>
> from __future__ import print_function
> from bcc import BPF
> from time import sleep
>
> # load BPF program
> b = BPF(text="""
>
> BPF_HISTOGRAM(dist);
>
> TRACEPOINT_PROBE(sched, sched_select_idle_cpu)
> {
> u32 e;
> if (args->idle / 4 == args->target/4)
> e = 0; /* idle cpu from same cluster */

Oops here, as -1/4 = 1/4 = 2/4 = 3/4 = 0
So a part of -1 is put here(local cluster) incorrectly.

> else if (args->idle != -1)
> e = 1; /* idle cpu from different clusters */
> else
> e = 2; /* no idle cpu */
>
> dist.increment(e);
> return 0;
> }
> """)

Fixed it to:

TRACEPOINT_PROBE(sched, sched_select_idle_cpu)
{
u32 e;
if (args->idle == -1)
e = 2; /* no idle cpu */
else if (args->idle / 4 == args->target / 4)
e = 0; /* idle cpu from same cluster */
else
e = 1; /* idle cpu from different clusters */

dist.increment(e);
return 0;
}

>
> # header
> print("Tracing... Hit Ctrl-C to end.")
>
> # trace until Ctrl-C
> try:
> sleep(99999999)
> except KeyboardInterrupt:
> print()
>
> # output
>
> print("\nlinear histogram")
> print("~~~~~~~~~~~~~~~~")
> b["dist"].print_linear_hist("idle")
>
> Even while g=14 and the system is quite busy, we can see there are some
> chances idle cpu is picked from local cluster:
> linear histogram
> ~~~~~~~~~~~~~~
> idle : count distribution
> 0 : 15234281 |*********** |
> 1 : 18494 | |
> 2 : 53066152 |****************************************|
>
> 0: local cluster
> 1: out of the cluster
> 2: select_idle_cpu() returns -1

After fixing the script, the new histogram is like:
linear histogram
~~~~~~~~~~~~~~~~
idle : count distribution
0 : 2765930 |* |
1 : 68934 | |
2 : 77667475 |****************************************|

We get
Local cluster: 3.4358%
Out of cluster: 0.0856%
-1(no idle before nr becomes 0): 96.4785%

>
> Signed-off-by: Barry Song <song.bao.hua@xxxxxxxxxxxxx>
> ---
> include/trace/events/sched.h | 22 ++++++++++++++++++++++
> kernel/sched/fair.c | 32 +++++++++++++++++++++++++++++++-
> 2 files changed, 53 insertions(+), 1 deletion(-)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index cbe3e15..86608cf 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -136,6 +136,28 @@
> );
>
> /*
> + * Tracepoint for select_idle_cpu:
> + */
> +TRACE_EVENT(sched_select_idle_cpu,
> +
> + TP_PROTO(int target, int idle),
> +
> + TP_ARGS(target, idle),
> +
> + TP_STRUCT__entry(
> + __field( int, target )
> + __field( int, idle )
> + ),
> +
> + TP_fast_assign(
> + __entry->target = target;
> + __entry->idle = idle;
> + ),
> +
> + TP_printk("target=%d idle=%d", __entry->target, __entry->idle)
> +);
> +
> +/*
> * Tracepoint for waking up a task:
> */
> DECLARE_EVENT_CLASS(sched_wakeup_template,
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c92ad9f2..3892d42 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6150,7 +6150,12 @@ static int select_idle_cpu(struct task_struct *p, struct
> sched_domain *sd, int t
> if (!this_sd)
> return -1;
>
> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + if (!sched_cluster_active())
> + cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +#ifdef CONFIG_SCHED_CLUSTER
> + if (sched_cluster_active())
> + cpumask_and(cpus, cpu_cluster_mask(target), p->cpus_ptr);
> +#endif
>
> if (sched_feat(SIS_PROP) && !smt) {
> u64 avg_cost, avg_idle, span_avg;
> @@ -6171,6 +6176,29 @@ static int select_idle_cpu(struct task_struct *p, struct
> sched_domain *sd, int t
> time = cpu_clock(this);
> }
>
> +#ifdef CONFIG_SCHED_CLUSTER
> + if (sched_cluster_active()) {
> + for_each_cpu_wrap(cpu, cpus, target) {
> + if (smt) {
> + i = select_idle_core(p, cpu, cpus, &idle_cpu);
> + if ((unsigned int)i < nr_cpumask_bits)
> + return i;
> +
> + } else {
> + if (!--nr)
> + return -1;
> + idle_cpu = __select_idle_cpu(cpu);
> + if ((unsigned int)idle_cpu < nr_cpumask_bits) {
> + goto done;
> + }
> + }

BTW, if I return -1 here directly and don't fall back to LLC, I
can even get a better benchmark:

g= 2 4 6 8 10 12 14
w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
w/ : 1.7881 3.7371 5.3301 6.9747 8.6909 9.9235 11.2608
return -1: 1.8324 3.6140 5.1029 6.5016 8.1867 9.7559 10.7716

so it seems the wake-up path change is much more trivial to
get a real and good impact, comparing to the previous 2/4
patch in which we are only spreading tasks.

> + }
> +
> + cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + cpumask_andnot(cpus, cpus, cpu_cluster_mask(target));
> + }
> +#endif
> +
> for_each_cpu_wrap(cpu, cpus, target) {
> if (smt) {
> i = select_idle_core(p, cpu, cpus, &idle_cpu);
> @@ -6186,6 +6214,7 @@ static int select_idle_cpu(struct task_struct *p, struct
> sched_domain *sd, int t
> }
> }
>
> +done:
> if (smt)
> set_idle_cores(this, false);
>
> @@ -6324,6 +6353,7 @@ static int select_idle_sibling(struct task_struct *p,
> int prev, int target)
> return target;
>
> i = select_idle_cpu(p, sd, target);
> + trace_sched_select_idle_cpu(target, i);
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
> --
> 1.8.3.1

Thanks
Barry