Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

From: Mike Galbraith
Date: Tue May 16 2023 - 02:25:20 EST


On Tue, 2023-05-16 at 09:11 +0800, Chen Yu wrote:
> [Problem Statement]
>
...

> 20.26%    19.89%  [kernel.kallsyms]          [k] update_cfs_group
> 13.53%    12.15%  [kernel.kallsyms]          [k] update_load_avg

Yup, that's a serious problem, but...

> [Benchmark]
>
> The baseline is on sched/core branch on top of
> commit a6fcdd8d95f7 ("sched/debug: Correct printing for rq->nr_uninterruptible")
>
> Tested the will-it-scale context_switch1 case; it shows good
> improvement both on a server and on a desktop:
>
> Intel(R) Xeon(R) Platinum 8480+, Sapphire Rapids 2 x 56C/112T = 224 CPUs
> context_switch1_processes -s 100 -t 112 -n
> baseline                   SIS_PAIR
> 1.0                        +68.13%
>
> Intel Core(TM) i9-10980XE, Cascade Lake 18C/36T
> context_switch1_processes -s 100 -t 18 -n
> baseline                   SIS_PAIR
> 1.0                        +45.2%

git@homer: ./context_switch1_processes -s 100 -t 8 -n
(running in an autogroup)

PerfTop: 30853 irqs/sec kernel:96.8% exact: 96.8% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, 8 CPUs)
------------------------------------------------------------------------------------------------------------

5.72% [kernel] [k] switch_mm_irqs_off
4.23% [kernel] [k] __update_load_avg_se
3.76% [kernel] [k] __update_load_avg_cfs_rq
3.70% [kernel] [k] __schedule
3.65% [kernel] [k] entry_SYSCALL_64
3.22% [kernel] [k] enqueue_task_fair
2.91% [kernel] [k] update_curr
2.67% [kernel] [k] select_task_rq_fair
2.60% [kernel] [k] pipe_read
2.55% [kernel] [k] __switch_to
2.54% [kernel] [k] __calc_delta
2.44% [kernel] [k] dequeue_task_fair
2.38% [kernel] [k] reweight_entity
2.13% [kernel] [k] pipe_write
1.96% [kernel] [k] restore_fpregs_from_fpstate
1.93% [kernel] [k] select_idle_smt
1.77% [kernel] [k] update_load_avg <==
1.73% [kernel] [k] native_sched_clock
1.66% [kernel] [k] try_to_wake_up
1.52% [kernel] [k] _raw_spin_lock_irqsave
1.47% [kernel] [k] update_min_vruntime
1.42% [kernel] [k] update_cfs_group <==
1.36% [kernel] [k] vfs_write
1.32% [kernel] [k] prepare_to_wait_event

...not one with global scope. My little i7-4790 can play ping-pong all
day long, as can untold numbers of other boxen around the globe.
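(For the record, context_switch1 is at heart a pipe ping-pong. A minimal
userspace sketch of the pattern, in Python rather than the actual
will-it-scale C source:)

```python
import os

def ping_pong(iters=1000):
    """Two processes bounce one byte over a pipe pair, forcing a
    context switch per half round trip. Toy version of the
    will-it-scale context_switch1 pattern, not its real source."""
    a_r, a_w = os.pipe()    # parent -> child
    b_r, b_w = os.pipe()    # child -> parent
    pid = os.fork()
    if pid == 0:            # child: echo every byte straight back
        for _ in range(iters):
            os.write(b_w, os.read(a_r, 1))
        os._exit(0)
    for _ in range(iters):  # parent: ping, then block until pong
        os.write(a_w, b"x")
        os.read(b_r, 1)
    os.waitpid(pid, 0)
    return iters

if __name__ == "__main__":
    print("round trips:", ping_pong())
```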

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..e65028dcd6a6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7125,6 +7125,21 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>             asym_fits_cpu(task_util, util_min, util_max, target))
>                 return target;
>  
> +       /*
> +        * If the waker and the wakee frequently wake each other up,
> +        * putting them within the same SMT domain could reduce C2C
> +        * overhead. An idle SMT sibling should be preferred over the
> +        * wakee's previous CPU, because the latter could still carry
> +        * the risk of C2C overhead.
> +        */
> +       if (sched_feat(SIS_PAIR) && sched_smt_active() &&
> +           current->last_wakee == p && p->last_wakee == current) {
> +               i = select_idle_smt(p, smp_processor_id());
> +
> +               if ((unsigned int)i < nr_cpumask_bits)
> +                       return i;
> +       }
> +
>         /*
>          * If the previous CPU is cache affine and idle, don't be stupid:
>          */

Global-scope solutions for non-global issues tend not to work out.

Below is a sample of potential scaling wreckage for boxen that are NOT
akin to the one you're watching turn caches into silicon-based pudding.

Note the *_RR numbers. Those poked me in the eye because they closely
resemble pipe ping-pong: all fun and games, with about as close to zero
work other than scheduling as network-land can get, yet for my box, SMT
was the third-best option of three.

You just can't beat idle core selection when it comes to getting work
done, which is why SIS evolved to select cores first.
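(That ordering can be caricatured in a few lines. A deliberately
simplified model of the core-first preference, nothing like the kernel's
actual cpumask scan:)

```python
def pick_cpu(idle_cpus, smt_siblings):
    """Prefer a CPU on a fully idle core; fall back to any idle SMT
    thread. idle_cpus: set of idle CPU ids; smt_siblings: cpu -> list
    of all CPUs on that cpu's core. Toy model, not kernel code."""
    for cpu in sorted(idle_cpus):
        if all(s in idle_cpus for s in smt_siblings[cpu]):
            return cpu          # whole core idle: best for throughput
    return min(idle_cpus, default=None)  # merely an idle SMT thread

# 2 cores x 2 SMT: CPUs 0/2 share core 0, CPUs 1/3 share core 1
siblings = {0: [0, 2], 2: [0, 2], 1: [1, 3], 3: [1, 3]}
assert pick_cpu({1, 2, 3}, siblings) == 1   # core 1 is fully idle
assert pick_cpu({2, 3}, siblings) == 2      # no idle core; idle thread
```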

Your box and ilk need help that treats the disease and not the symptom,
or barring that, help that precisely targets boxen having the disease.

-Mike

10 seconds of 1 netperf client/server instance, no knobs twiddled.

TCP_SENDFILE-1 stacked Avg: 65387
TCP_SENDFILE-1 cross-smt Avg: 65658
TCP_SENDFILE-1 cross-core Avg: 96318

TCP_STREAM-1 stacked Avg: 44322
TCP_STREAM-1 cross-smt Avg: 42390
TCP_STREAM-1 cross-core Avg: 77850

TCP_MAERTS-1 stacked Avg: 36636
TCP_MAERTS-1 cross-smt Avg: 42333
TCP_MAERTS-1 cross-core Avg: 74122

UDP_STREAM-1 stacked Avg: 52618
UDP_STREAM-1 cross-smt Avg: 55298
UDP_STREAM-1 cross-core Avg: 97415

TCP_RR-1 stacked Avg: 242606
TCP_RR-1 cross-smt Avg: 140863
TCP_RR-1 cross-core Avg: 219400

UDP_RR-1 stacked Avg: 282253
UDP_RR-1 cross-smt Avg: 202062
UDP_RR-1 cross-core Avg: 288620
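(The three placements were forced by pinning the two netperf ends; the
same can be done generically with CPU affinity. A sketch using Linux
sched_setaffinity, where the CPU numbers are assumptions for a
particular topology: stacked = both ends on one CPU, cross-smt = the
SMT sibling, cross-core = a CPU on another core.)

```python
import os

def pin(pid, cpus):
    """Pin a process to the given CPU set and return the result
    (Linux-only; pid 0 means the calling process)."""
    os.sched_setaffinity(pid, cpus)
    return os.sched_getaffinity(pid)

# Pin ourselves to CPU 0, as one would pin the netserver end; the
# client end would then get {0} for "stacked", the SMT sibling CPU
# for "cross-smt", or a CPU on another core for "cross-core".
print(pin(0, {0}))
```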