Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2

From: Vincent Guittot
Date: Mon Dec 23 2019 - 08:31:57 EST


On Fri, 20 Dec 2019 at 09:42, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> Changelog since V1
> o Alter code flow vincent.guittot
> o Use idle CPUs for comparison instead of sum_nr_running vincent.guittot
> o Note that the division is still in place. Without it and taking
> imbalance_adj into account before the cutoff, two NUMA domains
> do not converge as being equally balanced when the number of
> busy tasks equals the size of one domain (50% of the sum).
> Some data is in the changelog.
>
> The CPU load balancer balances between different domains to spread load
> and strives to have equal balance everywhere. Communicating tasks can
> migrate so they are topologically close to each other but these decisions
> are independent. On a lightly loaded NUMA machine, two communicating tasks
> pulled together at wakeup time can be pushed apart by the load balancer.
> In isolation, the load balancer decision is fine but it ignores the tasks'
> data locality and the wakeup/LB paths continually conflict. NUMA balancing
> is also a factor but it simply conflicts with the load balancer as well.
>
> This patch allows a degree of imbalance to exist between NUMA domains
> based on the imbalance_pct defined by the scheduler domain. This slight
> imbalance is allowed until the scheduler domain reaches almost 50%
> utilisation, at which point other factors like HT utilisation and memory
> bandwidth come into play. While not commented upon in the code, the cutoff
> is important for memory-bound parallelised non-communicating workloads
> that do not fully utilise the entire machine. This is not necessarily the
> best universal cut-off point but it appeared appropriate for a variety
> of workloads and machines.
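
To make the arithmetic concrete, here is a small standalone sketch of the
allowance and the cutoff as I read them from the patch below. The 24-CPU
node and the NUMA default imbalance_pct of 125 are assumptions picked for
illustration, not values taken from the patch:

#include <stdio.h>

/* Same arithmetic as the patch, lifted out of the kernel for illustration. */
static unsigned int imb_adj(unsigned int group_weight, unsigned int imbalance_pct)
{
	unsigned int adj = (group_weight * (imbalance_pct - 100) / 100) >> 1;

	return adj > 2 ? adj : 2;	/* minimum of two tasks/idle CPUs */
}

int main(void)
{
	unsigned int weight = 24;	/* hypothetical CPUs per NUMA node */
	unsigned int pct = 125;		/* assumed NUMA imbalance_pct */
	unsigned int adj = imb_adj(weight, pct);	/* max(2, (24 * 25 / 100) >> 1) = 3 */
	unsigned int cutoff = (weight >> 1) + adj;	/* 12 + 3 = 15 */

	/*
	 * An imbalance of up to adj tasks is only ignored while the busiest
	 * node still has at least cutoff idle CPUs, i.e. roughly until it
	 * approaches half utilisation.
	 */
	printf("imbalance_adj=%u imbalance_max=%u\n", adj, cutoff);
	return 0;
}

So with these assumed numbers, an imbalance of up to 3 tasks would be
tolerated while the busiest node still has 15 or more idle CPUs.
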
>
> The most obvious impact is on netperf TCP_STREAM -- two simple
> communicating tasks with some softirq offloaded depending on the
> transmission rate.
>
> 2-socket Haswell machine 48 core, HT enabled
> netperf-tcp -- mmtests config config-network-netperf-unbound
> baseline lbnuma-v1
> Hmean 64 666.68 ( 0.00%) 667.31 ( 0.09%)
> Hmean 128 1276.18 ( 0.00%) 1288.92 * 1.00%*
> Hmean 256 2366.78 ( 0.00%) 2422.22 * 2.34%*
> Hmean 1024 8123.94 ( 0.00%) 8464.15 * 4.19%*
> Hmean 2048 12962.45 ( 0.00%) 13693.79 * 5.64%*
> Hmean 3312 17709.24 ( 0.00%) 17494.23 ( -1.21%)
> Hmean 4096 19756.01 ( 0.00%) 19472.58 ( -1.43%)
> Hmean 8192 27469.59 ( 0.00%) 27787.32 ( 1.16%)
> Hmean 16384 30062.82 ( 0.00%) 30657.62 * 1.98%*
> Stddev 64 2.64 ( 0.00%) 2.09 ( 20.76%)
> Stddev 128 6.22 ( 0.00%) 6.48 ( -4.28%)
> Stddev 256 9.75 ( 0.00%) 22.85 (-134.30%)
> Stddev 1024 69.62 ( 0.00%) 58.41 ( 16.11%)
> Stddev 2048 72.73 ( 0.00%) 83.47 ( -14.77%)
> Stddev 3312 412.35 ( 0.00%) 75.77 ( 81.63%)
> Stddev 4096 345.02 ( 0.00%) 297.01 ( 13.91%)
> Stddev 8192 280.09 ( 0.00%) 485.36 ( -73.29%)
> Stddev 16384 452.99 ( 0.00%) 250.21 ( 44.76%)
>
> Fairly small impact on average performance but note how much the standard
> deviation is reduced in many cases. A clearer story is visible from the
> NUMA Balancing stats
>
> baseline lbnuma-v1
> Ops NUMA base-page range updates 21596.00 282.00
> Ops NUMA PTE updates 21596.00 282.00
> Ops NUMA PMD updates 0.00 0.00
> Ops NUMA hint faults 17786.00 137.00
> Ops NUMA hint local faults % 9916.00 137.00
> Ops NUMA hint local percent 55.75 100.00
> Ops NUMA pages migrated 4231.00 0.00
>
> Without the patch, only 55.75% of sampled accesses are local.
> With the patch, 100% of sampled accesses are local. A 2-socket
> Broadwell showed better results on average but they are not presented
> for brevity. The patch holds up for 4-socket boxes as well.
>
> 4-socket Haswell machine, 144 core, HT enabled
> netperf-tcp
>
> baseline lbnuma-v1
> Hmean 64 953.51 ( 0.00%) 977.27 * 2.49%*
> Hmean 128 1826.48 ( 0.00%) 1863.37 * 2.02%*
> Hmean 256 3295.19 ( 0.00%) 3329.37 ( 1.04%)
> Hmean 1024 10915.40 ( 0.00%) 11339.60 * 3.89%*
> Hmean 2048 17833.82 ( 0.00%) 19066.12 * 6.91%*
> Hmean 3312 22690.72 ( 0.00%) 24048.92 * 5.99%*
> Hmean 4096 24422.23 ( 0.00%) 26606.60 * 8.94%*
> Hmean 8192 31250.11 ( 0.00%) 33374.62 * 6.80%*
> Hmean 16384 37033.70 ( 0.00%) 38684.28 * 4.46%*
> Hmean 16384 37033.70 ( 0.00%) 38732.22 * 4.59%*
>
> On this machine, sampled accesses were 58.11% local with the baseline
> and 100% local with the patch. Similarly, the patch holds up
> for 2-socket machines with multiple L3 caches such as the AMD EPYC 2
>
> 2-socket EPYC-2 machine, 256 cores
> netperf-tcp
> baseline lbnuma-v1
> Hmean 64 1564.63 ( 0.00%) 1550.59 ( -0.90%)
> Hmean 128 3028.83 ( 0.00%) 3030.48 ( 0.05%)
> Hmean 256 5733.47 ( 0.00%) 5769.51 ( 0.63%)
> Hmean 1024 18936.04 ( 0.00%) 19216.15 * 1.48%*
> Hmean 2048 27589.77 ( 0.00%) 28200.45 * 2.21%*
> Hmean 3312 35361.97 ( 0.00%) 35881.94 * 1.47%*
> Hmean 4096 37965.59 ( 0.00%) 38702.01 * 1.94%*
> Hmean 8192 48499.92 ( 0.00%) 49530.62 * 2.13%*
> Hmean 16384 54249.96 ( 0.00%) 55937.24 * 3.11%*
>
> For amusement purposes, here are two graphs showing CPU utilisation on
> the 2-socket Haswell machine over time, based on mpstat, with the CPUs
> ordered by topology.
>
> http://www.skynet.ie/~mel/postings/lbnuma-20191218/netperf-tcp-mpstat-baseline.png
> http://www.skynet.ie/~mel/postings/lbnuma-20191218/netperf-tcp-mpstat-lbnuma-v1r1.png
>
> The lines on the left match up CPUs that are HT siblings or on the same
> node. The machine has only one L3 cache per NUMA node, otherwise that would
> also be shown. It should be very clear from the images that the baseline
> kernel spread the load with lighter utilisation across nodes while the
> patched kernel had heavy utilisation of fewer CPUs on one node.
>
> Hackbench generally shows good results across machines with some
> differences depending on whether processes or threads are used as well as
> pipes or sockets. This is the *worst* result from the 2-socket Haswell
> machine
>
> 2-socket Haswell machine 48 core, HT enabled
> hackbench-process-pipes -- mmtests config config-scheduler-unbound
> 5.5.0-rc1 5.5.0-rc1
> baseline lbnuma-v1
> Amean 1 1.2580 ( 0.00%) 1.2393 ( 1.48%)
> Amean 4 5.3293 ( 0.00%) 5.2683 * 1.14%*
> Amean 7 8.9067 ( 0.00%) 8.7130 * 2.17%*
> Amean 12 14.9577 ( 0.00%) 14.5773 * 2.54%*
> Amean 21 25.9570 ( 0.00%) 25.6657 * 1.12%*
> Amean 30 37.7287 ( 0.00%) 37.1277 * 1.59%*
> Amean 48 61.6757 ( 0.00%) 60.0433 * 2.65%*
> Amean 79 100.4740 ( 0.00%) 98.4507 ( 2.01%)
> Amean 110 141.2450 ( 0.00%) 136.8900 * 3.08%*
> Amean 141 179.7747 ( 0.00%) 174.5110 * 2.93%*
> Amean 172 221.0700 ( 0.00%) 214.7857 * 2.84%*
> Amean 192 245.2007 ( 0.00%) 238.3680 * 2.79%*
>
> An earlier prototype of the patch showed major regressions for NAS C-class
> when running with only half of the available CPUs -- 20-30% performance
> hits were measured at the time. With this version of the patch, the impact
> is marginal. In this case, the patch is lbnuma-v2 whereas nodivide is a
> patch discussed during review that avoids a divide by putting the cutoff
> at exactly 50% instead of accounting for imbalance_adj.
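
To make the distinction concrete, here is my reading of the two cutoffs
being compared, paraphrased from the description rather than lifted from
the actual patches:

#include <stdio.h>
#include <stdbool.h>

/* Paraphrase of the nodivide cutoff: exactly half of the busiest group,
 * so no imbalance_adj (and no divide) is needed before the check. */
static bool ignore_imbalance_nodivide(unsigned int group_weight,
				      unsigned int idle_cpus)
{
	return idle_cpus >= (group_weight >> 1);
}

/* Paraphrase of the lbnuma cutoff: pushed slightly past half by
 * imbalance_adj so that, as the changelog notes, two domains still
 * converge as balanced when the number of busy tasks is roughly the
 * size of one node. */
static bool ignore_imbalance_lbnuma(unsigned int group_weight,
				    unsigned int idle_cpus,
				    unsigned int imbalance_adj)
{
	return idle_cpus >= (group_weight >> 1) + imbalance_adj;
}

int main(void)
{
	/* Hypothetical 24-CPU node with 13 idle CPUs and imbalance_adj = 3:
	 * nodivide would still ignore the imbalance, lbnuma would not. */
	printf("nodivide=%d lbnuma=%d\n",
	       ignore_imbalance_nodivide(24, 13),
	       ignore_imbalance_lbnuma(24, 13, 3));
	return 0;
}
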
>
> NAS-C class OMP -- mmtests config hpc-nas-c-class-omp-half
> baseline nodivide lbnuma-v1
> Amean bt.C 64.29 ( 0.00%) 76.33 * -18.72%* 69.55 * -8.17%*
> Amean cg.C 26.33 ( 0.00%) 26.26 ( 0.27%) 26.36 ( -0.11%)
> Amean ep.C 10.26 ( 0.00%) 10.29 ( -0.31%) 10.26 ( -0.04%)
> Amean ft.C 17.98 ( 0.00%) 19.73 * -9.71%* 19.51 * -8.52%*
> Amean is.C 0.99 ( 0.00%) 0.99 ( 0.40%) 0.99 ( 0.00%)
> Amean lu.C 51.72 ( 0.00%) 48.57 ( 6.09%) 48.68 * 5.88%*
> Amean mg.C 8.12 ( 0.00%) 8.27 ( -1.82%) 8.24 ( -1.50%)
> Amean sp.C 82.76 ( 0.00%) 86.06 * -3.99%* 83.42 ( -0.80%)
> Amean ua.C 58.64 ( 0.00%) 57.66 ( 1.67%) 57.79 ( 1.45%)
>
> There is some impact, but also a degree of variability, and the ones
> showing impact are mainly workloads that are mostly parallelised
> and communicate infrequently between tasks. It's a corner case where
> the workload benefits heavily from spreading wide and early, which is
> not common. This is intended to illustrate the worst case measured.
>
> In general, the patch simply seeks to avoid unnecessary cross-node
> migrations when a machine is lightly loaded but shows benefits for other
> workloads. While tests are still running, so far it seems to benefit
> light-utilisation smaller workloads on large machines and does not appear
> to do any harm to larger or parallelised workloads.
>
> [valentin.schneider@xxxxxxx: Reformat code flow, correct comment, use idle_cpus]
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> ---
> kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++----
> 1 file changed, 33 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 08a233e97a01..60a780e1420e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8637,10 +8637,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> /*
> * Try to use spare capacity of local group without overloading it or
> * emptying busiest.
> - * XXX Spreading tasks across NUMA nodes is not always the best policy
> - * and special care should be taken for SD_NUMA domain level before
> - * spreading the tasks. For now, load_balance() fully relies on
> - * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
> */
> if (local->group_type == group_has_spare) {
> if (busiest->group_type > group_fully_busy) {
> @@ -8671,6 +8667,39 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> return;
> }
>
> + /* Consider allowing a small imbalance between NUMA groups */
> + if (env->sd->flags & SD_NUMA) {
> + unsigned int imbalance_adj, imbalance_max;
> +
> + /*
> + * imbalance_adj is the allowable degree of imbalance
> + * to exist between two NUMA domains. It's calculated
> + * relative to imbalance_pct with a minimum of two
> + * tasks or idle CPUs. The choice of two is due to
> + * the most basic case of two communicating tasks
> + * that should remain on the same NUMA node after
> + * wakeup.
> + */
> + imbalance_adj = max(2U, (busiest->group_weight *
> + (env->sd->imbalance_pct - 100) / 100) >> 1);
> +
> + /*
> + * Ignore small imbalances unless the busiest sd has
> + * almost half as many busy CPUs as there are
> + * available CPUs in the busiest group. Note that
> + * it is not exactly half as imbalance_adj must be
> + * accounted for or the two domains do not converge
> + * as equally balanced if the number of busy tasks is
> + * roughly the size of one NUMA domain.
> + */
> + imbalance_max = (busiest->group_weight >> 1) + imbalance_adj;
> + if (env->imbalance <= imbalance_adj &&

AFAICT, env->imbalance is undefined there. I have tried your patch
with the below instead

- if (env->imbalance <= imbalance_adj &&
- busiest->idle_cpus >= imbalance_max) {
+ if (busiest->idle_cpus >= imbalance_max) {

Sorry for the delay but running the tests took more time than expected. I
have applied your patch on top of v5.5-rc3 + apparmor fix.
I can see an improvement for
hackbench -l (256000/#grp) -g #grp
1 groups 14.197 +/-0.95% 12.127 +/-1.19% (+14.58%)

I haven't seen any difference otherwise

> + busiest->idle_cpus >= imbalance_max) {
> + env->imbalance = 0;
> + return;
> + }
> + }
> +
> if (busiest->group_weight == 1 || sds->prefer_sibling) {
> unsigned int nr_diff = busiest->sum_nr_running;
> /*