Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2

From: Phil Auld
Date: Tue Jan 07 2020 - 14:26:40 EST



Hi,

On Tue, Jan 07, 2020 at 09:56:55AM +0000 Mel Gorman wrote:
>
> util_avg can be skewed if there are big outliers. Even then, it's not
> a great metric for the low utilisation cutoff. Large numbers of mostly
> idle but running tasks would be treated similarly to small numbers of
> fully active tasks. It's less predictable and harder to reason about how
> load balancing behaves across a variety of workloads.
>
> Based on what you suggest, the result looks like this (build tested
> only)
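
On the util_avg point above: just to make the contrast concrete, here is a
toy illustration (completely made-up numbers, nothing measured from these
runs) of how a cutoff based on summed utilisation can look the same for
very different task counts:

/*
 * Toy example, not kernel code: two loads with near-identical summed
 * utilisation but very different nr_running.
 */
#include <stdio.h>

int main(void)
{
	/* 16 tasks each ~5% busy vs. one task ~80% busy (1024 == fully busy) */
	unsigned int many_nr = 16, many_util_each = 51;
	unsigned int few_nr  = 1,  few_util_each  = 819;

	printf("many mostly-idle tasks: nr_running=%u sum_util=%u\n",
	       many_nr, many_nr * many_util_each);	/* 16, 816 */
	printf("few fully-active tasks: nr_running=%u sum_util=%u\n",
	       few_nr, few_nr * few_util_each);		/* 1, 819 */
	return 0;
}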

(Here I'm calling the below patch v4 for lack of a better name.)

One of my concerns is to have the group imbalance issue addressed. This
is the one remaining issue from the wasted cores paper. I have a setup
that is designed to illustrate this case. I ran a number of tests with
the small imbalance patches (v3 and v4 in this case), as well as on
kernels from both before and after Vincent's load balancing rework.

The basic test is to run an LU.c benchmark from the NAS parallel benchmark
suite along with a couple of other cpu-burning tasks. In the GROUP case,
LU.c and each cpu hog are in separate cgroups. In the NORMAL case, they are
all in one cgroup. The GROUP case shows off the problem with averaging the
group scheduling load: the jobs fail to get balanced across the NUMA nodes.
We end up with idle CPUs on the nodes where the cpu hogs are running while
LU.c threads pile up on the others, with a big impact on the benchmark's
performance. This test benefits from getting balanced well quickly.
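
For reference, the cgroup placement amounts to something like the sketch
below. This is not my actual test harness; it assumes a cgroup v1 cpu
controller mounted at /sys/fs/cgroup/cpu and uses made-up group names,
just to show the shape of the GROUP vs NORMAL setups.

/*
 * Rough sketch only: move the calling task (the LU.c launcher or one of
 * the cpu hogs) into a cpu cgroup. In the GROUP case each job gets its
 * own directory (lu, hog1, hog2, ...); in the NORMAL case they all share
 * a single one.
 */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int move_self_into(const char *cgroup_dir)
{
	char path[256];
	FILE *f;

	mkdir(cgroup_dir, 0755);			/* may already exist */
	snprintf(path, sizeof(path), "%s/tasks", cgroup_dir);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", (int)getpid());		/* cgroup v1 tasks file */
	fclose(f);
	return 0;
}

int main(void)
{
	/* GROUP case: e.g. /sys/fs/cgroup/cpu/hog1 (hypothetical path) */
	return move_self_into("/sys/fs/cgroup/cpu/hog1");
}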


The test machine is a 4-node 80 cpu x86_64 system (smt on). There are 76
threads in the LU.c test and 2 stress cpu jobs. Each row shows the numbers
for 10 runs to smooth it out and make the mean more, well, meaningful.
It's still got a fair bit of variance as you can see from the 3 sets of
data points for each kernel.

5.4.0 is before load balancing rework (the really bad case).
5.5-rc2 is with the load balancing rework.
lbv3 is Mel's posted v3 patch on top of 5.5-rc2
lbv4 is Mel's experimental v4, which came out of the email discussion with
Vincent (quoted below).


lbv4 appears a little worse for the GROUP case. v3 and 5.5-rc2 are pretty
close to the same.

All of the post-5.4.0 cases lose a little on the NORMAL case. lbv3 seems
to get a fair bit of that loss back on average, but with a bit higher
variability.


This test can be pretty variable, though, so the minor differences probably
don't mean that much. All of the post-rework cases still show a vast
improvement in the GROUP case, which, given the common use of cgroups in
modern workloads, is a good thing.

----------------------------------

GROUP - LU.c and cpu hogs in separate cgroups
Mop/s - Higher is better
============76_GROUP========Mop/s===================================
kernel         min        q1    median        q3       max
5.4.0       1671.8    4211.2    6103.0    6934.1    7865.4
5.4.0       1777.1    3719.9    4861.8    5822.5   13479.6
5.4.0       2015.3    2716.2    5007.1    6214.5    9491.7
5.5-rc2    27641.0   30684.7   32091.8   33417.3   38118.1
5.5-rc2    27386.0   29795.2   32484.1   36004.0   37704.3
5.5-rc2    26649.6   29485.0   30379.7   33116.0   36832.8
lbv3       28496.3   29716.0   30634.8   32998.4   40945.2
lbv3       27294.7   29336.4   30186.0   31888.3   35839.1
lbv3       27099.3   29325.3   31680.1   35973.5   39000.0
lbv4       27936.4   30109.0   31724.8   33150.7   35905.1
lbv4       26431.0   29355.6   29850.1   32704.4   36060.3
lbv4       27436.6   29945.9   31076.9   32207.8   35401.5

Runtime - Lower is better
============76_GROUP========time====================================
kernel         min        q1    median        q3       max
5.4.0        259.2    294.92    335.39    484.33   1219.61
5.4.0        151.3     351.1     419.4    551.99    1147.3
5.4.0        214.8    328.16    407.27    751.03   1011.77
5.5-rc2      53.49     61.03     63.56     66.46     73.77
5.5-rc2      54.08     56.67     62.78     68.44     74.45
5.5-rc2      55.36     61.61     67.14     69.16     76.51
lbv3          49.8      61.8     66.59     68.62     71.55
lbv3         56.89     63.95     67.55     69.51      74.7
lbv3         52.28     56.68     64.38     69.54     75.24
lbv4         56.79     61.52      64.3     67.73     72.99
lbv4         56.54     62.36     68.31     69.47     77.14
lbv4          57.6     63.33     65.64     68.11     74.32

NORMAL - LU.c and cpu hogs all in one cgroup
Mop/s - Higher is better
============76_NORMAL========Mop/s===================================
kernel         min        q1    median        q3       max
5.4.0      32912.6   34047.5   36739.4   39124.1   41592.5
5.4.0      29937.7   33060.6   34860.8   39528.8   43328.1
5.4.0      31851.2   34281.1   35284.4   36016.8   38847.4
5.5-rc2    30475.6   32505.1   33977.3     34876   36233.8
5.5-rc2    30657.7   31301.1   32059.4   34396.7   38661.8
5.5-rc2      31022   32247.6   32628.9     33245   38572.3
lbv3       30606.4   32794.4   34258.6     35699   38669.2
lbv3       29722.7   30558.9   32731.2     36412   40752.3
lbv3       30297.7   32568.3   36654.6   38066.2   38988.3
lbv4       30084.9   31227.5   32312.8   33222.8   36039.7
lbv4       29875.9   32903.6   33803.1   34519.3   38663.5
lbv4       27923.3   30631.1   32666.9   33516.7   36663.4

Runtime - Lower is better
============76_NORMAL========time====================================
kernel         min        q1    median        q3       max
5.4.0        49.02    52.115     55.58     59.89     61.95
5.4.0        47.06    51.615     58.57     61.68     68.11
5.4.0        52.49    56.615    57.795     59.48     64.02
5.5-rc2      56.27     58.47     60.02    62.735     66.91
5.5-rc2      52.74    59.295    63.605    65.145     66.51
5.5-rc2      52.86    61.335    62.495     63.23     65.73
lbv3         52.73     57.12     59.52     62.19     66.62
lbv3         50.03     56.02     62.39    66.725      68.6
lbv3          52.3    53.565     55.65    62.645      67.3
lbv4         56.58    61.375    63.135      65.3     67.77
lbv4         52.74     59.07    60.335     61.97     68.25
lbv4         55.61    60.835     62.42    66.635     73.02


So, aside from the theoretical disputes, the posted v3 seems reasonable.
When a final version comes together I'll have the perf team run a
fuller set of tests.


Cheers,
Phil


>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ba749f579714..1b2c7bed2db5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8648,10 +8648,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  	/*
>  	 * Try to use spare capacity of local group without overloading it or
>  	 * emptying busiest.
> -	 * XXX Spreading tasks across NUMA nodes is not always the best policy
> -	 * and special care should be taken for SD_NUMA domain level before
> -	 * spreading the tasks. For now, load_balance() fully relies on
> -	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
>  	 */
>  	if (local->group_type == group_has_spare) {
>  		if (busiest->group_type > group_fully_busy) {
> @@ -8691,16 +8687,41 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  			env->migration_type = migrate_task;
>  			lsub_positive(&nr_diff, local->sum_nr_running);
>  			env->imbalance = nr_diff >> 1;
> -			return;
> -		}
> +		} else {
> 
> -		/*
> -		 * If there is no overload, we just want to even the number of
> -		 * idle cpus.
> -		 */
> -		env->migration_type = migrate_task;
> -		env->imbalance = max_t(long, 0, (local->idle_cpus -
> +			/*
> +			 * If there is no overload, we just want to even the number of
> +			 * idle cpus.
> +			 */
> +			env->migration_type = migrate_task;
> +			env->imbalance = max_t(long, 0, (local->idle_cpus -
>  						 busiest->idle_cpus) >> 1);
> +		}
> +
> +		/* Consider allowing a small imbalance between NUMA groups */
> +		if (env->sd->flags & SD_NUMA) {
> +			struct sched_domain *child = env->sd->child;
> +			unsigned int imbalance_adj;
> +
> +			/*
> +			 * Calculate an acceptable degree of imbalance based
> +			 * on imbalance_adj. However, do not allow a greater
> +			 * imbalance than the child domains weight to avoid
> +			 * a case where the allowed imbalance spans multiple
> +			 * LLCs.
> +			 */
> +			imbalance_adj = busiest->group_weight * (env->sd->imbalance_pct - 100) / 100;
> +			imbalance_adj = min(imbalance_adj, child->span_weight);
> +			imbalance_adj >>= 1;
> +
> +			/*
> +			 * Ignore small imbalances when the busiest group has
> +			 * low utilisation.
> +			 */
> +			if (busiest->sum_nr_running < imbalance_adj)
> +				env->imbalance = 0;
> +		}
> +
>  		return;
>  	}
>
>
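
FWIW, plugging this machine's numbers into the low-utilisation cutoff above
(back-of-the-envelope only; I'm assuming the NUMA domain keeps the default
imbalance_pct of 125 and that the child domain spans one 20-cpu node):

/*
 * Back-of-the-envelope check of the v4 cutoff on the 4-node/80-cpu test
 * box. The inputs are assumptions, not values read from a running kernel.
 */
#include <stdio.h>

int main(void)
{
	unsigned int imbalance_pct = 125;	/* assumed SD_NUMA default */
	unsigned int group_weight = 20;		/* cpus in the busiest node */
	unsigned int child_span = 20;		/* child domain = one node */
	unsigned int imbalance_adj;

	imbalance_adj = group_weight * (imbalance_pct - 100) / 100;	/* 5 */
	if (imbalance_adj > child_span)
		imbalance_adj = child_span;				/* still 5 */
	imbalance_adj >>= 1;						/* 2 */

	printf("imbalance ignored while busiest sum_nr_running < %u\n",
	       imbalance_adj);
	return 0;
}

So, if I have the arithmetic right, on this box the cutoff only kicks in
when the busiest node is down to a single running task.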

--