Re: [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg

From: Vincent Guittot
Date: Thu Feb 13 2020 - 03:05:50 EST


On Wed, 12 Feb 2020 at 20:49, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, Feb 12, 2020 at 01:37:15PM +0000, Mel Gorman wrote:
> > > static void task_numa_assign(struct task_numa_env *env,
> > > struct task_struct *p, long imp)
> > > {
> > > @@ -1556,6 +1594,11 @@ static bool load_too_imbalanced(long src_load, long dst_load,
> > > long orig_src_load, orig_dst_load;
> > > long src_capacity, dst_capacity;
> > >
> > > +
> > > + /* If dst node has spare capacity, there is no real load imbalance */
> > > + if (env->dst_stats.node_type == node_has_spare)
> > > + return false;
> > > +
> >
> > Not exactly what the load balancer thinks though, the load balancer
> > may decide to balance the tasks between NUMA groups even when there is
> > capacity. That said, what you did here is compatible with the current code.
> >
> > I'll want to test this further but in general
> >
> > Acked-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> >
>
> And I was wrong. I worried that the conflict with the load balancer would
> be a problem when I wrote this comment. That was based on finding that
> I had to account for the load balancer when using capacity to decide
> whether an idle CPU can be used. When I didn't, performance went to
> hell but thought that maybe you had somehow avoided the same problem.
> Unfortunately, testing indicates that the same, or similar, problem is
> here.
>
> This is specjbb running two JVMs on a 2-socket Haswell machine with 48
> cores in total. This is just your first two patches, I have found that

thanks for the test results. Unfortunately i haven't specjbb to
analyse the metrics and study what is going wrong

> the group classify patches do not recover it.
>
> Hmean tput-1 39748.93 ( 0.00%) 39568.77 ( -0.45%)
> Hmean tput-2 88648.59 ( 0.00%) 85302.00 ( -3.78%)
> Hmean tput-3 136285.01 ( 0.00%) 131280.85 ( -3.67%)
> Hmean tput-4 181312.69 ( 0.00%) 179903.17 ( -0.78%)
> Hmean tput-5 228725.85 ( 0.00%) 223499.65 ( -2.28%)
> Hmean tput-6 273246.83 ( 0.00%) 268984.52 ( -1.56%)
> Hmean tput-7 317708.89 ( 0.00%) 314593.10 ( -0.98%)
> Hmean tput-8 362378.08 ( 0.00%) 337205.87 * -6.95%*
> Hmean tput-9 403792.00 ( 0.00%) 349014.39 * -13.57%*
> Hmean tput-10 446000.88 ( 0.00%) 379242.88 * -14.97%*
> Hmean tput-11 486188.58 ( 0.00%) 341951.27 * -29.67%*
> Hmean tput-12 522288.84 ( 0.00%) 388877.07 * -25.54%*
> Hmean tput-13 532394.06 ( 0.00%) 437005.34 * -17.92%*
> Hmean tput-14 539440.66 ( 0.00%) 467479.65 * -13.34%*
> Hmean tput-15 541843.50 ( 0.00%) 455477.03 ( -15.94%)
> Hmean tput-16 546510.71 ( 0.00%) 483599.58 ( -11.51%)
> Hmean tput-17 544501.21 ( 0.00%) 506189.64 ( -7.04%)
> Hmean tput-18 544802.98 ( 0.00%) 535869.05 ( -1.64%)
> Hmean tput-19 545265.27 ( 0.00%) 533050.57 ( -2.24%)
> Hmean tput-20 543284.33 ( 0.00%) 528100.03 ( -2.79%)
> Hmean tput-21 543375.11 ( 0.00%) 528692.97 ( -2.70%)
> Hmean tput-22 542536.60 ( 0.00%) 527160.27 ( -2.83%)
> Hmean tput-23 536402.28 ( 0.00%) 521023.49 ( -2.87%)
> Hmean tput-24 532307.76 ( 0.00%) 519208.85 * -2.46%*
> Stddev tput-1 1426.23 ( 0.00%) 464.57 ( 67.43%)
> Stddev tput-2 4437.10 ( 0.00%) 1489.17 ( 66.44%)
> Stddev tput-3 3021.47 ( 0.00%) 750.95 ( 75.15%)
> Stddev tput-4 4098.39 ( 0.00%) 1678.67 ( 59.04%)
> Stddev tput-5 3524.22 ( 0.00%) 1025.30 ( 70.91%)
> Stddev tput-6 3237.13 ( 0.00%) 2198.39 ( 32.09%)
> Stddev tput-7 2534.27 ( 0.00%) 3261.18 ( -28.68%)
> Stddev tput-8 3847.37 ( 0.00%) 4355.78 ( -13.21%)
> Stddev tput-9 5278.55 ( 0.00%) 4145.06 ( 21.47%)
> Stddev tput-10 5311.08 ( 0.00%) 10274.26 ( -93.45%)
> Stddev tput-11 7537.76 ( 0.00%) 16882.17 (-123.97%)
> Stddev tput-12 5023.29 ( 0.00%) 12735.70 (-153.53%)
> Stddev tput-13 3852.32 ( 0.00%) 15385.23 (-299.38%)
> Stddev tput-14 11859.59 ( 0.00%) 24273.56 (-104.67%)
> Stddev tput-15 16298.10 ( 0.00%) 45409.69 (-178.62%)
> Stddev tput-16 9041.77 ( 0.00%) 58839.77 (-550.75%)
> Stddev tput-17 9322.50 ( 0.00%) 77205.45 (-728.16%)
> Stddev tput-18 16040.01 ( 0.00%) 15021.07 ( 6.35%)
> Stddev tput-19 8814.09 ( 0.00%) 27509.99 (-212.11%)
> Stddev tput-20 7812.82 ( 0.00%) 31578.68 (-304.19%)
> Stddev tput-21 6584.58 ( 0.00%) 3639.48 ( 44.73%)
> Stddev tput-22 8294.36 ( 0.00%) 3033.49 ( 63.43%)
> Stddev tput-23 6887.93 ( 0.00%) 5450.38 ( 20.87%)
> Stddev tput-24 6081.83 ( 0.00%) 390.32 ( 93.58%)
>
> Note that how it reaches the point where the node is almost utilised
> ( near tput-12) that performance drops. Graphing mpstat on a per-node
> basis shows there is imbalance in the CPU utilisation between nodes.

ok. So the has_spare_capacity condition that i added in
load_too_imbalanced() is probably to aggressive although the goal was
to let task moving on their preferred node

>
> NUMA locality is not bad, scanning and migrations are a bit higher but
> overall this is a problem.
>
> On a 4-socket machine, I see slightly different results -- there is still
> a big drop although not a statistically significant one. However, the
> per-node CPU utilisation is skewed as nodes approach being fully utilised.
>
> I see similarly bad results on a 2-socket Broadwell machine.
>
> I don't have other results yet for other workloads but I'm not optimistic
> that I'll see anything different. I still think that this probably didn't
> show up with hackbench because it's too short-lived for NUMA balancing
> to be a factor.
>
> --
> Mel Gorman
> SUSE Labs