Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

From: Jirka Hladky
Date: Wed May 20 2020 - 12:02:10 EST


I have an update on netperf-cstate-small-cross-socket results.

Reported performance degradation of 2.5% for the UDP stream throughput
and 0.6% for the TCP throughput is for message size of 16kB. For
smaller message sizes, the performance drop is higher - up to 5% for
UDP throughput for a message size of 64B. See the numbers below [1]

We still think that it's acceptable given the gains in other
situations (this is again compared to 5.7 vanilla) :

* solved the performance drop upto 20% with single instance
SPECjbb2005 benchmark on 8 NUMA node servers (particularly on AMD EPYC
Rome systems) => this performance drop was INCREASING with higher
threads counts (10% for 16 threads and 20 % for 32 threads)
* solved the performance drop upto 50% for low load scenarios
(SPECjvm2008 and NAS)

[1]
Hillf's patch compared to 5.7 (rc4) vanilla:

TCP throughput
Message size (B)
64 -2.6%
128 -2.3%
256 -2.6%
1024 -2.7%
2048 -2.2%
3312 -2.4%
4096 -1.1%
8192 -0.4%
16384 -0.6%

UDP throughput
64 -5.0%
128 -3.0%
256 -3.0%
1024 -3.1%
2048 -3.3%
3312 -3.5%
4096 -4.0%
8192 -3.3%
16384 -2.6%


On Wed, May 20, 2020 at 3:58 PM Jirka Hladky <jhladky@xxxxxxxxxx> wrote:
>
> Hi Hillf, Mel and all,
>
> thanks for the patch! It has produced really GOOD results.
>
> 1) It has fixed performance problems with 5.7 vanilla kernel for
> single-tenant workload and low system load scenarios, without
> performance degradation for the multi-tenant tasks. It's producing the
> same results as the previous proof-of-concept patch where
> adjust_numa_imbalance function was modified to be a no-op (returning
> the same value of imbalance as it gets on the input).
>
> 2) We have also added Mel's netperf-cstate-small-cross-socket test to
> our test battery:
> https://github.com/gormanm/mmtests/blob/master/configs/config-network-netperf-cstate-small-cross-socket
>
> Mel told me that he had seen significant performance improvements with
> 5.7 over 5.6 for the netperf-cstate-small-cross-socket scenario.
>
> Out of 6 different patches we have tested, your patch has performed
> the best for this scenario. Compared to vanilla, we see minimal
> performance degradation of 2.5% for the udp stream throughput and 0.6%
> for the tcp throughput. The testing was done on a dual-socket system
> with Gold 6132 CPU.
>
> @Mel - could you please test Hillf's patch with your full testing
> suite? So far, it looks very promising, but I would like to check the
> patch thoroughly to make sure it does not hurt performance in other
> areas.
>
> Thanks a lot!
> Jirka
>
>
>
>
>
>
>
>
>
>
>
>
> On Tue, May 19, 2020 at 6:32 AM Hillf Danton <hdanton@xxxxxxxx> wrote:
> >
> >
> > Hi Jirka
> >
> > On Mon, 18 May 2020 16:52:52 +0200 Jirka Hladky wrote:
> > >
> > > We have compared it against kernel with adjust_numa_imbalance disabled
> > > [1], and both kernels perform at the same level for the single-tenant
> > > jobs, but the proposed patch is bad for the multitenancy mode. The
> > > kernel with adjust_numa_imbalance disabled is a clear winner here.
> >
> > Double thanks to you for the tests!
> >
> > > We would be very interested in what others think about disabling
> > > adjust_numa_imbalance function. The patch is bellow. It would be great
> >
> > A minute...
> >
> > > to collect performance results for different scenarios to make sure
> > > the results are objective.
> >
> > I don't have another test case but a diff trying to confine the tool
> > in question back to the hard-coded 2's field.
> >
> > It's used in the first hunk below to detect imbalance before migrating
> > a task, and a small churn of code is added at another call site when
> > balancing idle CPUs.
> >
> > Thanks
> > Hillf
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1916,20 +1916,26 @@ static void task_numa_find_cpu(struct ta
> > * imbalance that would be overruled by the load balancer.
> > */
> > if (env->dst_stats.node_type == node_has_spare) {
> > - unsigned int imbalance;
> > - int src_running, dst_running;
> > + unsigned int imbalance = 2;
> >
> > - /*
> > - * Would movement cause an imbalance? Note that if src has
> > - * more running tasks that the imbalance is ignored as the
> > - * move improves the imbalance from the perspective of the
> > - * CPU load balancer.
> > - * */
> > - src_running = env->src_stats.nr_running - 1;
> > - dst_running = env->dst_stats.nr_running + 1;
> > - imbalance = max(0, dst_running - src_running);
> > - imbalance = adjust_numa_imbalance(imbalance, src_running);
> > + //No imbalance computed without spare capacity
> > + if (env->dst_stats.node_type != env->src_stats.node_type)
> > + goto check_imb;
> > +
> > + imbalance = adjust_numa_imbalance(imbalance,
> > + env->src_stats.nr_running);
> > +
> > + //Do nothing without imbalance
> > + if (!imbalance) {
> > + imbalance = 2;
> > + goto check_imb;
> > + }
> > +
> > + //Migrate task if it's likely to grow balance
> > + if (env->dst_stats.nr_running + 1 < env->src_stats.nr_running)
> > + imbalance = 0;
> >
> > +check_imb:
> > /* Use idle CPU if there is no imbalance */
> > if (!imbalance) {
> > maymove = true;
> > @@ -9011,12 +9017,13 @@ static inline void calculate_imbalance(s
> > env->migration_type = migrate_task;
> > env->imbalance = max_t(long, 0, (local->idle_cpus -
> > busiest->idle_cpus) >> 1);
> > - }
> >
> > - /* Consider allowing a small imbalance between NUMA groups */
> > - if (env->sd->flags & SD_NUMA)
> > - env->imbalance = adjust_numa_imbalance(env->imbalance,
> > - busiest->sum_nr_running);
> > + /* Consider allowing a small imbalance between NUMA groups */
> > + if (env->sd->flags & SD_NUMA &&
> > + local->group_type == busiest->group_type)
> > + env->imbalance = adjust_numa_imbalance(env->imbalance,
> > + busiest->sum_nr_running);
> > + }
> >
> > return;
> > }
> > --
> >
>
>
> --
> -Jirka



--
-Jirka