Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU

From: Mel Gorman
Date: Wed Sep 12 2018 - 06:52:19 EST


On Wed, Sep 12, 2018 at 12:24:10PM +0530, Srikar Dronamraju wrote:
> > > > Srikar's patch here:
> > > >
> > > > http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@xxxxxxxxxxxxxxxxxx
> > > >
> > > > Also frobs this condition, but in a less radical way. Does that yield
> > > > similar results?
> > >
> > > I can check. I do wonder of course if the less radical approach just means
> > > that automatic NUMA balancing and the load balancer simply disagree about
> > > placement at a different time. It'll take a few days to have an answer as
> > > the battery of workloads to check this takes ages.
> > >
> >
> > Tests completed over the weekend and I've found that the performance of
> > both patches is very similar on two machines (both 2-socket) running a
> > variety of workloads. Hence, I'm not worried about which patch gets picked
> > up. However, I would prefer my own on the grounds that the additional
> > complexity does not appear to get us anything. Of course, that changes if
> > Srikar's tests on his larger ppc64 machines show the more complex approach
> > is justified.
> >
>
> Running SPECjbb2005. Higher bops are better.
>
> Kernel A = v4.18 + the 13 sched patches that are part of v4.19-rc1.
> Kernel B = Kernel A + 6 patches
>   (http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-srikar@xxxxxxxxxxxxxxxxxx)
> Kernel C = Kernel B
>   - "Avoid task migration for small numa improvement", i.e.
>     http://lore.kernel.org/lkml/1533276841-16341-4-git-send-email-srikar@xxxxxxxxxxxxxxxxxx
>   + 2 patches from Mel:
>     "Do not move imbalanced load purely"
>     http://lore.kernel.org/lkml/20180907101139.20760-5-mgorman@xxxxxxxxxxxxxxxxxxx
>     "Stop comparing tasks for NUMA placement"
>     http://lore.kernel.org/lkml/20180907101139.20760-4-mgorman@xxxxxxxxxxxxxxxxxxx
>

We ended up comparing different things. I started with 4.19-rc1 plus
patches 1-3 from my series and then compared my "Do not move imbalanced
load" patch against just yours, so there was a single point of
variability. You compared one patch of yours against two of mine, so
the two sets of results are not directly comparable. That aside;

> 2 node x86 Haswell
>
> v4.18 or 94710cac0ef4
> JVMS     Prev      Current    %Change
> 4        203769
> 1        316734
>
> Kernel A
> JVMS     Prev      Current    %Change
> 4        203769    209790      2.95482
> 1        316734    312377     -1.3756
>
> Kernel B
> JVMS     Prev      Current    %Change
> 4        209790    202059     -3.68511
> 1        312377    326987      4.67704
>
> Kernel C
> JVMS     Prev      Current    %Change
> 4        202059    200681     -0.681979
> 1        326987    316715     -3.14141

Overall, this is not that promising. With Kernel B, we lose almost as
much as we gain depending on the thread count, and the data presented
is very limited.
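
To be explicit about how I'm reading the %Change column: it appears to
be (current - prev) / prev * 100, where "prev" is the preceding
kernel's bops figure rather than the common v4.18 baseline. A
throwaway sketch that reproduces a few of the quoted values (the
constants are just Srikar's bops figures, nothing else is assumed):

/*
 * Sanity check of the %Change column quoted above, assuming it is
 * (current - prev) / prev * 100 with "prev" taken from the previous
 * kernel's result. The constants are the quoted bops values.
 */
#include <stdio.h>

static double pct_change(double prev, double cur)
{
	return (cur - prev) / prev * 100.0;
}

int main(void)
{
	printf("Kernel A, 4 JVMs: %f\n", pct_change(203769, 209790)); /* ~2.95482   */
	printf("Kernel B, 1 JVM:  %f\n", pct_change(312377, 326987)); /* ~4.67704   */
	printf("Kernel C, 4 JVMs: %f\n", pct_change(202059, 200681)); /* ~-0.681979 */
	return 0;
}

Which means, for example, that Kernel C's -3.14% at 1 JVM is relative
to Kernel B's 326987, and the absolute figure is still above Kernel A's
312377.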

I can at least corroborate that the Haswell results make some sense.
The baseline here is my patch, compared against yours:

2 socket Haswell machine
1 JVM -- 1-3.5% gain -- http://www.skynet.ie/~mel/postings/numab-20180912/global-dhp__jvm-specjbb2005-single/marvin5/
2 JVM -- 1-6.3% gain -- http://www.skynet.ie/~mel/postings/numab-20180912/global-dhp__jvm-specjbb2005-multi/marvin5/

Results are similarly good for Broadwell, but it's not universal. With
specjbb2015 the outcome is both good and bad depending on the CPU
family. For example, on Haswell with one JVM per load I see a 5% gain
with your patch, while on Broadwell I see an 8% loss.

For tbench, yours was a marginal loss; the autonuma benchmark was a win
on one machine and a loss on another; hackbench showed both gains and
losses. All the results I had were marginal, which is why I was not
convinced the additional complexity was justified.

Your ppc64 figures look a bit more convincing and, while I'm
disappointed that you did not make a like-for-like comparison, I'm
happy enough to go with your version. I can re-evaluate "Stop comparing
tasks for NUMA placement" on its own later, along with the fast-migrate
patches.

Thanks Srikar.

--
Mel Gorman
SUSE Labs