Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

From: Mel Gorman
Date: Wed Dec 05 2018 - 05:06:36 EST


On Tue, Dec 04, 2018 at 10:45:58AM +0000, Mel Gorman wrote:
> I have *one* result of the series on a 1-socket machine running
> "thpscale". It creates a file, punches holes in it to create a
> very light form of fragmentation and then tries THP allocations
> using madvise measuring latency and success rates. It's the
> global-dhp__workload_thpscale-madvhugepage in mmtests using XFS as the
> filesystem.
>
> thpscale Fault Latencies
> 4.20.0-rc4 4.20.0-rc4
> mmots-20181130 gfpthisnode-v1r1
> Amean fault-base-3 5358.54 ( 0.00%) 2408.93 * 55.04%*
> Amean fault-base-5 9742.30 ( 0.00%) 3035.25 * 68.84%*
> Amean fault-base-7 13069.18 ( 0.00%) 4362.22 * 66.62%*
> Amean fault-base-12 14882.53 ( 0.00%) 9424.38 * 36.67%*
> Amean fault-base-18 15692.75 ( 0.00%) 16280.03 ( -3.74%)
> Amean fault-base-24 28775.11 ( 0.00%) 18374.84 * 36.14%*
> Amean fault-base-30 42056.32 ( 0.00%) 21984.55 * 47.73%*
> Amean fault-base-32 38634.26 ( 0.00%) 22199.49 * 42.54%*
> Amean fault-huge-1 0.00 ( 0.00%) 0.00 ( 0.00%)
> Amean fault-huge-3 3628.86 ( 0.00%) 963.45 * 73.45%*
> Amean fault-huge-5 4926.42 ( 0.00%) 2959.85 * 39.92%*
> Amean fault-huge-7 6717.15 ( 0.00%) 3828.68 * 43.00%*
> Amean fault-huge-12 11393.47 ( 0.00%) 5772.92 * 49.33%*
> Amean fault-huge-18 16979.38 ( 0.00%) 4435.95 * 73.87%*
> Amean fault-huge-24 16558.00 ( 0.00%) 4416.46 * 73.33%*
> Amean fault-huge-30 20351.46 ( 0.00%) 5099.73 * 74.94%*
> Amean fault-huge-32 23332.54 ( 0.00%) 6524.73 * 72.04%*
>
> So, looks like massive latency improvements but then the THP allocation
> success rates
>
> thpscale Percentage Faults Huge
> 4.20.0-rc4 4.20.0-rc4
> mmots-20181130 gfpthisnode-v1r1
> Percentage huge-3 95.14 ( 0.00%) 7.94 ( -91.65%)
> Percentage huge-5 91.28 ( 0.00%) 5.00 ( -94.52%)
> Percentage huge-7 86.87 ( 0.00%) 9.36 ( -89.22%)
> Percentage huge-12 83.36 ( 0.00%) 21.03 ( -74.78%)
> Percentage huge-18 83.04 ( 0.00%) 30.73 ( -63.00%)
> Percentage huge-24 83.74 ( 0.00%) 27.47 ( -67.20%)
> Percentage huge-30 83.66 ( 0.00%) 31.85 ( -61.93%)
> Percentage huge-32 83.89 ( 0.00%) 29.09 ( -65.32%)
>

Other results arrived once the grid caught up and they are a mixed bag of
gains and losses, roughly along the lines already predicted by the
discussion -- namely, locality is better as long as the workload fits,
compaction is reduced, reclaim is reduced and THP allocation success rates
are reduced, but latencies are often better.

Whether this is "good" or "bad" depends on whether you have a workload
that benefits, because it is neither universally good nor bad. It would
still be nice to hear how Andrea fared but I think we'll reach the same
conclusion -- the patches shuffle the problem around with limited effort
to address the root causes, so all we end up changing is the identity of
the person who complains about their workload. One might be tempted to
think that the reduced latencies in some cases are great, but not if the
workload is one that accepts longer startup costs in exchange for lower
runtime costs in the active phase.

For the much longer answer, I'll focus on the two-socket results because
they are more relevant to the current discussion. The workloads are
not realistic in the slightest, they just happen to trigger some of the
interesting corner cases.

global-dhp__workload_usemem-stress-numa-compact
o Plain anonymous faulting workload
o defrag=always (not representative, simply triggers a bad case)
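
For reference, a minimal sketch of the faulting side of such a workload.
This is illustrative only and not the usemem source; the mapping size is
made up and it assumes defrag has already been set to "always" via
/sys/kernel/mm/transparent_hugepage/defrag so that THP faults may stall
for reclaim/compaction:

/*
 * Sketch of an anonymous faulting loop in the spirit of the workload
 * above. Illustrative only -- not the mmtests/usemem source.
 */
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

int main(void)
{
        size_t size = 1UL << 30;        /* 1G of anonymous memory, illustrative */
        struct timespec start, end;
        char *buf;

        buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        clock_gettime(CLOCK_MONOTONIC, &start);
        /* Touch one byte per 4K page; each first touch is a page fault */
        for (size_t off = 0; off < size; off += 4096)
                buf[off] = 1;
        clock_gettime(CLOCK_MONOTONIC, &end);

        printf("faulted %zu MB in %.3f sec\n", size >> 20,
               (end.tv_sec - start.tv_sec) +
               (end.tv_nsec - start.tv_nsec) / 1e9);

        munmap(buf, size);
        return 0;
}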

4.20.0-rc4 4.20.0-rc4
mmots-20181130 gfpthisnode-v1r1
Amean Elapsd-1 26.79 ( 0.00%) 34.92 * -30.37%*
Amean Elapsd-3 7.32 ( 0.00%) 8.10 * -10.61%*
Amean Elapsd-4 5.53 ( 0.00%) 5.64 ( -1.94%)

Units are seconds; time to complete is 30.37% worse for the single-threaded
case. There is no direct reclaim activity, but other activity is interesting
and I'll pick out some snippets:

4.20.0-rc4 4.20.0-rc4
mmots-20181130 gfpthisnode-v1r1
Swap Ins 8 0
Swap Outs 1546 0
Allocation stalls 0 0
Fragmentation stalls 0 2022
Direct pages scanned 0 0
Kswapd pages scanned 42719 1078
Kswapd pages reclaimed 41082 1049
Page writes by reclaim 1546 0
Page writes file 0 0
Page writes anon 1546 0
Page reclaim immediate 2 0

The baseline kernel swaps out (bad); David's patch reclaims less (good).
That's reasonably positive. Less positive is that fragmentation stalls are
triggered with David's patch. These are due to a patch of mine in Andrew's
tree which I've asked him to drop: while it helps control long-term
fragmentation, there was always a risk that the short stalls would be
problematic, and it's a distraction here.

THP fault alloc 540043 456714
THP fault fallback 0 83329
THP collapse alloc 0 4
THP collapse fail 0 0
THP split 1 0
THP split failed 0 0

David's patch falls back to base page allocation to a much higher degree
(bad).

Compaction pages isolated 85381 11432
Compaction migrate scanned 204787 42635
Compaction free scanned 72376 13061
Compact scan efficiency 282% 326%

David's patch also compacts less.

NUMA alloc hit 1188182 1093244
NUMA alloc miss 68199 42764192
NUMA interleave hit 0 0
NUMA alloc local 1179614 1084665
NUMA base PTE updates 28902547 23270389
NUMA huge PMD updates 56437 45438
NUMA page range updates 57798291 46534645
NUMA hint faults 61395 47838
NUMA hint local faults 46440 47833
NUMA hint local percent 75% 99%
NUMA pages migrated 2000156 5

Interestingly, the NUMA misses are higher with David's patch, indicating
that it's allocating *more* from remote nodes. However, there are also
hints that the accessing process then moves to the remote node, whereas
the current mmotm kernel tries to migrate the memory locally instead.
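
As an aside, this kind of placement can be observed directly from
userspace by querying the node of individual pages with move_pages().
A minimal sketch assuming libnuma's numaif.h is available (none of the
benchmarks above actually do this):

/*
 * Query which NUMA node pages ended up on using move_pages() in query
 * mode (nodes == NULL). Illustrative only; build with -lnuma.
 */
#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 4UL << 20;         /* 4M, illustrative */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        void *pages[4];
        int status[4];

        if (buf == MAP_FAILED)
                return 1;

        /* Fault and sample the first page of each 1M chunk */
        for (int i = 0; i < 4; i++) {
                buf[i * (1UL << 20)] = 1;
                pages[i] = buf + i * (1UL << 20);
        }

        /* nodes == NULL asks for the current node of each page in status[] */
        if (move_pages(0, 4, pages, NULL, status, 0) == 0) {
                for (int i = 0; i < 4; i++)
                        printf("page %d is on node %d\n", i, status[i]);
        }

        munmap(buf, len);
        return 0;
}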

So, in line with expectations. The baseline kernel works harder to
allocate the THPs whereas David's gives up quickly and moves on. At one
level this is good, but the bottom line is that the total time to complete
the workload goes from the baseline of 280 seconds up to 344 seconds.
Overall it's mixed because, depending on what you look at, it's both good
and bad.

global-dhp__workload_thpscale-xfs
o Workload creates a large file, punches holes in it
o Mapping is created and faulted to measure allocation success rates and
latencies
o No special madvise
o Considered a relatively "simple" case
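
For reference, a rough sketch of what this kind of workload does. It is
illustrative only and not the actual mmtests implementation; the file
name, sizes and hole pattern are made up and error handling is minimal:

/*
 * Write out a file and punch holes in it to leave memory lightly
 * fragmented, then fault an anonymous mapping while timing each fault.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define FILE_SIZE       (1UL << 30)     /* 1G backing file, illustrative */
#define CHUNK           (2UL << 20)     /* work in 2M (THP-sized) units */

int main(void)
{
        int fd = open("thpscale.dat", O_CREAT | O_RDWR, 0600);
        char *chunk = malloc(CHUNK);
        char *map;

        /* Populate the file, then punch out every second 2M chunk */
        memset(chunk, 1, CHUNK);
        for (off_t off = 0; off < (off_t)FILE_SIZE; off += CHUNK)
                pwrite(fd, chunk, CHUNK, off);
        for (off_t off = 0; off < (off_t)FILE_SIZE; off += 2 * CHUNK)
                fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          off, CHUNK);

        /* Fault an anonymous mapping 2M at a time, timing each fault */
        map = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (map == MAP_FAILED)
                return 1;

        for (size_t off = 0; off < FILE_SIZE; off += CHUNK) {
                struct timespec s, e;

                clock_gettime(CLOCK_MONOTONIC, &s);
                map[off] = 1;
                clock_gettime(CLOCK_MONOTONIC, &e);
                printf("fault at %zu: %ld ns\n", off,
                       (e.tv_sec - s.tv_sec) * 1000000000L +
                       (e.tv_nsec - s.tv_nsec));
        }

        munmap(map, FILE_SIZE);
        close(fd);
        return 0;
}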

4.20.0-rc4 4.20.0-rc4
mmots-20181130 gfpthisnode-v1r1
Amean fault-base-3 2021.05 ( 0.00%) 2633.11 * -30.28%*
Amean fault-base-5 2475.25 ( 0.00%) 2997.15 * -21.08%*
Amean fault-base-7 5595.79 ( 0.00%) 7523.10 ( -34.44%)
Amean fault-base-12 15604.91 ( 0.00%) 16355.02 ( -4.81%)
Amean fault-base-18 20277.13 ( 0.00%) 22062.73 ( -8.81%)
Amean fault-base-24 24218.46 ( 0.00%) 25772.49 ( -6.42%)
Amean fault-base-30 28516.75 ( 0.00%) 28208.14 ( 1.08%)
Amean fault-base-32 36722.30 ( 0.00%) 20712.46 * 43.60%*
Amean fault-huge-1 0.00 ( 0.00%) 0.00 ( 0.00%)
Amean fault-huge-3 685.38 ( 0.00%) 512.02 * 25.29%*
Amean fault-huge-5 3639.75 ( 0.00%) 807.33 ( 77.82%)
Amean fault-huge-7 1139.54 ( 0.00%) 555.45 * 51.26%*
Amean fault-huge-12 1012.64 ( 0.00%) 850.68 ( 15.99%)
Amean fault-huge-18 6694.45 ( 0.00%) 1310.39 * 80.43%*
Amean fault-huge-24 10165.27 ( 0.00%) 3822.23 * 62.40%*
Amean fault-huge-30 13496.19 ( 0.00%) 19248.06 * -42.62%*
Amean fault-huge-32 4477.05 ( 0.00%) 63463.78 *-1317.54%*

These latency outliers can be huge so take them with a grain of salt.
Sometimes I'll look at the percentiles but it takes an age to discuss.

In general, David's patch faults huge pages faster, particularly with
higher thread counts. The allocation success rates are also generally
better:

4.20.0-rc4 4.20.0-rc4
mmots-20181130 gfpthisnode-v1r1
Percentage huge-3 2.86 ( 0.00%) 26.48 ( 825.27%)
Percentage huge-5 1.07 ( 0.00%) 1.41 ( 31.16%)
Percentage huge-7 20.38 ( 0.00%) 54.82 ( 168.94%)
Percentage huge-12 19.07 ( 0.00%) 38.10 ( 99.76%)
Percentage huge-18 10.72 ( 0.00%) 30.18 ( 181.49%)
Percentage huge-24 8.44 ( 0.00%) 15.48 ( 83.39%)
Percentage huge-30 7.41 ( 0.00%) 10.78 ( 45.38%)
Percentage huge-32 29.08 ( 0.00%) 3.23 ( -88.91%)

Overall system activity looks similar, which is counter-intuitive. The
only hint of what is going on is that David's patch reclaims less from
kswapd context. Direct reclaim scanning is high in both cases but does
not reclaim much. David's patch scans for free pages as compaction
targets much more aggressively, but there is no indication as to why.
Locality information looks similar.

So, I'm not sure what to think about this one. The headline results look
good but there is no obvious explanation as to why exactly. It could be
that the stalls (higher with David's patch) mean there is less
interference between threads, but that's thin.

global-dhp__workload_thpscale-madvhugepage-xfs
o Same as above except that MADV_HUGEPAGE is used
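
The only difference from the previous case is the MADV_HUGEPAGE hint on
the mapping before it is faulted. A minimal, self-contained illustration
of what that hint requests (not the benchmark source; the size is
arbitrary):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 64UL << 20;        /* 64M, illustrative */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;

        /* Ask that this range be backed by huge pages where possible */
        if (madvise(p, len, MADV_HUGEPAGE))
                perror("madvise(MADV_HUGEPAGE)");

        /* A first touch after the hint may attempt a THP allocation */
        p[0] = 1;

        munmap(p, len);
        return 0;
}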

4.20.0-rc4 4.20.0-rc4
mmots-20181130 gfpthisnode-v1r1
Amean fault-base-1 0.00 ( 0.00%) 0.00 ( 0.00%)
Amean fault-base-3 18880.35 ( 0.00%) 6341.60 * 66.41%*
Amean fault-base-5 27608.74 ( 0.00%) 6515.10 * 76.40%*
Amean fault-base-7 28345.03 ( 0.00%) 7529.98 * 73.43%*
Amean fault-base-12 35690.33 ( 0.00%) 13518.77 * 62.12%*
Amean fault-base-18 56538.31 ( 0.00%) 23933.91 * 57.67%*
Amean fault-base-24 71485.33 ( 0.00%) 26927.03 * 62.33%*
Amean fault-base-30 54286.39 ( 0.00%) 23453.61 * 56.80%*
Amean fault-base-32 92143.50 ( 0.00%) 19474.99 * 78.86%*
Amean fault-huge-1 0.00 ( 0.00%) 0.00 ( 0.00%)
Amean fault-huge-3 5666.72 ( 0.00%) 1351.55 * 76.15%*
Amean fault-huge-5 8307.35 ( 0.00%) 2776.28 * 66.58%*
Amean fault-huge-7 10651.96 ( 0.00%) 2397.70 * 77.49%*
Amean fault-huge-12 15489.56 ( 0.00%) 7034.98 * 54.58%*
Amean fault-huge-18 20278.54 ( 0.00%) 6417.46 * 68.35%*
Amean fault-huge-24 29378.24 ( 0.00%) 16173.41 * 44.95%*
Amean fault-huge-30 29237.66 ( 0.00%) 81198.70 *-177.72%*
Amean fault-huge-32 27177.37 ( 0.00%) 18966.08 * 30.21%*

Superb improvement in latencies, coupled with the following:

4.20.0-rc4 4.20.0-rc4
mmots-20181130 gfpthisnode-v1r1
Percentage huge-1 0.00 ( 0.00%) 0.00 ( 0.00%)
Percentage huge-3 99.74 ( 0.00%) 49.62 ( -50.25%)
Percentage huge-5 99.24 ( 0.00%) 12.19 ( -87.72%)
Percentage huge-7 97.98 ( 0.00%) 19.20 ( -80.40%)
Percentage huge-12 95.76 ( 0.00%) 21.33 ( -77.73%)
Percentage huge-18 94.91 ( 0.00%) 31.63 ( -66.67%)
Percentage huge-24 94.36 ( 0.00%) 9.27 ( -90.18%)
Percentage huge-30 92.15 ( 0.00%) 9.60 ( -89.58%)
Percentage huge-32 94.18 ( 0.00%) 8.67 ( -90.79%)

THP allocation success rates are through the floor, which is why the
latencies are better overall.

This goes back to the fundamental question -- does your workload benefit
from THP or not, and is it the primary metric? If yes (as is potentially
the case with KVM) then this is a disaster. It's actually a mixed bag for
David because THP was desired but so was locality. In this case, the
application specifically requested THP, so presumably a real application
specifying the flag means it.

The high-level system stats reflect the level of effort: David's patch
does less work in the system, which is both good and bad depending on
your requirements.

4.20.0-rc4 4.20.0-rc4
mmots-20181130 gfpthisnode-v1r1
Swap Ins 1564 0
Swap Outs 12283 163
Allocation stalls 30236 24
Fragmentation stalls 1069 24683

Baseline kernel swaps and has high allocation stalls to reclaim memory.
David's patch stalls on trying to control fragmentation instead.

4.20.0-rc4 4.20.0-rc4
mmots-20181130 gfpthisnode-v1r1
Direct pages scanned 12780511 9955217
Kswapd pages scanned 1944181 16554296
Kswapd pages reclaimed 870023 4029534
Direct pages reclaimed 6738924 5884
Kswapd efficiency 44% 24%
Kswapd velocity 1308.975 11200.850
Direct efficiency 52% 0%
Direct velocity 8604.840 6735.828

The baseline kernel does much of the reclaim work in direct context
while David's does it in kswapd context. (Efficiency here is pages
reclaimed as a percentage of pages scanned.)

THP fault alloc 316843 238810
THP fault fallback 17224 95256
THP collapse alloc 2 0
THP collapse fail 0 5
THP split 177536 180673
THP split failed 10024 2

The baseline kernel allocates THP while David's falls back.

THP collapse alloc 2 0
THP collapse fail 0 5
Compaction stalls 100198 75267
Compaction success 65803 3964
Compaction failures 34395 71303
Compaction efficiency 65% 5%
Page migrate success 40807601 17963914
Page migrate failure 16206 16782
Compaction pages isolated 90818819 41285100
Compaction migrate scanned 98628306 36990342
Compaction free scanned 6547623619 6870889207

Unsurprisingly, David's patch tries to compact less. The collapse activity
shows that not enough time passed for khugepaged to intervene. While
outside the context of the current discussion, that compaction scanning
activity is mental but also unsurprising; a lot of it is from kcompactd
activity, and dealing with that is a separate series.

Given the mix of gains and losses, the patch simply shuffles the problem
around in a circle. Some workloads benefit, some don't, and whether it's
merged or not, someone ends up annoyed as their workload suffers.

I know I didn't review the patch in much detail because, in this context,
it was more interesting to know "what it does" than the specifics of the
approach. I'm going to go back to hopping my face off the compaction
series because I think it has the potential to reduce the problem
overall instead of shuffling the deckchairs around the Titanic[1].

[1] Famous last words; the series could end up being the iceberg.

--
Mel Gorman
SUSE Labs