Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

From: David Rientjes
Date: Tue Dec 11 2018 - 19:37:29 EST


On Sun, 9 Dec 2018, Andrea Arcangeli wrote:

> You didn't release the proprietary software that depends on
> __GFP_THISNODE behavior and that you're afraid is getting a
> regression.
>
> Could you at least release with an open source license the benchmark
> software that you must have used to do the above measurement to
> understand why it gives such a weird result on remote THP?
>

Hi Andrea,

As I said in response to Linus, I'm in the process of writing a more
complete benchmark for access and allocation latency across all of our
platforms: x86 (both Intel and AMD), POWER8/9, and arm64, run on a kernel
with minimal overhead (for the allocation latency, I want to exclude
things like mem cgroup overhead from the result).
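
The access-latency side is roughly the sketch below; this is illustrative
only (the buffer size and iteration count are placeholders, not the actual
harness), with the memory and cpu binding done externally via numactl as
in your runs:

/*
 * Illustrative access-latency microbenchmark, not the real harness:
 * fault in a buffer with or without MADV_HUGEPAGE, then time random
 * writes to it.  Bind externally, e.g. "numactl -m 0 -C 0 ./a.out".
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define BUF_SIZE	(4UL << 30)	/* placeholder working set */
#define NR_WRITES	(400UL << 20)	/* placeholder write count */

int main(int argc, char **argv)
{
	int advice = (argc > 1 && !strcmp(argv[1], "nohuge")) ?
			MADV_NOHUGEPAGE : MADV_HUGEPAGE;
	char *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct timespec start, end;
	unsigned long i, r = 1;

	if (buf == MAP_FAILED)
		return 1;
	madvise(buf, BUF_SIZE, advice);
	memset(buf, 0, BUF_SIZE);	/* fault everything in up front */

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < NR_WRITES; i++) {
		r = r * 6364136223846793005UL + 1;	/* cheap LCG */
		buf[r % BUF_SIZE] = (char)i;
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("random writes %s %lu usec\n",
	       advice == MADV_HUGEPAGE ? "MADV_HUGEPAGE" : "MADV_NOHUGEPAGE",
	       (end.tv_sec - start.tv_sec) * 1000000UL +
	       (end.tv_nsec - start.tv_nsec) / 1000);
	return 0;
}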

> On skylake and on the threadripper I can't confirm that there isn't a
> significant benefit from cross socket hugepage over cross socket small
> page.
>
> Skylake Xeon(R) Gold 5115:
>
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
> node 0 size: 15602 MB
> node 0 free: 14077 MB
> node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
> node 1 size: 16099 MB
> node 1 free: 15949 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
> # numactl -m 0 -C 0 ./numa-thp-bench
> random writes MADV_HUGEPAGE 10109753 usec
> random writes MADV_NOHUGEPAGE 13682041 usec
> random writes MADV_NOHUGEPAGE 13704208 usec
> random writes MADV_HUGEPAGE 10120405 usec
> # numactl -m 0 -C 10 ./numa-thp-bench
> random writes MADV_HUGEPAGE 15393923 usec
> random writes MADV_NOHUGEPAGE 19644793 usec
> random writes MADV_NOHUGEPAGE 19671287 usec
> random writes MADV_HUGEPAGE 15495281 usec
> # grep Xeon /proc/cpuinfo |head -1
> model name : Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz
>
> local 4k -> local 2m: +35%
> local 4k -> remote 2m: -11%
> remote 4k -> remote 2m: +26%
>
> threadripper 1950x:
>
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> node 0 size: 15982 MB
> node 0 free: 14422 MB
> node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> node 1 size: 16124 MB
> node 1 free: 5357 MB
> node distances:
> node   0   1
>   0:  10  16
>   1:  16  10
> # numactl -m 0 -C 0 /tmp/numa-thp-bench
> random writes MADV_HUGEPAGE 12902667 usec
> random writes MADV_NOHUGEPAGE 17543070 usec
> random writes MADV_NOHUGEPAGE 17568858 usec
> random writes MADV_HUGEPAGE 12896588 usec
> # numactl -m 0 -C 8 /tmp/numa-thp-bench
> random writes MADV_HUGEPAGE 19663515 usec
> random writes MADV_NOHUGEPAGE 27819864 usec
> random writes MADV_NOHUGEPAGE 27844066 usec
> random writes MADV_HUGEPAGE 19662706 usec
> # grep Threadripper /proc/cpuinfo |head -1
> model name : AMD Ryzen Threadripper 1950X 16-Core Processor
>
> local 4k -> local 2m: +35%
> local 4k -> remote 2m: -10%
> remote 4k -> remote 2m: +41%
>
> Or if you prefer reversed in terms of compute time (negative
> percentage is better in this case):
>
> local 4k -> local 2m: -26%
> local 4k -> remote 2m: +12%
> remote 4k -> remote 2m: -29%
>
> It's true that local 4k is generally a win vs remote THP when the
> workload is memory bound also for the threadripper, the threadripper
> seems even more favorable to remote THP than skylake Xeon is.
>

My results are organized slightly differently since they take local
hugepages as the baseline, which is what we optimize for: on Broadwell,
I've obtained more accurate results that show local small pages at +3.8%,
remote hugepages at +12.8%, and remote small pages at +18.8%.  I think we
both agree that the locality preference for workloads that fit within a
single node is local hugepage -> local small page -> remote hugepage ->
remote small page, and that ordering has been unchanged in any of the
benchmarking results from either of us.

> The above is the host bare metal result. Now let's try guest mode on
> the threadripper. The last two lines seems more reliable (the first
> two lines also needs to fault in the guest RAM because the guest
> was fresh booted).
>
> guest backed by local 2M pages:
>
> random writes MADV_HUGEPAGE 16025855 usec
> random writes MADV_NOHUGEPAGE 21903002 usec
> random writes MADV_NOHUGEPAGE 19762767 usec
> random writes MADV_HUGEPAGE 15189231 usec
>
> guest backed by remote 2M pages:
>
> random writes MADV_HUGEPAGE 25434251 usec
> random writes MADV_NOHUGEPAGE 32404119 usec
> random writes MADV_NOHUGEPAGE 31455592 usec
> random writes MADV_HUGEPAGE 22248304 usec
>
> guest backed by local 4k pages:
>
> random writes MADV_HUGEPAGE 28945251 usec
> random writes MADV_NOHUGEPAGE 32217690 usec
> random writes MADV_NOHUGEPAGE 30664731 usec
> random writes MADV_HUGEPAGE 22981082 usec
>
> guest backed by remote 4k pages:
>
> random writes MADV_HUGEPAGE 43772939 usec
> random writes MADV_NOHUGEPAGE 52745664 usec
> random writes MADV_NOHUGEPAGE 51632065 usec
> random writes MADV_HUGEPAGE 40263194 usec
>
> I haven't yet tried the guest mode on the skylake nor
> haswell/broadwell. I can do that too but I don't expect a significant
> difference.
>
> On a threadripper guest, the remote 2m is practically identical to
> local 4k. So shutting down compaction to try to generate local 4k
> memory looks a sure loss.
>

I'm assuming your results above are with a defrag setting of "madvise" or
"defer+madvise".

> Even if we ignore the guest mode results completely, if we don't make
> assumption on the workload to be able to fit in the node, if I use
> MADV_HUGEPAGE I think I'd prefer the risk of a -10% slowdown if the
> THP page ends up in a remote node, than not getting the +41% THP
> speedup on remote memory if the pagetable ends up being remote or the
> 4k page itself ends up being remote over time.
>

I agree with you that the preference for remote hugepages over local
small pages depends on the configuration and the workload you are
running, and that there are clear advantages and disadvantages to both.
This is different from what the long-standing NUMA preference for thp
allocations has been.

I think we can optimize for *both* use cases without causing an
unnecessary regression for the other, and doing so is not extremely
complex.

Since it depends on the workload, specifically whether it fits within a
single node, I think the reasonable approach would be to have a sane
default regardless of the use of MADV_HUGEPAGE or thp defrag settings and
then optimize for the minority of cases where the workload does not fit
in a single node.  I'm assuming there is no debate about these larger
workloads being in the minority, although we have individual machines
where they encompass the totality of their workloads.

Regarding the role of direct reclaim in the allocator, I think we need to
work on the feedback from compaction to determine whether reclaim is
worthwhile.  That's difficult because of the point I continue to bring up:
isolate_freepages() is not necessarily able to access the freed memory.
But for cases where we get COMPACT_SKIPPED because the order-0 watermarks
are failing, reclaim *is* likely to have an impact on the success of
compaction; otherwise we fail and defer because compaction wasn't able to
make a hugepage available.

[ If we run compaction regardless of the order-0 watermark check and find
  a pageblock where we can likely free a hugepage because it consists of
  fragmented movable pages, that is a pretty good indication that reclaim
  is worthwhile iff the reclaimed memory is beyond the migration scanner. ]
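
To make that concrete, the feedback I'm thinking of is along these lines
(a sketch only, not actual mm/ code; thp_should_reclaim() is a made-up
helper):

/*
 * Sketch only: let the compaction result decide whether direct reclaim
 * is worth doing for a hugepage allocation.  thp_should_reclaim() is a
 * made-up helper, not an existing kernel function.
 */
static bool thp_should_reclaim(enum compact_result result)
{
	/*
	 * COMPACT_SKIPPED means compaction bailed because the order-0
	 * watermarks were not met: freeing base pages can plausibly let
	 * a subsequent compaction attempt succeed.
	 */
	if (result == COMPACT_SKIPPED)
		return true;

	/*
	 * If compaction ran and still failed or deferred, reclaiming
	 * more order-0 memory is unlikely to produce a hugepage, and the
	 * freed memory may not even be reachable by isolate_freepages().
	 */
	return false;
}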

Let me try to list out what I think is a reasonable design for the
various configs, assuming we are able to address the reclaim concern
above.  Note that this is for the majority of users, whose workloads do
not span multiple nodes:

 - defrag=always: always compact, obviously

 - defrag=madvise/defer+madvise:

   - MADV_HUGEPAGE: always compact locally, fallback to small pages
     locally (small pages become eligible for khugepaged to collapse
     locally later, no chance of additional access latency)

   - neither MADV_HUGEPAGE nor MADV_NOHUGEPAGE: kick kcompactd locally,
     fallback to small pages locally

 - defrag=defer: kick kcompactd locally, fallback to small pages locally

 - defrag=never: fallback to small pages locally

And that, AFAICT, has been the implementation for almost four years.
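
Condensed into code form, that local-first policy is roughly the below
(a sketch only: the enum and helper are made up for illustration, the
real decision is made in the page fault path):

/*
 * Sketch of the local-first policy above; THP_DEFRAG_* and
 * thp_local_gfpmask() are made up for illustration.
 */
static gfp_t thp_local_gfpmask(enum thp_defrag mode, bool vma_madvised)
{
	gfp_t gfp = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;

	switch (mode) {
	case THP_DEFRAG_ALWAYS:
		return gfp | __GFP_DIRECT_RECLAIM;	/* always compact */
	case THP_DEFRAG_MADVISE:
	case THP_DEFRAG_DEFER_MADVISE:
		if (vma_madvised)			/* compact locally */
			return gfp | __GFP_DIRECT_RECLAIM;
		/* fall through */
	case THP_DEFRAG_DEFER:
		return gfp | __GFP_KSWAPD_RECLAIM;	/* kick kcompactd */
	case THP_DEFRAG_NEVER:
	default:
		return gfp;			/* small pages locally */
	}
}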

For workloads that *can* span multiple nodes, that local-only policy
doesn't make much sense, as you point out and have reported in your bug.
Treating the reclaim problem, where we thrash a node unnecessarily, as a
separate issue and considering only hugepages and NUMA locality:

 - defrag=always: always compact for all allowed zones, zonelist ordered
   according to NUMA locality

 - defrag=madvise/defer+madvise:

   - MADV_HUGEPAGE: always compact for all allowed zones, try to allocate
     hugepages in zonelist order, only fallback to small pages when
     compaction fails

   - neither MADV_HUGEPAGE nor MADV_NOHUGEPAGE: kick kcompactd for all
     allowed zones, fallback to small pages locally

 - defrag=defer: kick kcompactd for all allowed zones, fallback to small
   pages locally

 - defrag=never: fallback to small pages locally

For this policy to be possible, we must clear __GFP_THISNODE.  How do we
determine when to do this?  I think we have three options: a heuristic
(rss vs zone managed pages), a per-process prctl(), or a global thp
setting for machine-wide behavior.
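
As a sketch of what that decision point could look like (hypothetical
only: MMF_THP_SPAN_NODES does not exist, and the rss comparison is just
the rough shape of the heuristic):

/*
 * Hypothetical sketch of deciding when to clear __GFP_THISNODE;
 * MMF_THP_SPAN_NODES is a made-up per-process flag.
 */
static bool thp_workload_spans_nodes(struct mm_struct *mm)
{
	/* Explicit opt-in: per-process prctl() or a global thp setting. */
	if (test_bit(MMF_THP_SPAN_NODES, &mm->flags))
		return true;

	/* Heuristic: rss has already outgrown the local node. */
	return get_mm_rss(mm) > node_present_pages(numa_node_id());
}

static gfp_t thp_numa_gfpmask(struct mm_struct *mm, gfp_t gfp)
{
	/* Default (workload fits in a node): stay strictly local. */
	return thp_workload_spans_nodes(mm) ? gfp : gfp | __GFP_THISNODE;
}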

I've been suggesting a per-process prctl() that can be set and carried
across fork so that no changes are needed to any workload: it would
simply special-case the thp allocation policy to use __GFP_THISNODE,
which is the default for bare metal, and to not use it when we've said
the workload will span multiple nodes.  Depending on the size of the
workload, it may make sense to use this setting on certain systems and
not on others.
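
In userspace that would look something like the below, run once from a
wrapper before exec'ing the workload (PR_SET_THP_SPAN_NODES is a made-up
name and value for the proposed knob; it does not exist today):

#include <stdio.h>
#include <sys/prctl.h>
#include <unistd.h>

/* Placeholder number for the proposed knob; nothing defines this today. */
#ifndef PR_SET_THP_SPAN_NODES
#define PR_SET_THP_SPAN_NODES	1000
#endif

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
		return 1;
	}

	/*
	 * Declare that this workload is expected to span NUMA nodes; the
	 * setting would be carried across fork() so the exec'd workload
	 * and all of its children inherit it.  (On current kernels this
	 * simply fails with EINVAL.)
	 */
	if (prctl(PR_SET_THP_SPAN_NODES, 1, 0, 0, 0))
		perror("prctl");

	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}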