Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

From: David Rientjes
Date: Sun Dec 09 2018 - 17:44:29 EST


On Wed, 5 Dec 2018, Andrea Arcangeli wrote:

> > I must have said this at least six or seven times: fault latency is
>
> In your original regression report in this thread to Linus:
>
> https://lkml.kernel.org/r/alpine.DEB.2.21.1811281504030.231719@xxxxxxxxxxxxxxxxxxxxxxxxx
>
> you said "On a fragmented host, the change itself showed a 13.9%
> access latency regression on Haswell and up to 40% allocation latency
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> regression. This is more substantial on Naples and Rome. I also
> ^^^^^^^^^^
> measured similar numbers to this for Haswell."
>
> > secondary to the *access* latency. We want to try hard for MADV_HUGEPAGE
> > users to do synchronous compaction and try to make a hugepage available.
>
> I'm glad you said it six or seven times now, because you forgot to
> mention in the above email that the "40% allocation/fault latency
> regression" you reported above is actually a secondary concern, because
> those must be long-lived allocations and we can't yet generate
> compound pages for free after all...
>

I've been referring to the long history of this discussion, namely my
explicit Nacked-by in https://marc.info/?l=linux-kernel&m=153868420126775
two months ago citing the 13.9% access latency regression. The patch was
nonetheless merged; I proposed the revert for the same chief complaint,
and it was reverted.

I brought up the access latency issue three months ago in
https://marc.info/?l=linux-kernel&m=153661012118046 and said allocation
latency was a secondary concern, specifically that our users of
MADV_HUGEPAGE are willing to accept the increased allocation latency for
local hugepages.

> BTW, I never bothered to ask yet, but, did you enable NUMA balancing
> in your benchmarks? NUMA balancing would fix the access latency very
> easily too, so that 13.9% access latency must quickly disappear if you
> correctly have NUMA balancing enabled in a NUMA system.
>

No, we do not have CONFIG_NUMA_BALANCING enabled. The __GFP_THISNODE
behavior for hugepages was added in 4.0 for the PPC usecase, not by me.
That had nothing to do with the madvise mode: the initial documentation
referred to the mode as a way to prevent an increase in rss for configs
where "enabled" was set to madvise. The allocation policy was never about
MADV_HUGEPAGE in any 4.x kernel; it was only an indication, for certain
defrag settings, of how much work should be done to allocate *local*
hugepages at fault.

If you are saying that the change in allocator policy was made in a patch
from Aneesh almost four years ago and went unreported by anybody until a
few months ago, I can understand the frustration. I do, however, support
the __GFP_THISNODE change he made because his data shows the same results
as mine.

I've suggested a very simple extension, specifically a prctl() mode that
is inherited across fork, that would allow a workload to specify that it
prefers remote allocations over local compaction/reclaim because it is too
large to fit on a single node. I'd value your feedback on that
suggestion to fix your usecase.