Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

From: David Rientjes
Date: Fri Oct 05 2018 - 16:35:21 EST


On Fri, 5 Oct 2018, Mel Gorman wrote:

> > This causes, on average, a 13.9% access latency regression on Haswell, and
> > the regression would likely be more severe on Naples and Rome.
> >
>
> That assumes that fragmentation prevents easy allocation which may very
> well be the case. While it would be great that compaction or the page
> allocator could be further improved to deal with fragmentation, it's
> outside the scope of this patch.
>

Hi Mel,

The regression that Andrea is working on, correct me if I'm wrong, is
heavy reclaim and swapping activity that is trying to desperately allocate
local hugepages when the local node is fragmented based on advice provided
by MADV_HUGEPAGE.

Why is it ever appropriate to do heavy reclaim and swap activity to
allocate a transparent hugepage? This is exactly what the __GFP_NORETRY
check for high-order allocations is attempting to avoid, and it explicitly
states that it is for thp faults. The fact that we lost __GFP_NORETRY for
thp allocations for all settings, including the default setting, other
than yours (setting of "always") is what I'm focusing on. There is no
guarantee that this activity will free an entire pageblock or that it is
even worthwhile.
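
For anyone who wants to check which setting a given machine is running
with: the thp sysfs files list every option on one line and put the
current one in brackets. A minimal sketch of reading that, assuming the
usual /sys/kernel/mm/transparent_hugepage/ layout; the helper names are
made up for this example and are not an existing API:

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed (currently selected) token from a sysfs-style line such
 * as "always defer defer+madvise [madvise] never" into out.
 * Returns 0 on success, -1 if there is no bracketed token or it does not fit.
 */
int thp_selected_option(const char *line, char *out, size_t outlen)
{
	const char *start = strchr(line, '[');
	const char *end = start ? strchr(start, ']') : NULL;
	size_t len;

	if (!start || !end)
		return -1;
	len = (size_t)(end - start - 1);
	if (len + 1 > outlen)
		return -1;
	memcpy(out, start + 1, len);
	out[len] = '\0';
	return 0;
}

/* Read the first line of path and extract the selected option from it. */
int thp_read_setting(const char *path, char *out, size_t outlen)
{
	char line[256];
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(line, sizeof(line), f))
		ret = thp_selected_option(line, out, outlen);
	fclose(f);
	return ret;
}
```

Calling thp_read_setting() on /sys/kernel/mm/transparent_hugepage/defrag
and on .../enabled reports the two settings in question; it returns -1 if
the file cannot be opened, for example on a kernel built without thp.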

Why is thp memory ever being allocated without __GFP_NORETRY as the page
allocator expects?

That aside, removing __GFP_THISNODE can make the fault latency much worse
if remote nodes are fragmented and/or reclaim is unable to free
contiguous memory, which it likely cannot. This is where I measured over
40% fault latency regression from Linus's tree with this patch on a
fragmented system where order-9 memory is available from neither node 0
nor node 1 on Haswell.

> > There exist libraries that allow the .text segment of processes to be
> > remapped to memory backed by transparent hugepages and use MADV_HUGEPAGE
> > to stress local compaction to defragment node local memory for hugepages
> > at startup.
>
> That is taking advantage of a co-incidence of the implementation.
> MADV_HUGEPAGE is *advice* that huge pages be used, not what the locality
> is. A hint for strong locality preferences should be separate advice
> (madvise) or a separate memory policy. Doing that is outside the context
> of this patch but nothing stops you introducing such a policy or madvise,
> whichever you think would be best for the libraries to consume (I'm only
> aware of libhugetlbfs but there might be others).
>

The behavior that MADV_HUGEPAGE specifies is certainly not clearly
defined, unfortunately. The way that an application writer may read it,
as we have, is that it will make a stronger attempt at allocating a
hugepage at fault. This actually works quite well when the allocation
correctly has __GFP_NORETRY, as it's supposed to, and compaction is
MIGRATE_ASYNC.
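
For reference, the userspace side of this is small. A minimal sketch,
assuming Linux with glibc; map_with_thp_advice() is a name invented for
this example. The advice is only a hint: whether the fault actually gets a
hugepage depends on the thp settings and on how fragmented memory is.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of anonymous memory and advise the kernel to back it with
 * transparent hugepages. Returns the mapping, or NULL if mmap() fails.
 * *advised is set to 1 if madvise(MADV_HUGEPAGE) succeeded, 0 otherwise
 * (for example, on a kernel built without thp support).
 */
void *map_with_thp_advice(size_t len, int *advised)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	*advised = (madvise(p, len, MADV_HUGEPAGE) == 0);
	return p;
}
```

The allocation itself happens at first touch of each page, so that is
where the fault latency discussed in this thread is paid.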

So rather than focusing on what MADV_HUGEPAGE has meant over the past 2+
years of kernels that we have built on, or what it meant prior to that,
there is a fundamental question about the purpose of direct reclaim and
swap activity, which had always been precluded before __GFP_NORETRY was
removed from thp allocations. I don't think anybody in this thread wants a
14% remote access latency regression if we allocate remotely or a 40% fault
latency regression when remote nodes are fragmented as well.

Removing __GFP_THISNODE only helps when remote memory is not fragmented;
otherwise it multiplies the problem, as I've shown.

The numbers that you provide while using the non-default option to mimic
MADV_HUGEPAGE mappings but also use __GFP_NORETRY make the actual source
of the problem quite easy to identify: there is an inconsistency in the
thp gfp mask and the page allocator implementation.

> > The cost, including the statistics Mel gathered, is
> > acceptable for these processes: they are not concerned with startup cost,
> > they are concerned only with optimal access latency while they are
> > running.
> >
>
> Then such applications at startup have the option of setting
> zone_reclaim_mode during initialisation assuming a privileged helper
> can be created. That would be somewhat heavy handed and a longer-term
> solution would still be to create a proper memory policy of madvise flag
> for those libraries.
>

We *never* want to use zone_reclaim_mode for these allocations; that would
be even worse. We do not want to reclaim because we are very unlikely to
free whole pageblocks without the involvement of compaction. We want to
trigger memory compaction with the well-bounded cost that MIGRATE_ASYNC
provides and then fail.
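
To make that concrete, here is a userspace toy model of the policy I am
describing. It is not kernel code and does not mirror the page allocator;
every type and function name is hypothetical and exists only for this
illustration. The point is the shape of the slow path: one bounded
compaction attempt, no reclaim, then fail.

```c
#include <stdbool.h>

/* Hypothetical outcome codes for this toy model; not kernel definitions. */
enum toy_result { TOY_ALLOCATED, TOY_FAILED };

struct toy_node {
	bool has_free_hugepage;      /* a free order-9 block already exists */
	bool async_compaction_works; /* cheap compaction can assemble one */
};

struct toy_stats {
	int compaction_attempts;
	int reclaim_attempts; /* stays 0: this policy never reclaims */
};

enum toy_result toy_thp_fault(const struct toy_node *node,
			      struct toy_stats *stats)
{
	if (node->has_free_hugepage)
		return TOY_ALLOCATED;

	/* One bounded, asynchronous compaction attempt. */
	stats->compaction_attempts++;
	if (node->async_compaction_works)
		return TOY_ALLOCATED;

	/* No direct reclaim and no swap: fail, caller uses small pages. */
	return TOY_FAILED;
}
```

On a node where compaction cannot produce a hugepage, the model performs
exactly one compaction attempt and zero reclaim attempts before failing,
which is the bounded cost these processes are willing to pay.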