Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

From: David Rientjes
Date: Sun Oct 28 2018 - 17:45:12 EST


On Mon, 22 Oct 2018, Zi Yan wrote:

> Hi David,
>

Hi!

> On 22 Oct 2018, at 17:04, David Rientjes wrote:
>
> > On Tue, 16 Oct 2018, Mel Gorman wrote:
> >
> > > I consider this to be an unfortunate outcome. On the one hand, we have a
> > > problem that three people can trivially reproduce with known test cases
> > > and a patch shown to resolve the problem. Two of those three people work
> > > on distributions that are exposed to a large number of users. On the
> > > other, we have a problem that requires the system to be in a specific
> > > state and an unknown workload that suffers badly from the remote access
> > > penalties with a patch that has review concerns and has not been proven
> > > to resolve the trivial cases.
> >
> > The specific state is that remote memory is fragmented as well, this is
> > not atypical. Removing __GFP_THISNODE to avoid thrashing a zone will only
> > be beneficial when you can allocate remotely instead. When you cannot
> > allocate remotely instead, you've made the problem much worse for
> > something that should be __GFP_NORETRY in the first place (and was for
> > years) and should never thrash.
> >
> > I'm not interested in patches that require remote nodes to have an
> > abundance of free or unfragmented memory to avoid regressing.
>
> I just wonder what is the page allocation priority list in your environment,
> assuming all memory nodes are so fragmented that no huge pages can be
> obtained without compaction or reclaim.
>
> Here is my version of that list, please let me know if it makes sense to you:
>
> 1. local huge pages: with compaction and/or page reclaim, you are willing
> to pay the penalty of getting huge pages;
>
> 2. local base pages: since, in your system, remote data accesses have much
> higher penalty than the extra TLB misses incurred by the base page size;
>
> 3. remote huge pages: at least it is better than remote base pages;
>
> 4. remote base pages: it performs worst in terms of locality and TLBs.
>

I have a ton of different platforms available. Consider a very basic
access latency evaluation on Broadwell on a running production system:
remote hugepage vs remote PAGE_SIZE pages had about 5% better access
latency. Remote PAGE_SIZE pages vs local pages is a 12% degradation. On
Naples, remote hugepage vs remote PAGE_SIZE had 2% better access latency
intrasocket, no better access latency intersocket. Remote PAGE_SIZE pages
vs local is a 16% degradation intrasocket and 38% degradation intersocket.
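
To put those figures on one scale, here is a rough back-of-the-envelope
sketch. It assumes the quoted percentages compose multiplicatively, that
"better access latency" means a proportional reduction in latency, and
that "local pages" means local PAGE_SIZE pages. The names are
hypothetical and only the percentages come from the measurements above.

```python
# Relative access latency, normalised so a local PAGE_SIZE page = 1.0.
# Assumption: the quoted percentages compose multiplicatively.
MEASUREMENTS = {
    # platform: (remote base page degradation vs local base page,
    #            remote hugepage improvement vs remote base page)
    "Broadwell": (0.12, 0.05),
    "Naples intrasocket": (0.16, 0.02),
    "Naples intersocket": (0.38, 0.00),
}


def remote_hugepage_latency(degradation, hugepage_gain):
    """Latency of a remote hugepage relative to a local PAGE_SIZE page."""
    remote_base = 1.0 + degradation
    return remote_base * (1.0 - hugepage_gain)


for platform, (degradation, gain) in MEASUREMENTS.items():
    relative = remote_hugepage_latency(degradation, gain)
    print(f"{platform}: {relative:.3f}x local base page latency")
```

Under these assumptions a remote hugepage still comes out slower than a
local base page on every platform listed, by roughly 6% on Broadwell and
by 14% to 38% on Naples.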

My list removes (3) from your list, but is otherwise unchanged. I remove
(3) because 2-5% better access latency is nice, but we'd much rather fault
local base pages and then let khugepaged collapse them into a local hugepage
when fragmentation is improved or we have freed memory. That is where we
can get a much better result, 41% better access latency on Broadwell and
52% better access latency on Naples. I wouldn't trade that for 2-5%
immediate remote hugepages.

It just so happens that prior to this patch, the implementation of the
page allocator matches this preference.
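
The difference between the two lists can be sketched as a toy model.
This is not kernel code and not how the page allocator is written; the
enum, the list names and pick() are all hypothetical, and "available"
stands in for whatever could be allocated at fault time.

```python
from enum import Enum


class Candidate(Enum):
    LOCAL_HUGEPAGE = "local hugepage"
    LOCAL_BASE = "local base page"
    REMOTE_HUGEPAGE = "remote hugepage"
    REMOTE_BASE = "remote base page"


# The four-step list proposed in the quoted message.
PROPOSED_ORDER = [
    Candidate.LOCAL_HUGEPAGE,
    Candidate.LOCAL_BASE,
    Candidate.REMOTE_HUGEPAGE,
    Candidate.REMOTE_BASE,
]

# The same list with step (3), remote hugepages, removed.
PREFERRED_ORDER = [
    c for c in PROPOSED_ORDER if c is not Candidate.REMOTE_HUGEPAGE
]


def pick(order, available):
    """Return the first candidate in `order` that is available, else None."""
    for candidate in order:
        if candidate in available:
            return candidate
    return None


# Local node exhausted: only remote memory can satisfy the fault.
remote_only = {Candidate.REMOTE_HUGEPAGE, Candidate.REMOTE_BASE}
print(pick(PROPOSED_ORDER, remote_only).value)
print(pick(PREFERRED_ORDER, remote_only).value)
```

The two orders only disagree once nothing local is available. Whenever a
local base page can be had, both pick it, and that is the case where
khugepaged can later collapse into a local hugepage.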

> In addition, to prioritize local base pages over remote pages,
> the original huge page allocation has to fail, then kernel can
> fall back to base page allocations. And you will never get remote
> huge pages any more if the local base page allocation fails,
> because there is no way back to huge page allocation after the fallback.
>

That is exactly what we want: khugepaged should collapse memory into
local hugepages for the big improvement rather than have us persistently
access a hugepage remotely. The win of the remote hugepage just isn't
substantial enough, and the win of the local hugepage is just too great.

> > I'd like to know, specifically:
> >
> > - what measurable effect my patch has that is better solved with removing
> > __GFP_THISNODE on systems where remote memory is also fragmented?
> >
> > - what platforms benefit from remote access to hugepages vs accessing
> > local small pages (I've asked this maybe 4 or 5 times now)?
> >
> > - how is reclaiming (and possibly thrashing) memory helpful if compaction
> > fails to free an entire pageblock because of slab fragmentation caused
> > by low-memory conditions and the page allocator's preference to return
> > node-local memory?
> >
> > - how is reclaiming (and possibly thrashing) memory helpful if compaction
> > cannot access the memory reclaimed because the freeing scanner has
> > already passed by it, or the migration scanner has passed by it, since
> > this reclaim is not targeted to pages it can find?
> >
> > - what metrics can be introduced to the page allocator so that we can
> > determine that reclaiming (and possibly thrashing) memory will result
> > in a hugepage being allocated?
>
> The slab fragmentation and whether reclaim/compaction can help form
> huge pages seem to be orthogonal to this patch, which tries to decide
> the priority between locality and huge pages.
>

It's not orthogonal to the problem being reported which requires local
memory pressure. If there is no memory pressure, compaction often can
succeed without reclaim because the freeing scanner can find target
memory and the migration scanner can make a pageblock free. Under memory
pressure, however, where Andrea is experiencing the thrashing of the local
node, by this time it can be inferred that slab pages have already fallen
back to MIGRATE_MOVABLE pageblocks. There is nothing preventing it under
memory pressure because of the preference to return local memory over
fragmenting pageblocks.

So the point of slab fragmentation, which typically exists locally when
there is memory pressure, is that we cannot ascertain whether memory
compaction even with reclaim will be successful. Not only because the
freeing scanner cannot access reclaimed memory, but also because we have
no feedback from compaction to determine whether the work will be useful.
Thrashing the local node, migrating COMPACT_CLUSTER_MAX pages, finding one
slab page sitting in the pageblock, and continuing is not a good use of
the allocator's time. This is true of both MADV_HUGEPAGE and
non-MADV_HUGEPAGE regions.

For reclaim to be considered, we should ensure that work is useful to
compaction. That ability is nonexistent. The worst case scenario is you
thrash the local node and still cannot allocate a hugepage.