Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
From: David Rientjes
Date: Mon Oct 15 2018 - 18:30:21 EST

On Wed, 10 Oct 2018, David Rientjes wrote:
> > I think "madvise vs mbind" is more an issue of "no-permission vs
> > permission" required. And if the process ends up swapping out all
> > other processes with their memory already allocated in the node, I think
> > some permission is correct to be required, in which case an mbind
> > looks a better fit. MPOL_PREFERRED also looks a first candidate for
> > investigation as it's already not black and white and allows spillover
> > and may already do the right thing in fact if set on top of
> > MADV_HUGEPAGE.
> >
>
> We would never want to thrash the local node for hugepages because there
> is no guarantee that any swapping is useful. On COMPACT_SKIPPED due to
> low memory, we have very clear evidence that pageblocks are already
> sufficiently fragmented by unmovable pages such that compaction itself,
> even with abundant free memory, fails to free an entire pageblock due to
> the allocator's preference to fragment pageblocks of fallback migratetypes
> over returning remote free memory.
>
> As I've stated, we do not want to reclaim pointlessly when compaction is
> unable to access the freed memory or there is no guarantee it can free an
> entire pageblock. Doing so allows thrashing of the local node, or remote
> nodes if __GFP_THISNODE is removed, and the hugepage still cannot be
> allocated. If this proposed mbind() that requires permissions is geared
> to me as the user, I'm afraid the details of what leads to the thrashing
> are not well understood because I certainly would never use this.
>

At the risk of beating a dead horse that has already been beaten, what are
the plans for this patch when the merge window opens? It would be rather
unfortunate for us to start incurring a 14% increase in access latency and
40% increase in fault latency. Would it be possible to test with my
patch[*], which addresses the thrashing issue by not attempting reclaim?
If that is satisfactory, I don't have a strong preference if it is done with
a hardcoded pageblock_order and __GFP_NORETRY check or a new
__GFP_COMPACT_ONLY flag.
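
To make the two options concrete, here is a minimal user-space sketch of
the decision, not the patch itself. Every name below (the sketch_ prefix,
the flag bit values, the pageblock order of 9) is a hypothetical stand-in
chosen for illustration; only pageblock_order, __GFP_NORETRY and
__GFP_COMPACT_ONLY come from the discussion above.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the kernel's real values. */
#define SKETCH_GFP_NORETRY      (1u << 0)
#define SKETCH_GFP_COMPACT_ONLY (1u << 1)
#define SKETCH_PAGEBLOCK_ORDER  9u  /* assumed: 2MB pageblocks, 4KB pages */

/*
 * Option 1: hardcoded check. After compaction has failed, skip direct
 * reclaim when the request is at least pageblock sized and the caller
 * asked not to retry. All other requests may still reclaim.
 */
static bool sketch_may_reclaim_hardcoded(unsigned int gfp, unsigned int order)
{
	bool pageblock_sized = order >= SKETCH_PAGEBLOCK_ORDER;
	bool noretry = (gfp & SKETCH_GFP_NORETRY) != 0;

	return !(pageblock_sized && noretry);
}

/*
 * Option 2: dedicated flag. The caller states explicitly that only
 * compaction should be attempted, independent of the order.
 */
static bool sketch_may_reclaim_flag(unsigned int gfp)
{
	return (gfp & SKETCH_GFP_COMPACT_ONLY) == 0;
}
```

Under these assumptions both variants return false for a hugepage fault,
so reclaim is skipped once compaction has failed; they differ only in
whether the caller opts in explicitly or is inferred from the order and
__GFP_NORETRY.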

I think the second issue of faulting remote thp by removing __GFP_THISNODE
needs supporting evidence that shows some platforms benefit from this (and
not with numa=fake on the command line :).

[*] https://marc.info/?l=linux-kernel&m=153903127717471