Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

From: David Rientjes
Date: Wed Dec 05 2018 - 16:59:38 EST


On Wed, 5 Dec 2018, Andrea Arcangeli wrote:

> > thpscale Percentage Faults Huge
> > 4.20.0-rc4 4.20.0-rc4
> > mmots-20181130 gfpthisnode-v1r1
> > Percentage huge-3 95.14 ( 0.00%) 7.94 ( -91.65%)
> > Percentage huge-5 91.28 ( 0.00%) 5.00 ( -94.52%)
> > Percentage huge-7 86.87 ( 0.00%) 9.36 ( -89.22%)
> > Percentage huge-12 83.36 ( 0.00%) 21.03 ( -74.78%)
> > Percentage huge-18 83.04 ( 0.00%) 30.73 ( -63.00%)
> > Percentage huge-24 83.74 ( 0.00%) 27.47 ( -67.20%)
> > Percentage huge-30 83.66 ( 0.00%) 31.85 ( -61.93%)
> > Percentage huge-32 83.89 ( 0.00%) 29.09 ( -65.32%)
> >
> > They're down the toilet. 3 threads are able to get 95% of the requested
> > THP pages with Andrews tree as of Nov 30th. David's patch drops that to
> > 8% success rate.
>
> This is the downside of David's patch very well exposed above. And
> this will make non-NUMA system regress like above too despite they
> have no issue to begin with (which is probably why nobody noticed the
> trouble with __GFP_THISNODE reclaim until recently, combined with the
> fact most workloads can fit in a single NUMA node).
>
> So we're effectively crippling down MADV_HUGEPAGE effectiveness on
> non-NUMA (where it cannot help to do so) and on NUMA (as a workaround
> for the false positive swapout storms) because in some workload and
> system THP improvements are less significant than NUMA improvements.
>

For context, you're referring to the patch I posted that is similar to
__GFP_COMPACT_ONLY and patch 2/2 in my series. It's not referring to the
revert of the 4.20-rc commit that relaxes the __GFP_THISNODE restriction
on thp faults and conflates MADV_HUGEPAGE with NUMA locality. For 4.20, I
believe at minimum that patch 1/2 should be merged to restore what we have
had for three years, stop piling more semantics on top of the intent (or
perceived intent) of MADV_HUGEPAGE, and address the swap storm issue
separately.

> The higher fault latency is generally the higher cost you pay to get
> the good initial THP utilization for apps that do long lived
> allocations and in turn can use MADV_HUGEPAGE without downsides. The
> cost of compaction pays off over time.
>
> Short lived allocations sensitive to the allocation latency should not
> use MADV_HUGEPAGE in the first place. If you don't want high latency
> you shouldn't use MADV_HUGEPAGE and khugepaged already uses
> __GFP_THISNODE but it replaces memory so it has a neutral memory
> footprint at it, so it's ok with regard to reclaim.
>

Completely agreed, and is why we want to try synchronous memory compaction
to try to allocate hugepages locally in our usecases as well. We aren't
particularly concerned about the allocation latency, that is secondary to
the long-lived access latency regression that occurs when you do not set
__GFP_THISNODE.

> In my view David's workload is the outlier that uses MADV_HUGEPAGE but
> pretends a low latency and NUMA local behavior as first priority. If
> your workload fits in the per-socket CPU cache it doesn't matter which
> node it is but it totally matters if you've 2M or 4k tlb. I'm not even
> talking about KVM where THP has a multipler effect with EPT.
>

Hm, no, we do not mind the high allocation latency for MADV_HUGEPAGE
users. We *do* care about access latency and that is due to NUMA
locality. Before commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
MADV_HUGEPAGE mappings"), *all* thp faults were done with __GFP_THISNODE
and had been for at least three years. That commit conflates
MADV_HUGEPAGE with a new semantic that it allows remote allocation instead
of what it has done for three years: try harder synchronously to allocate
hugepages locally. We obviously need to address the problem in another
way and not change long-standing behavior that causes regressions. Either
my patch 2/2, __GFP_COMPACT_ONLY, a new mempolicy mode, new madvise mode,
prctl, etc.

> Even if you make the __GFP_NORETRY change for the HPAGE_PMD_ORDER to
> skip reclaim in David's patch conditional NUMA being enabled in the
> host (so that it won't cripple THP utilization also on non-NUMA
> systems), imagine that you go in the bios, turn off interleaving to
> enable host NUMA and THP utilization unexpectedly drops significantly
> for your VM.
>

What's needed is appropriate feedback from memory compaction to determine
if reclaim is worthwhile: checking only COMPACT_DEFERRED is insufficient.
We need to determine if compaction has failed due to order-0 low watermark
checks or whether it simply failed to defragment memory so a hugepage
could be allocated. Determining if compaction has failed due to order-0
low watermark checks is harder than it seems because the reclaimed memory
may not be accessible by isolate_freepages(); we don't have the ability to
only reclaim memory from the end of the zone. Otherwise, reclaim is very
unlikely to help.

> Rome ryzen architecture has been mentioned several times by David but
> in my threadripper (not-Rome, as it's supposed to be available in 2019
> only AFIK) enabling THP made a measurable difference for me for some
> workloads. As opposed if I turn off NUMA by setting up the
> interleaving in the dimm I get a barely measurable slowdown. So I'm
> surprised in Rome there's such a radical difference in behavior.
>

We can compare Naples if that's better; accessing remote hugepages over
local pages of the native page size incurs a 13.8% latency increase
intrasocket and 38.9% latency increase intersocket. Accessing remote
hugepages over remote pages of the native page size is 2.2% better
intrasocket and unchanged intersocket.

> Like Mel said we need to work towards a more complete solution than
> putting __GFP_THISNODE from the outside and then turning off reclaim
> from the inside. Mel made examples of things that should
> happen, that won't increase allocation latency and that can't happen
> with __GFP_THISNODE.
>
> I'll try to describe again what's going on:
>
> 1: The allocator is being asked through __GFP_THISNODE "ignore all
> remote nodes for all reclaim and compaction" from the
> outside. Compaction then returns COMPACT_SKIPPED and tells the
> allocator "I can generate many more huge pages if you reclaim/swapout
> 2M of anon memory in this node, the only reason I failed to compact
> memory is because there aren't enough 4k fragmented pages free in this
> zone". The allocator then goes ahead and swaps 2M and invokes
> compaction again that succeeds the order 9 allocation fine. Goto 1;
>
> The above keeps running in a loop at every additional page fault of
> the app using MADV_HUGEPAGE until all RAM of the node is swapped out
> and replaced by THP and all others nodes had 100% free memory,
> potentially 100% order 9, but the allocator completely ignored all
> other nodes. That is the thing that we're fixing here, because such
> swap storms caused massive slowdowns. If the workload can't fit in a
> single node, it's like running with only a fraction of the RAM.
>
> So David's patch (and __GFP_COMPACT_ONLY) to fix the above swap storm,
> inside the allocator skips reclaim entirely when compaction tells "I
> can generate one more HPAGE_PMD_ORDER compound page if you
> reclaim/swap 2M", if __GFP_NORETRY is set (and makes sure
> __GFP_NORETRY is always set for THP). And that however prevents to
> generate any more THP globally the moment any node is full of
> filesystem cache.
>
> NOTE: the filesystem cache will still be shrunk but it'll be shrunk by
> 4k allocations only. So we just artificially cripple compaction with
> David's patch as shown in the quoted results above. This applied to
> __GFP_COMPACT_ONLY too and that's I always said there's lots of margin
> for improvement there and even __GFP_COMPACT_ONLY was also a stop-gap
> measure.
>

Right, this is all the same as __GFP_COMPACT_ONLY which implicitly set
__GFP_NORETRY as part of the gfp flag itself.

> So ultimately we decided that the saner behavior that gives the least
> risk of regression for the short term, until we can do something
> better, was the one that is already applied upstream.
>
> Of course David's workload regressed, but that's because it gets a
> minuscle improvement from THP, maybe it's seeking across all RAM and
> it's very RAM heavy bandwidth-heavy workload so 4k or 2m tlb don't
> matter at all for his workload and probably he's running it on bare
> metal only.
>
> I think the challenge here is to get David's workload optimally
> without creating the above regression but we don't have a way to do it
> right now in an automatic way. It's trivial however to add a mbind new
> MPOL_THISNODE or MPOL_THISNODE_THP policy to force THP to set
> __GFP_THISNODE and return to the swap storm behavior that needle to
> say may have worked best by practically partioning the system, in fact
> you may want to use __GFP_THISNODE for 4k allocations too so you
> invoke reclaim from the local node before allocating RAM from the
> remote nodes.
>

I'm not sure why we'd be changing the default behavior of three years that
causes reported regressions instead of introducing a mempolicy that allows
for the remote allocation.

This commit from 4.0:

commit 077fcf116c8c2bd7ee9487b645aa3b50368db7e1
Author: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxxxxxxx>
Date: Wed Feb 11 15:27:12 2015 -0800

mm/thp: allocate transparent hugepages on local node

This make sure that we try to allocate hugepages from local node if
allowed by mempolicy. If we can't, we fallback to small page allocation
based on mempolicy. This is based on the observation that allocating
pages on local node is more beneficial than allocating hugepages on remote
node.

With this patch applied we may find transparent huge page allocation
failures if the current node doesn't have enough freee hugepages. Before
this patch such failures result in us retrying the allocation on other
nodes in the numa node mask.

[akpm@xxxxxxxxxxxxxxxxxxxx: fix comment, add CONFIG_TRANSPARENT_HUGEPAGE dependency]
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxxxxxxx>
Acked-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
Acked-by: Vlastimil Babka <vbabka@xxxxxxx>
Cc: David Rientjes <rientjes@xxxxxxxxxx>
Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>

Is still needed, and the reason why I'm requesting we maintain that
behavior of nearly four years and not conflate MADV_HUGEPAGE any more than
it already is.

> To me it doesn't seem the requirements of David's workload are the
> same as for other MADV_HUGEPAGE users, I can't image other
> MADV_HUGEPAGE users not to care at all if the THP utilization drops,
> for David it seems THP is a nice thing to have only, and it seems to
> care about allocation latency too (which normal apps using
> MADV_HUGEPAGE must not).
>

I have repeatedly said that allocation latency is secondary to the access
latency regression that not setting __GFP_THISNODE causes with a
fragmented local node on all of our platforms.

> In any case David's patch is better than reverting the revert, as the
> swap storms are a showstopper compared to crippling down compaction
> ability to compact memory when all nodes are full of cache.
>

I'm going to propose a v2 of my two patch series, including the feedback
from Michal in the first. I suggest the first step is to restore the
behavior we have had for three years to prevent the regression I'm
reporting and the kernel test robot has reported, and then look to
improvements that can be made to prevent local thrashing for users who
would rather allocate remote hugepages (and perhaps incur the access
latency issues if their workload does not span multiple nodes) instead of
trying harder to allocate locally.