Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages

From: David Rientjes
Date: Sun Sep 08 2019 - 16:45:21 EST


On Sun, 8 Sep 2019, Vlastimil Babka wrote:

> > On Sat, 7 Sep 2019, Linus Torvalds wrote:
> >
> >>> Andrea acknowledges the swap storm that he reported would be fixed with
> >>> the last two patches in this series
> >>
> >> The problem is that even you aren't arguing that those patches should
> >> go into 5.3.
> >>
> >
> > For three reasons: (a) we lack a test result from Andrea,
>
> That's an argument against the rfc patches 3+4, no? But not for including
> the reverts of reverts of reverts (patches 1+2).
>

Yes, thanks: I would strongly prefer not to propose rfc patches 3-4
without a testing result from Andrea and collaboration to fix the
underlying issue. My suggestion to Linus is to merge patches 1-2 so we
don't have additional semantics for MADV_HUGEPAGE or thp enabled=always
configs based on kernel version, especially since they are already
conflated.

> > (b) there's
> > on-going discussion, particularly based on Vlastimil's feedback, and
>
> I doubt this will be finished and tested with reasonable confidence even
> for the 5.4 merge window.
>

Depends, but I suspect the same. If the reverts are not applied to 5.3,
then I'm not at all confident that forward progress on this issue will be
made: my suggestion about changes to the page allocator when the patches
were initially proposed went unanswered, as did the ping on those
suggestions, and now we have a simplistic "this will fix the swap storms"
but no active involvement from Andrea to improve this; he likely is quite
content with lumping NUMA policy onto an already overloaded madvise mode.

[ NOTE! The rest of this email and my responses are about how to address
the default page allocation behavior, which we can continue to discuss,
but I'd prefer it separated from the discussion of reverts for 5.3,
which needs to be done so that madvise modes are not conflated with
mempolicies for a subset of kernel versions. ]

> > It indicates that progress has been made to address the actual bug without
> > introducing long-lived access latency regressions for others, particularly
> > those who use MADV_HUGEPAGE. In the worst case, some systems running
> > 5.3-rc4 and 5.3-rc5 have the same amount of memory backed by hugepages but
> > on 5.3-rc5 the vast majority of it is allocated remotely. This incurs a
>
> It's been said before, but such sensitive code generally relies on
> mempolicies or node reclaim mode, not THP __GFP_THISNODE implementation
> details. Or if you know there's enough free memory and it just needs to be
> compacted, you could do it once via sysfs before starting up your workload.
>

This entire discussion is based on the long-standing and default behavior
of page allocation for transparent hugepages. Your suggestions are not
possible for two reasons: (1) I cannot enforce a mempolicy of MPOL_BIND
because this doesn't allow fallback at all and would oom kill if the local
node is oom, and (2) node reclaim mode is a system-wide setting so all
workloads are affected for every page allocation, not only users of
MADV_HUGEPAGE who specifically opt in to expensive allocation.

We could make the argument that Andrea's qemu usecase could simply use
MPOL_PREFERRED for memory that should be faulted remotely, which would
provide more control and would work for all versions of Linux regardless
of whether MADV_HUGEPAGE is used; that's a much simpler workaround than
conflating MADV_HUGEPAGE with NUMA locality, asking users who are adversely
affected by 5.3 to create new mempolicies to work around something that
has always worked fine, or asking users to tune page allocator policies
with sysctls.
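
For illustration only, a minimal sketch of what setting MPOL_PREFERRED
looks like from userspace, written in Python via ctypes. It assumes
libnuma is installed and exports its set_mempolicy() wrapper; the helper
names set_policy and prefer_node and the choice of node 0 are
hypothetical, and the MPOL_* values are copied from the uapi header.
Unlike MPOL_BIND, the preferred policy still falls back to other nodes
when the preferred node has no free memory.

```python
import ctypes
import ctypes.util

# Mode values from include/uapi/linux/mempolicy.h.
MPOL_DEFAULT = 0
MPOL_PREFERRED = 1


def set_policy(mode, node_bits):
    """Call set_mempolicy() through libnuma; return True on success."""
    path = ctypes.util.find_library("numa")
    if path is None:
        return False  # libnuma not found (an assumption of this sketch)
    try:
        lib = ctypes.CDLL(path, use_errno=True)
        fn = lib.set_mempolicy
    except (OSError, AttributeError):
        return False
    fn.argtypes = [ctypes.c_int, ctypes.POINTER(ctypes.c_ulong),
                   ctypes.c_ulong]
    fn.restype = ctypes.c_long
    mask = ctypes.c_ulong(node_bits)
    # maxnode: upper bound on the number of mask bits the kernel reads.
    maxnode = ctypes.sizeof(ctypes.c_ulong) * 8
    return fn(mode, ctypes.byref(mask), maxnode) == 0


def prefer_node(node):
    """Prefer `node` for this thread's allocations, with fallback allowed."""
    return set_policy(MPOL_PREFERRED, 1 << node)


if __name__ == "__main__":
    print("MPOL_PREFERRED on node 0 set:", prefer_node(0))
    # Restore the default policy; MPOL_DEFAULT takes an empty node set.
    set_policy(MPOL_DEFAULT, 0)
```

A False return only means the policy was not applied (no libnuma, the
syscall is filtered, or the node is not allowed); allocations then keep
following whatever policy was already in effect.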

> > I'm arguing to revert 5.3 back to the behavior that we have had for years
> > and actually fix the bug that everybody else seems to be ignoring and then
> > *backport* those fixes to 5.3 stable and every other stable tree that can
> > use them. Introducing a new mempolicy for NUMA locality into 5.3.0 that
>
> I think it's rather removing the problematic implicit mempolicy of
> __GFP_THISNODE.
>

I'm referring to a solution that is backwards compatible for existing
users, which 5.3 certainly is not.

> I might have missed something, but you were asked for a reproducer of
> your use case so others can develop patches with it in mind? Mel did
> provide a simple example that shows the swap storms very easily.
>

Are you asking for a synthetic kernel module that you can inject to induce
fragmentation on a local node where memory compaction would be possible
and then a userspace program that uses MADV_HUGEPAGE and fits within that
node? The regression I'm reporting is for workloads that fit within a
socket; it requires local fragmentation to show a regression.

For the qemu case, it's quite easy to fill a local node and require
additional hugepage allocations with MADV_HUGEPAGE in a test case, but
without synthetically inducing fragmentation I cannot provide a testcase
that will show a performance regression because memory is quickly faulted
remotely rather than compacted locally.
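
The userspace half of such a test is small. A minimal sketch in Python,
assuming a Linux kernel with THP support and 2 MiB hugepages; the names
fault_region and anon_huge_kb are hypothetical. It does not induce
fragmentation, so by itself it does not reproduce the regression
described above.

```python
import mmap

THP_SIZE = 2 * 1024 * 1024  # assumption: 2 MiB hugepages, as on x86-64


def anon_huge_kb():
    """AnonHugePages of this process in kB, or None if it cannot be read."""
    try:
        with open("/proc/self/smaps_rollup") as f:
            for line in f:
                if line.startswith("AnonHugePages:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return None


def fault_region(nr_hugepages=8):
    """Map private anonymous memory, request THP, and touch every page."""
    length = nr_hugepages * THP_SIZE
    region = mmap.mmap(-1, length,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    advice = getattr(mmap, "MADV_HUGEPAGE", None)
    if advice is not None:
        try:
            region.madvise(advice)  # opt this mapping in to hugepages
        except OSError:
            pass  # e.g. kernel built without THP support
    for off in range(0, length, mmap.PAGESIZE):
        region[off] = 1  # first write to each page triggers the fault
    return region


if __name__ == "__main__":
    before = anon_huge_kb()
    region = fault_region()
    print("AnonHugePages before/after (kB):", before, anon_huge_kb())
    region.close()
```

Comparing AnonHugePages before and after shows how much of the mapping
ended up backed by hugepages. It says nothing about which node the pages
came from; /proc/self/numa_maps would have to be consulted for that.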