Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

From: Andrea Arcangeli
Date: Thu Oct 04 2018 - 17:10:34 EST


Hello David,

On Thu, Oct 04, 2018 at 01:16:32PM -0700, David Rientjes wrote:
> There are ways to address this without introducing regressions for
> existing users of MADV_HUGEPAGE: introduce an madvise() mode to accept
> remote thp allocations, which users of this library would never set, or
> fix memory compaction so that it does not incur substantial allocation
> latency when it will likely fail.

These librarians needs to call a new MADV_ and the current
MADV_HUGEPAGE should not be affected because the new MADV_ will
require some capbility (i.e. root privilege).

qemu was the first user of MADV_HUGEPAGE and I don't think it's fair
to break it and require change to it to run at higher privilege to
retain the direct compaction behavior of MADV_HUGEPAGE.

The new behavior you ask to retain in MADV_HUGEPAGE, generated the
same misbehavior to VM as mlock could have done too, so it can't just
be given by default without any privilege whatsoever.

Ok you could mitigate the breakage that MADV_HUGEPAGE could have
generated (before the recent fix) by isolating malicious or
inefficient programs with memcg, but by default in a multiuser system
without cgroups the global disruption provided before the fix
(i.e. the pathological THP behavior) is not warranted. memcg shouldn't
be mandatory to avoid a process to affect the VM in such a strong way
(i.e. all other processes who happened to be allocated in the node
where the THP allocation triggered, being trashed in swap like if all
memory of all other nodes was not completely free).

Not only that, it's not only about malicious processes it's also
excessively inefficient for processes that just don't fit in a local
node and use MADV_HUGEPAGE. Your processes all fit in the local node
for sure if they're happy about it. This was reported as a
"pathological THP regression" after all in a workload that couldn't
swap at all because of the iommu gup persistent refcount pins.

The alternative patch I posted which still invoked direct reclaim
locally, and falled back to NUMA-local PAGE_SIZEd allocations instead
of falling back to NUMA-remote THP, could have been given without
privilege, but it still won't swapout as this patch would have done,
so it still won't go as far as the MADV_HUGEPAGE behavior had before
the fix (for the lib users that strongly needed local memory).

Overall I think the call about the default behavior of MADV_HUGEPAGE
is still between removing __GFP_THISNODE if gfp_flags can reclaim (the
fix in -mm), or by changing direct compaction to only call compaction
and not reclaim (i.e. __GFP_COMPACT_ONLY) when __GFP_THISNODE is set.

To go beyond that some privilege is needed and a new MADV_ flag can
require privilege or return error if there's not enough privilege. So
the lib with 100's users can try to use that new flag first, show an
error in stderr (maybe under debug), and fallback to MADV_HUGEPAGE if
the app hasn't enough privilege. The alternative is to add a new mem
policy less strict than MPOL_BIND to achieve what you need on top of
MADV_HUGEPAGE (which also would require some privilege of course as
all mbinds). I assume you already evaluated the preferred and local
mbinds and it's not a perfect fit?

If we keep this as a new MADV_HUGEPAGE_FORCE_LOCAL flag, you could
still add a THP sysfs/sysctl control to lift the privilege requirement
marking it as insecure setting in docs
(mm/transparent_hugepage/madv_hugepage_force_local=0|1 forced to 0 by
default). This would be on the same lines of other sysctl that
increase the max number of files open and such things (perhaps a
sysctl would be better in fact for tuning in /etc/sysctl.conf).

Note there was still some improvement left possible in my
__GFP_COMPACT_ONLY patch alternative. Notably if the watermarks for
the local node shown the local node not to have enough real "free"
PAGE_SIZEd pages to succeed the PAGE_SIZEd local THP allocation if
compaction failed, we should have relaxed __GFP_THISNODE and tried to
allocate THP from the NUMA-remote nodes before falling back to
PAGE_SIZEd allocations. That also won't require any new privilege.

To get the same behavior as before though you would need to drop all
caches with echo 3 >drop_caches before the app starts (and it still
won't swap anon memory which previous MADV_HUGEPAGE behavior would
have, but the whole point is that the previous MADV_HUGEPAGE behavior
would have backfired for the lib too if the app was allocating so much
RAM from the local node that it required swapouts of local anon
memory).

Thanks,
Andrea