Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

From: Andrea Arcangeli
Date: Tue Oct 09 2018 - 18:21:36 EST

Next message: Dennis Zhou: "[GIT PULL] percpu fixes for-4.19-rc8"
Previous message: Joel Fernandes (Google): "[PATCH v2 1/2] mm: Add an F_SEAL_FS_WRITE seal to memfd"
In reply to: Vlastimil Babka: "Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, Oct 08, 2018 at 01:41:09PM -0700, David Rientjes wrote:
> The page allocator is expecting __GFP_NORETRY for thp allocations per its
> comment:
>
> /*
> * Checks for costly allocations with __GFP_NORETRY, which
> * includes THP page fault allocations
> */
> if (costly_order && (gfp_mask & __GFP_NORETRY)) {
>
> And that enables us to check compact_result to determine whether thp
> allocation should fail or continue to reclaim. I don't think it helps
> that some thp fault allocations use __GFP_NORETRY and others do not. I
> think that deserves a fix to alloc_hugepage_direct_gfpmask() or
> GFP_TRANSHUGE_LIGHT.

With your patch that changes how __GFP_NORETRY works for order =
pageblock_order then this can help. However it solves nothing for
large skbs and all other "costly_order" allocations. So if you think
it's so bad to have a 40% increased allocation latency if all nodes
are heavily fragmented for THP under MADV_HUGEPAGE (which are
guaranteed to be long lived allocations or they wouldn't set
MADV_HUGEPAGE in the first place), I'm not sure why you ignore the
same exact overhead for all other "costly_order" allocations that
would never set __GFP_THISNODE and will not even use
GFP_TRANSHUGE_LIGHT so they may not set __GFP_NORETRY at all.

I think it would be better to view the problem from a generic
"costly_order" allocation prospective without so much "hardcoded"
focus on THP MADV_HUGEPAGE only (which in fact are the least concern
of all in terms of that 40% latency increase if all nodes are totally
fragmented and compaction fails on all nodes, because they're long
lived so the impact of the increased latency is more likely to be lost
in the noise).

> Our library that uses MADV_HUGEPAGE only invokes direct compaction because
> we're on an older kernel: it does not attempt to do reclaim to make
> compaction happy so that it finds memory that it can migrate memory to.
> For reference, we use defrag setting of "defer+madvise". Everybody who
> does not use MADV_HUGEPAGE kicks off kswapd/kcompactd and fails,
> MADV_HUGEPAGE users do the same but also try direct compaction. That
> direct compaction uses a mode of MIGRATE_ASYNC so it normally fails
> because of need_resched() or spinlock contention.
> These allocations always fail, MADV_HUGEPAGE or otherwise, without
> invoking direct reclaim.

Even older kernels (before "defer+madvise" option was available)
always invoked reclaim if COMPACT_SKIPPED was returned. The
COMPACT_SKIPPED behavior I referred that explains why __GFP_NORETRY
doesn't prevent reclaim to be invoked, is nothing new, it always
worked that way from the day compaction was introduced.

So it's fairly strange that your kernel doesn't call reclaim at all if
COMPACT_SKIPPED is returned.

> I am agreeing with both you and Mel that it makes no sense to thrash the
> local node to make compaction happy and then hugepage-order memory
> available. I'm only differing with you on the mechanism to fail early: we
> never want to do attempt reclaim on thp allocations specifically because
> it leads to the behavior you are addressing.

Not sure I can agree with the above.

If all nodes (not only the local one) are already below the watermarks
and compaction returns COMPACT_SKIPPED, there is zero global free
memory available, we would need to swap anyway to succeed the 2M
allocation. So it's better to reclaim 2M from on node and then retry
compaction again on the same node if compaction is failing on the node
because of COMPACT_SKIPPED and the global free memory is zero. If it
swapouts it would have swapped out anyway but this way THP will be
returned.

It's just __GFP_THISNODE that broke the above logic. __GFP_THISNODE
may just be safe to use only if the global amount of free memory (in
all nodes) is zero (or below HPAGE_PMD_SIZE generally speaking).

> My contention is that removing __GFP_THISNODE papers over the problem,
> especially in cases where remote memory is also fragmnented. It incurs a
> much higher (40%) fault latency and then incurs 13.9% greater access
> latency. It is not a good result, at least for Haswell, Naples, and Rome.
>
> To make a case that we should fault hugepages remotely as fallback, either
> for non-MADV_HUGEPAGE users who do not use direct compaction, or
> MADV_HUGEPAGE users who use direct compaction, we need numbers that
> suggest there is a performance benefit in terms of access latency to
> suggest that it is better; this is especially the case when the fault
> latency is 40% higher. On Haswell, Naples, and Rome, it is quite obvious
> that this patch works much harder to fault memory remotely that incurs a
> substantial performance penalty when it fails and in the cases where it
> succeeds we have much worse access latency.

What we're fixing is a effectively a breakage and insecure behavior
too and that must be fixed first. A heavily multithreaded processes
with its memory not fitting in a single NUMA node is too common of a
workload to misbehave so bad as it did. The above issue is a much
smaller concern and it's not a breakage nor insecure.

__GFP_THISNODE semantics are not enough to express what you need
there.

Like Mel said we need two tests, one that represents the "pathological
THP allocation behavior" which is trivial (any heavy multithreaded app
accessing all memory of all nodes from all CPUs will do). And one that
represents your workload.

By thinking about it, it's almost impossible that eliminating 100% of
swapouts from the multithreaded app runtime, will not bring a benefit
that is orders of magnitude bigger than the elimination of remote
compaction and having more local THP pages for your app.

It's just on a different scale and the old behavior was insecure to
begin with because it allowed a process to heavily penalize the
runtime of other apps running with different uid, without requiring
any privilege.

> For users who bind their applications to a subset of cores on a NUMA node,
> fallback is egregious: we are guaranteed to never have local access
> latency and in the case when the local node is fragmented and remote
> memory is not our fault latency goes through the roof when local native
> pages would have been much better.
>
> I've brought numbers on how poor this patch performs, so I'm asking for a
> rebuttal that suggests it is better on some platforms. (1) On what
> platforms is it better to fault remote hugepages over local native pages?
> (2) What is the fault latency increase when remote nodes are fragmented as
> well?

That I don't think is the primary issue, the issue here is the one of
the previous paragrah.

Then there may be workloads and/or platforms where allocating remote
THP instead of local PAGE_SIZEd memory is better regardless and
__GFP_THISNODE hurted twice (but we could ignore those for now, and
just focus where __GFP_THISNODE provided a benefit like for your app
using the lib).

> > Like Mel said, your app just happens to fit in a local node, if the
> > user of the lib is slightly different and allocates 16G on a system
> > where each node is 4G, the post-fix MADV_HUGEPAGE will perform
> > extremely better also for the lib user.
> >
>
> It won't if the remote memory is fragmented; the patch is premised on the

How could compaction run slower than heavy swapping gigabytes of
memory?

> argument that remote memory is never fragmented or under memory pressure
> otherwise it is multiplying the fault latency by the number of nodes.
> Sure, creating test cases where the local node is under heavy memory
> pressure yet remote nodes are mostly free will minimize the impact this
> has on fault latency. It still is a substantial increase, as I've
> measured on Haswell, but the access latency forever is the penalty. This
> patch cannot make access to remote memory faster.

I still doubt you measured the performance of the app using the lib
when the app requires more 4 times the size of the local NUMA node
under "taskset -c 0". If you do that, the app using the lib shall fall
into the "pathological THP allocation behavior" too. As long as the
spillover is 10-20% perhaps it's not as bad and if it's not heavily
multithreaded doing scattered random access like a virtual machine
will do, NUMA balancing will hide the issue by moving the thread to
CPU of the NUMA node where the 10-20% spillover happened (which is why
I asked "taskset -c 0" to simulate a multithreaded app accessing all
memory in a scattered way without locality.

The multithreaded app with scattered access is not unreasonable
workload and it must perform well: all workloads using MPOL_INTERLEAVE
would fit the criteria, not just qemu.

> > And you know, if the lib user fits in one node, it can use mbind and
> > it won't hit OOM... and you'd need some capability giving the app
> > privilege anyway to keep MADV_HUGEPAGE as deep and unfair to the rest
> > of the processes running the local node (like mbind and mlock require
> > too).
> >
>
> No, the lib user does not fit into a single node, we have cases where
> these nodes are under constant memory pressure. We are on an older
> kernel, however, that fails because of need_resched() and spinlock
> contention rather than falling back to local reclaim.

It's unclear which need_resched() and spinlock contention induces this
failure.

> > Did you try the __GFP_COMPACT_ONLY patch? That won't have the 40%
> > fault latency already.
> >
>
> I remember Kirill saying that he preferred node local memory allocation
> over thp allocation in the review, which I agree with, but it was much
> better than this patch. I see no reason why we should work so hard to
> reclaim memory, thrash local memory, and swap so that the compaction
> freeing scanner can find memory when there is *no* guarantee that the
> migration scanner can free an entire pageblock. That is completely
> pointless work, even for an MADV_HUGEPAGE user, and we certainly don't
> have such pathological behavior on older kernels.

The compaction scanner is succeeding at migrating the entire
pageblock every time. The local node isn't so fragmented, it's just
full of anonymous memory that is getting swapped out at every further
invocation of reclaim. If it was fragmented __GFP_NORETRY would have
had an effect because compaction would have been deferred.

> Completely agreed, and I'm very strongly suggesting that we do not invoke
> reclaim. Memory compaction provides no guarantee that it can free an
> entire pageblock, even for MIGRATE_ASYNC compaction under MADV_HUGEPAGE it
> will attempt to migrate SWAP_CLUSTER_MAX pages from a pageblock without
> considering if the entire pageblock could eventually become freed for
> order-9 memory. The vast majority of the time this reclaim could be
> completely pointless. We cannot use it.

If we don't invoke reclaim __GFP_THISNODE is fine to keep, however
that means direction compaction doesn't work at all if all nodes are
full of filesystem cache. This was a drawback of __GFP_COMPACT_ONLY
too. A minor one compared to the corner case created by
__GFP_THISNODE.

> We don't need __GFP_NORETRY anymore with this, it wouldn't be useful to
> continually compact memory in isolation if it already has failed. It
> seems strange to be using __GFP_DIRECT_RECLAIM | __GFP_ONLY_COMPACT,
> though, as I assume this would evolve into.
>
> Are there any other proposed users for __GFP_ONLY_COMPACT beyond thp
> allocations? If not, we should just save the gfp bit and encode the logic
> directly into the page allocator.

Yes that's seems equivalent, just it embeds __GFP_COMPACT_ONLY in
__GFP_NORETRY and ends up setting __GFP_NORETRY for everything,
__GFP_COMPACT_ONLY didn't alter the __GFP_NORETRY semantics instead.

> Would you support this?

I didn't consider it a problem the 40% increased allocation latency, I
thought __GFP_COMPACT_ONLY or the other fix in -mm was just about the
remote THP vs local PAGE_SIZEd memory tradeoff (considering this logic
was already hardcoded only for THP so I tried to keep it).

Now that you raise this higher compaction latency as a big regression
in not fixing the VM with __GFP_COMPACT_ONLY semantics in
__GFP_NORETRY, the next question is why you don't care at all about
such increased latency for all other "costly_order" allocations that
won't use __GFP_THISNODE and where your altered __GFP_NORETRY won't
help to skip compaction on the remote nodes.

In other words if such compaction invocation in fully fragmented
remote nodes is such a big concern for THP, why is not an issue at all
for all other "costly_order" allocations?

> + if (order == pageblock_order &&
> + !(current->flags & PF_KTHREAD))
> + goto nopage;

For a more reliable THP hardcoding HPAGE_PMD_ORDER would be better
than pageblock_order (I guess especially for non-x86 archs where the
match between hugetlbfs size and thp size is less obvious), but then
the hardcoding itself now looks the worst part of this patch if the
concern is the 40% increased allocation latency (and not only the
remote compound page vs local 4k page).

Thanks,
Andrea

Next message: Dennis Zhou: "[GIT PULL] percpu fixes for-4.19-rc8"
Previous message: Joel Fernandes (Google): "[PATCH v2 1/2] mm: Add an F_SEAL_FS_WRITE seal to memfd"
In reply to: Vlastimil Babka: "Re: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]