Re: [RFC PATCH] mm: thp: implement THP reservations for anonymous memory

From: Andrea Arcangeli
Date: Sat Nov 10 2018 - 11:44:21 EST


On Sat, Nov 10, 2018 at 01:22:49PM +0000, Mel Gorman wrote:
> On Fri, Nov 09, 2018 at 02:51:50PM -0500, Andrea Arcangeli wrote:
> > And if you're in the camp that is concerned about the use of more RAM
> > or/and about the higher latency of COW faults, I'm afraid the
> > intermediate solution will be still slower than the already available
> > MADV_NOHUGEPAGE or enabled=madvise.
> >
>
> Does that not prevent huge page usage? Maybe you can spell it out a bit

Yes, it prevents huge page usage, but preventing huge page usage is
also what the reservation achieves.

> better. What is the set of system calls an application should make to
> not use huge pages either for the address space or on a per-VMA basis
> and defer to kcompactd? I know that can be tuned globally but that's not
> quite the same thing given that multiple applications or containers can
> be running with different requirements.

Yes, in terms of inheritance that could be used to tune a container we
only have PR_SET_THP_DISABLE, and that renders MADV_HUGEPAGE useless
too, but for microservices that should not be a concern. How to make
those sysfs tunables work per-namespace is a separate issue I think.
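
For example (just a sketch of the existing prctl interface, not of
anything in the patch under discussion), a container launcher could do:

#include <stdio.h>
#include <sys/prctl.h>

/* PR_SET_THP_DISABLE is inherited by children, so setting it in the
 * launcher covers the whole container, but it also defeats
 * MADV_HUGEPAGE in the children, as noted above. */
static int disable_thp_for_subtree(void)
{
        if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0)) {
                perror("prctl(PR_SET_THP_DISABLE)");
                return -1;
        }
        return 0;
}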

The difference is that with the reservation those pages can still be
promoted over time, while with MADV_NOHUGEPAGE they cannot become
hugepages later; khugepaged won't even scan that vma anymore.

The benefit of the reservation will show up in those regions that will
not become hugepages, so if you can predict beforehand that those
ranges don't benefit from THP, it's better if userland calls
madvise(MADV_NOHUGEPAGE) on the range; then there's no need to undo
the reservation later during memory pressure.
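
Something along these lines (a minimal sketch; the buffer and length
are placeholders for whatever the application allocated):

#include <stdio.h>
#include <sys/mman.h>

/* Opt a known-sparse range out of THP up front: khugepaged will skip
 * the vma and faults in it won't allocate huge pages. */
static void opt_out_of_thp(void *buf, size_t len)
{
        if (madvise(buf, len, MADV_NOHUGEPAGE))
                perror("madvise(MADV_NOHUGEPAGE)");
}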

The reservation and promotion is a bit like auto-detecting when
MADV_NOHUGEPAGE should be set, so it boils down to how much of a
corner case that is.

I'm not so concerned about the RAM wasted because I don't think it's
very significant; after all, the application can just do a smaller
malloc if it wants to reduce memory usage.

A massive amount of RAM wasted by hugepages is fairly rare, and in the
extreme RAM could still be wasted even with 4k pages if the app uses
only 1 bit from every 4k page it allocates with malloc.

I'm more concerned about cases where THP wastes CPU, like redis, which
is hurt by the 2M COWs. redis maps all pages, so they will all be
promoted to THP even with the reservation logic applied, but when the
parent writes to the memory (after fork) it should trigger 4k COWs
(not 2M COWs), and in turn split the THP before the COW, or it won't
run as fast as with THP disabled. In addition we should try to reuse
the same IPI of the transhuge pmd split to cover the COW too.
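
To make the access pattern concrete, here is a small illustration of
the redis-like case (it only shows the shape of the workload, with
arbitrary sizes; it doesn't show the proposed kernel change):

#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        size_t len = 64UL << 20;        /* fully populated data set */
        char *data = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (data == MAP_FAILED)
                return 1;
        memset(data, 0xaa, len);        /* all ptes mapped: THP eligible
                                           with or without reservations */

        pid_t pid = fork();             /* snapshot child */
        if (pid == 0) {
                volatile char sum = 0;  /* child only reads */
                for (size_t i = 0; i < len; i += 4096)
                        sum += data[i];
                _exit(0);
        }

        /* parent: sparse writes after fork; each write COWs either 4k
         * or a whole 2M depending on whether the huge pmd is split
         * before the COW */
        for (size_t i = 0; i < len; i += 2UL << 20)
                data[i] = 1;

        waitpid(pid, NULL, 0);
        munmap(data, len);
        return 0;
}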

If we add the reservation and that work makes zero difference for the
redis corner case, and redis must still use MADV_NOHUGEPAGE, it's not
great in my view. It looks like we're trying to optimize issues that
are less critical.

It should be possible to optimize the redis+THP case later with the
uffd WP model (once it's completed; Peter Xu is working on it), and
uffd WP will also remove the fork() and convert it to a clone(). That
way the granularity of the fault is decided by userland, so when uffd
wrprotects a 4k fragment of a THP, the THP will be split during the
uffd wrprotect ioctl.
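
For reference, a sketch of how userland would drive that interface
(this follows the uffd-wp ioctls as they were later merged upstream;
the feature is still incomplete as of this thread, so names and
details may differ):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        long uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        struct uffdio_api api = {
                .api = UFFD_API,
                .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
        };
        if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
                return 1;

        size_t len = 2UL << 20;         /* one THP-sized area */
        char *area = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (area == MAP_FAILED)
                return 1;
        memset(area, 0, len);           /* may be mapped by a THP */

        struct uffdio_register reg = {
                .range = { .start = (unsigned long)area, .len = len },
                .mode = UFFDIO_REGISTER_MODE_WP,
        };
        if (ioctl(uffd, UFFDIO_REGISTER, &reg))
                return 1;

        /* wrprotect a single 4k fragment: this ioctl is where the THP
         * covering it would get split, per the above */
        struct uffdio_writeprotect wp = {
                .range = { .start = (unsigned long)area, .len = 4096 },
                .mode = UFFDIO_WRITEPROTECT_MODE_WP,
        };
        if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
                return 1;

        /* a monitor thread would then read uffd messages and resolve
         * the write faults at 4k granularity */
        return 0;
}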

> > Now about the implementation: the whole point of the reservation
> > complexity is to skip the khugepaged copy, so it can collapse in
> > place. Is skipping the copy worth it? Isn't the big cost the IPI
> > anyway to avoid leaving two simultaneous TLB mappings of different
> > granularity?
> >
>
> Not necessarily. With THP anon in the simple case, it might be just a
> single thread and kcompactd so that's one IPI (kcompactd flushes local and
> one IPI to the CPU the thread was running on assuming it's not migrating
> excessively). It would scale up with the number of threads but I suspect
> the main cost is the actual copying, page table manipulation and the
> locking required.

Agreed, the IPI wouldn't be a concern for a single-threaded app. I was
looking more at the worst-case scenario. For a single-threaded app the
locking should not be too bad either.

> As an aside, a universal benefit would be looking at reducing the time
> to allocate the necessary huge page as we know that can be excessive. It
> would be orthogonal to this series.

With what I suggested the allocation would happen as usual in
khugepaged at a slow pace, without holding locks, so I don't see
obvious disadvantages in terms of THP allocation latency.

> Could you and Kirill outline what sort of workloads you would consider
> acceptable for evaluating this series? One would assume it covers at
> least the following, potentially with a number of workloads.

I would prefer to add intelligence to detect when COWs after fork
should be done at 2M or 4k granularity (in the latter case by
splitting the pmd before the actual COW while leaving the transhuge
pmd intact in the other mm), because that would save CPU (and it'd
automatically optimize redis). The snapshot process especially would
run faster, as it would keep reading with THP performance.

I'm more worried about ensuring THP doesn't cause extra CPU usage, as
happens with the COWs in the above case, than about saving RAM when
the virtual ranges are only partially utilized by the app.

> 1. Evaluate the collapse and copying costs (probing the entire time
> spent in collapse_huge_page might do it)
> 2. Evaluate mmap_sem hold time during hugepage collapse
> 3. Estimate excessive RAM use due to unnecessary THP usage
> 4. Estimate the slowdown due to delayed THP usage
>
> 1 and 2 would indicate how much time is lost due to not using
> reservations. That potentially goes in the direction of simply making
> this faster -- fragmentation reduction (posted but unreviewed), faster
> compaction searches, better page isolation during compaction to
> avoid free pages being reused before an order-9 is free.
>
> 3 should be straight-forward but 4 would be the hardest to evaluate
> because it would have to be determined if 4 is offset by improvements to
> 1-3. If 1-3 is improved enough, it might remove the motivation for the
> series entirely.
>
> In other words, if we agree on a workload in advance, it might bring
> this the right direction and not accidentally throw Anthony down a hole
> working on a series that never gets ack'd.
>
> I'm not necessarily the best person to answer because my natural inclination
> after the fragmentation series would be to keep using thpfioscale
> (from the fragmentation avoidance series) and work on improving the THP
> allocation success rates and reduce latencies. I've tunnel vision on that
> for the moment.

Deciding the workloads is a good question indeed, but I would also be
curious about how many of those pages would not end up being promoted
with this logic.

What's the number of pte_none entries you require in each pmd to avoid
promotion? If it's just 1, apps will run slower, because THP already
helps when there is only partial utilization. I have a hard time
thinking of an ideal ratio; this is why max_ptes_none is 511 after all.

Can we start by counting the total number of pte_none() entries in all
pmds that can fit a THP according to vma->vm_start/end? The pagetable
dumper in debugfs may already provide the info we need, by scanning
all mms and printing the number of "none" ptes that would generate
"wasted" memory (and marginally wasted CPU during copy/clear).

Then you could tell exactly how many pmds won't be promoted to
transhuge pmds with the patch applied in real-life workloads, even
before running any benchmark. It'd be good to be sure we're talking
about a significant number in real-life workloads, or there's not much
to optimize to begin with.

If the amount of RAM saved is significant in real-life workloads, and
in turn there's a chance of a worthwhile tradeoff from the reservation
logic, then we can do the benchmarks, because the page fault behavior
will be different and it'll end up running slower with the reservation
logic.

Thanks,
Andrea