Re: [PATCH v2] mm: Reduce memory bloat with THP
From: Mel Gorman
Date: Thu Jan 25 2018 - 16:13:18 EST
On Thu, Jan 25, 2018 at 11:41:03AM -0800, Nitin Gupta wrote:
> >> It's not really about memory scarcity but a more efficient use of it.
> >> Applications may want hugepage benefits without requiring any changes to
> >> app code which is what THP is supposed to provide, while still avoiding
> >> memory bloat.
> >>
> > I read these links and find that there are mainly two complaints:
> > 1. THP causes latency spikes, because direct compaction slows down THP allocation,
> > 2. THP bloats memory footprint when jemalloc uses MADV_DONTNEED to return memory ranges smaller than
> > THP size and fails because of THP.
> >
> > The first complaint is not related to this patch.
>
> I'm trying to address many different THP issues and memory bloat is
> first among them.
Expecting userspace to get this right is probably going to go sideways.
It'll be screwed up and be sub-optimal or have odd semantics for existing
madvise flags. The fact is that an application may not even know in advance
whether it will be using memory sparsely if it's a computational workload
driven by unknown input data.
I suggest you read the old Talluri paper "Surpassing the TLB Performance
of Superpages with Less Operating System Support" and pay attention to
Section 4. There it discusses a page reservation scheme whereby, on fault,
a naturally aligned set of base pages is reserved and only one correctly
placed base page is inserted at the faulting address. It was tied into
a hypothetical piece of hardware that doesn't exist to give best-effort
support for superpages, so it does not directly help you, but the initial
idea is sound. There are holes in the paper from today's perspective but
it was written in the 90's.
From there, read "Practical, transparent operating system support for
superpages" by Navarro, particularly chapter 4, paying attention to the
parts where it talks about opportunism and the promotion threshold.
Superficially, it goes like this (a rough sketch in code follows the list):
1. On fault, reserve a THP in the allocator and use one base page that
   is correctly aligned for the faulting address. By correctly aligned,
   I mean that you use a base page whose offset would be naturally contiguous
   if it ever became part of a huge page.
2. On subsequent faults, attempt to use a base page that is naturally
aligned to be a THP
3. When a "threshold" of base pages has been inserted, allocate the remaining
   pages and promote the range to a THP
4. If there is memory pressure, spill "reserved" pages into the main
allocation pool and lose the opportunity to promote (which will need
khugepaged to recover)
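To make that concrete, below is a rough userspace toy model of steps 1-4.
It is not kernel code and every name in it is made up for illustration;
a real implementation would be reserving pages in the buddy allocator and
tracking the reservation against the mapping, not filling in a local struct.

/*
 * Toy model of the reservation/threshold scheme above. A "region"
 * stands in for one huge-page-aligned range; a fault marks the base
 * page slot that is correctly aligned for the faulting address and
 * the region is promoted once the threshold of populated slots is
 * reached.
 */
#include <stdbool.h>
#include <stdio.h>

#define BASE_PAGES_PER_HUGE 512 /* HPAGE_PMD_NR for 2M THP on x86-64 */

struct region {
        bool slot_used[BASE_PAGES_PER_HUGE]; /* base pages already mapped */
        int nr_used;
        bool promoted;          /* whole range mapped as a huge page */
        bool spilled;           /* reservation lost to memory pressure */
};

/* Steps 1 and 2: fault in the correctly-aligned base page for this slot. */
static void region_fault(struct region *r, unsigned int slot, int threshold)
{
        if (r->promoted || r->slot_used[slot])
                return;

        r->slot_used[slot] = true;
        r->nr_used++;

        /* Step 3: promote once the threshold of base pages is reached. */
        if (!r->spilled && r->nr_used >= threshold)
                r->promoted = true;
}

/* Step 4: under memory pressure, give the reservation back. */
static void region_spill(struct region *r)
{
        r->spilled = true;      /* khugepaged would have to recover later */
}

int main(void)
{
        struct region r = { 0 };

        for (unsigned int slot = 0; slot < 64; slot++)
                region_fault(&r, slot, 64);     /* arbitrary threshold */

        printf("%d base pages used, promoted: %s\n",
               r.nr_used, r.promoted ? "yes" : "no");
        region_spill(&r);
        return 0;
}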
By definition, a promotion threshold of 1 would be the existing scheme
of allocating a THP on the first fault, and some users will want that. It
also should be the default to avoid unexpected overhead. For workloads
where memory is sparsely addressed and the increased overhead of
THP is unwelcome, the threshold should be tuned higher, with a maximum
possible value of HPAGE_PMD_NR.
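Purely to pin down the semantics of that single tunable, something like
the following is what I have in mind. The knob and its name are
hypothetical; nothing like it exists today:

#include <errno.h>

#define BASE_PAGES_PER_HUGE 512 /* HPAGE_PMD_NR for 2M THP on x86-64 */

static int promotion_threshold = 1;     /* default: today's behaviour */

/* Hypothetical validation of the knob; name and range are invented. */
static int set_promotion_threshold(int val)
{
        if (val < 1 || val > BASE_PAGES_PER_HUGE)
                return -EINVAL;
        /*
         * val == 1: allocate and map the whole THP on the first fault.
         * val == BASE_PAGES_PER_HUGE: promote only once every base page
         * in the range has been faulted in.
         */
        promotion_threshold = val;
        return 0;
}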
It's non-trivial to do this because, at minimum, a page fault has to check
whether there is a potential promotion candidate by scanning the PTEs around
the faulting address for a correctly-aligned base page that is
already inserted. If there is, then check if the correctly-aligned base
page for the current faulting address is free and, if so, use it. It'll
also then need to check the remaining PTEs to see whether the promotion
threshold has been reached and, if so, promote the range to a THP (or else
teach khugepaged to do an in-place promotion if possible). In other words,
implementing the promotion threshold is both hard and not free.
However, if it did exist then the only tunable would be the "promotion
threshold" and applications would not need any special awareness of their
address space.
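As a rough illustration of that fault-path check, here is a userspace
sketch of the promotion decision only. Again, not kernel code: the names
are invented and a real implementation would be walking the PTEs of the
covering PMD under the page table lock rather than a boolean array.

#include <stdbool.h>
#include <stddef.h>

#define SLOTS_PER_HUGE 512      /* HPAGE_PMD_NR for 2M THP on x86-64 */

/*
 * Decide, while handling a fault at fault_slot, whether enough
 * correctly-aligned base pages are populated to allocate the
 * remaining pages and promote the range to a THP. A true entry in
 * slots[] means a correctly-aligned base page is already inserted.
 */
static bool should_promote(const bool slots[SLOTS_PER_HUGE],
                           size_t fault_slot, int threshold)
{
        int populated = 0;

        /*
         * Scanning every slot of the covering range on each fault is
         * the extra cost mentioned above; the current fault counts as
         * populated too.
         */
        for (size_t i = 0; i < SLOTS_PER_HUGE; i++)
                if (i == fault_slot || slots[i])
                        populated++;

        return populated >= threshold;
}

int main(void)
{
        bool slots[SLOTS_PER_HUGE] = { false };

        slots[0] = slots[1] = slots[2] = true;
        /* three populated slots plus the faulting slot meets a threshold of 4 */
        return should_promote(slots, 3, 4) ? 0 : 1;
}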
--
Mel Gorman
SUSE Labs