[RFC 0/6] the big khugepaged redesign

From: Vlastimil Babka
Date: Mon Feb 23 2015 - 07:59:25 EST


Recently, there was concern expressed (e.g. [1]) whether the quite aggressive
THP allocation attempts on page faults are a good performance trade-off.

- THP allocations add to page fault latency, as high-order allocations are
notoriously expensive. Page allocation slowpath now does extra checks for
GFP_TRANSHUGE && !PF_KTHREAD to avoid the more expensive synchronous
compaction for user page faults. But even async compaction can be expensive.
- During the first page fault in a 2MB range we cannot predict how much of the
range will be actually accessed - we can theoretically waste as much as 511
worth of pages [2]. Or, the pages in the range might be accessed from CPUs
from different NUMA nodes and while base pages could be all local, THP could
be remote to all but one CPU. The cost of remote accesses due to this false
sharing would be higher than any savings on the TLB.
- The interaction with memcg are also problematic [1].

Now I don't have any hard data to show how big these problems are, and I
expect we will discuss this on LSF/MM (and hope somebody has such data [3]).
But it's certain that e.g. SAP recommends to disable THPs [4] for their apps
for performance reasons.

One might think that instead of fully disabling THP's it should be possible to
only disable (or make less aggressive, or limit to MADV_HUGEPAGE regions) THP's
for page faults and leave the collapsing up to khugepaged, which would hide the
latencies and allow better decisions based on how many base pages were faulted
in and from which nodes. However, looking more closely gives the impression
that khugepaged was meant rather as a rarely needed fallback for cases where
the THP page fault fails due to e.g. low memory. There are some tunables under
/sys/kernel/mm/transparent_hugepage/ but it doesn't seem sufficient for moving
the bulk of the THP work to khugepaged as it is.

- setting "defrag" to "madvise" or "never", while leaving khugepaged/defrag=1
will result in very lightweight THP allocation attempts during page faults.
This is nice and solves the latency problem, but not the other problems
described above. It doesn't seem possible to disable page fault THP's
completely without also disabling khugepaged.
- even if it was possible, the default settings for khugepaged are to scan up
to 8 PMD's and collapse up to 1 THP per 10 seconds. That's probably too slow
for some workloads and machines, but if one was to set this to be more
aggressive, it would become quite inefficient. Khugepaged has a single global
list of mm's to scan, which may include lot of tasks where scanning won't yield
anything. It should rather focus on tasks that are actually running (and thus
could benefit) and where collapses were recently successful. The activity
should also be ideally accounted to the task that benefits from it.
- scanning on NUMA systems will proceed even when actual THP allocations fail,
e.g. during memory pressure. In such case it should be better to save the
scanning time until memory is available.
- there were some limitations on which PMD's khugepaged can collapse. Thanks to
Ebru's recent patches, this should be fixed soon. With the
khugepaged/max_ptes_none tunable, one can limit the potential memory wasted.
But limiting NUMA false sharing is performed just in zone reclaim mode and
could be made stricter.

This RFC patchset doesn't try to solve everything mentioned above - the full
motivation is included for the bigger picture discussion, including LSF/MM.
The main part of this patchset is the move of collapse scanning from
khugepaged to the task_work context. This has been already discussed as a good
idea and a RFC has been posted by Alex Thorlton last October [5]. In that
prototype, scanning has been driven from __khugepaged_enter(), which is called
on events such as vma_merge, fork, and THP page fault attempts, e.g. events
that are not exactly periodical. The main difference in my patchset is it being
modeled after the automatic NUMA balancing scanning, i.e. using scheduler's
task_tick_fair(). The second difference is that khugepaged is not disabled
entirely, but repurposed for the costly hugepage allocations. There is a
nodemask for indicating which nodes should have hugepages easily available.
The idea is that hugepage allocation attempts from the process context
(either page fault or the task_work collapsing) would not attempt
reclaim/compaction, and on failure will clear the nodemask bit and wake up
khugepaged to do the hard work and flip the bit back on. If if appears that
ther are no hugepages available, attempts to page fault THP or scan are
suspended.

I have done only light testing so far to see that it works as intended, but
not to prove it's "better" than current state. I wanted to post the RFC before
LSF/MM.

There are known limitations and TODO/FIXME's in the code, for example:
- the scanning period doesn't yet adjust itself based on recent collapse
success/failure. The idea is that it will double itself if the last full
mm scan yielded no collapse.
- some user-visible counters for the task scanning activity should be added
- the change of THP allocation attempts pressure can have hard to predict
outcomes on fragmentation and the periods of deferred compaction.
- Documentation should be updated

More stuff is not decided yet:
- moving to task context from khugepaged results in the hugepage allocations
for collapsing not use sync compaction anymore. But should we also change
the "defrag" default from "always" to e.g. "madvise", which results in
~__GFP_WAIT allocations to further reduce latencies perceived from process
context?
- Would make sense to have one khugepaged instance per memory node, bound to
the corresponding CPU's? It would then touch local memory e.g. during
compaction migrations, and the khugepaged sleeps and wakeups would be more
fine-grained.
- should we keep using the tunables under /sys/.../khugepaged for the activity
that is mostly no longer performed by khugepaged, and when e.g. NUMA
balancing uses sysctl?

The stuff not touched by this patchset (nor decided):
- should we allow the user to disable THP page faults completely (or restrict
them to MADV_HUGEPAGE vma's?), or just assume that max_ptes_none < 511 means
it should be disabled, because we cannot know how much of the 2MB the process
would fault in?
- do we want to further limit collapses when base pages come from different
NUMA nodes? Another tunable (saying that minimum X pte's should be from
single node), or just hard-require that all are from the same node?

Patchset was lightly tested on v3.19 + 2 cherry-picked patches 077fcf116c8c and
be97a41b291, and rebased to 4.0-rc1 before sending.

[1] http://marc.info/?l=linux-mm&m=142056088232484&w=2
[2] http://www.spinics.net/lists/kernel/msg1928252.html
[3] http://www.spinics.net/lists/linux-mm/msg84023.html
[4] http://scn.sap.com/people/markmumy/blog/2014/05/22/sap-iq-and-linux-hugepagestransparent-hugepages
[5] https://lkml.org/lkml/2014/10/22/931

Vlastimil Babka (6):
mm, thp: stop preallocating hugepages in khugepaged
mm, thp: make khugepaged check for THP allocability before scanning
mm, thp: try fault allocations only if we expect them to succeed
mm, thp: move collapsing from khugepaged to task_work context
mm, thp: wakeup khugepaged when THP allocation fails
mm, thp: remove no longer needed khugepaged code

include/linux/khugepaged.h | 19 +-
include/linux/mm_types.h | 4 +
include/linux/sched.h | 5 +
kernel/fork.c | 1 -
kernel/sched/core.c | 12 +
kernel/sched/fair.c | 124 ++++++++-
mm/huge_memory.c | 628 ++++++++++++++-------------------------------
7 files changed, 339 insertions(+), 454 deletions(-)

--
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/