Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred

From: David Rientjes
Date: Fri Dec 30 2016 - 17:31:07 EST

On Fri, 30 Dec 2016, Mel Gorman wrote:

> Michal is correct in that my intent for defer was to have "never stall"
> as the default behaviour. This was because of the number of severe stalls
> users experienced that lead to recommendations in tuning guides to always
> disable THP. I'd also seen multiple instances in bug reports for stalls
> where it was suggested that THP be disabled even when it could not have
> been a factor. It would be preferred to keep the default behaviour to
> avoid reintroducing such bugs.

I sympathize with that, I've dealt with a number of issues that we have
encountered where thp defrag was either at fault or wasn't, and there were
also suggestions to set defrag to "madvise" to rule it out and that
impacted other users.

I'm curious if you could show examples where there were severe stalls
being encountered by applications that did madvise(MADV_HUGEPAGE) and
users were forced to set madvise to "never". That is, after all, the only
topic for consideration in this thread: the direct impact to users of
madvise(MADV_HUGEPAGE). If an application does it, I believe that's a
demand for work to be done at allocation time to try to get hugepages.
They can certainly provide an application-level option to not do the
MADV_HUGEPAGE. Qemu is no different, you can add options to do
madvise(MADV_HUGEPAGE) or not, and you can also do it after fault.

The problem with the current option set is that we don't have the ability
to trigger background compaction for everybody, which only very minimally
impacts their page fault latency since it just wakes up kcompactd, and
allow MADV_HUGEPAGE users to accept that up-front cost by doing direct
compaction. My usecase, remapping .text segment and faulting thp memory
at startup, demands that ability. Setting defrag=madvise gets that
behavior, but nobody else triggers background compaction when thp memory
fails and we _want_ that behavior so work is being done to defrag.
Setting defrag=defer makes MADV_HUGEPAGE a no-op for page fault, and I
argue that's the wrong behavior.

> I'll neither ack nor nak this patch. However, I would much prefer an
> additional option be added to sysfs called defer-fault that would avoid
> all fault-based stalls but still potentially stall for MADV_HUGEPAGE. I
> would also prefer that the default option is "defer" for both MADV_HUGEPAGE
> and faults.

If you want a fifth option added to sysfs for thp defrag, that's fine, we
can easily do that. I'm slightly concerned with more and more options
added that we will eventually approach the 2^4 option count that I
mentioned earlier and nobody will know what to select. I'm fine with the
kernel default remaining as "madvise," we will just set it to whatever
gets us "direct for madvise, background for everybody else" behavior as we
were planning on using "defer."

We can either do

(1) merge this patch and allow madvise(MADV_HUGEPAGE) users to always try
to get hugepages, potentially adding options to qemu to suppress
their MADV_HUGEPAGE if users have complained (would even fix the
issue on 2.6 kernels) or do it after majority has been faulted, or

(2) add a fifth defrag option to do this suggested behavior and maintain
that option forever.

I'd obviously prefer the former since I consider MADV_HUGEPAGE and not
willing to stall as a userspace issue that can _trivially_ be worked
around in userspace, but in the interest of moving forward on this we can
do the latter if you'd prefer.