Re: [RFC 3/4] mm, thp: try fault allocations only if we expect them to succeed

From: Vlastimil Babka
Date: Tue Jun 23 2015 - 12:23:25 EST


On 06/18/2015 03:20 AM, David Rientjes wrote:
On Mon, 11 May 2015, Vlastimil Babka wrote:

Since we track THP availability for khugepaged THP collapses, we can use it
also for page fault THP allocations. If khugepaged with its sync compaction
is not able to allocate a hugepage, then it's unlikely that the less involved
attempt on page fault would succeed, and the cost could be higher than THP
benefits. Also clear the THP availability flag if we do attempt and fail to
allocate during page fault, and set the flag if we are freeing a large enough
page from any context. The latter doesn't include merges, as that's a fast
path and unlikely to make much difference.


That depends on how long {scan,alloc}_sleep_millisecs are, so if
khugepaged fails to allocate a hugepage on all nodes, it sleeps for
alloc_sleep_millisecs (default 60s)

Waking up khugepaged earlier is handled in patch 4.

and then there's immediate memory
freeing, thp page faults don't happen again for 60s. That's scary to me
when thp_avail_nodes is clear, a large process terminates, and then
immediately starts back up.

The last hunk of this patch makes sure that freeing a >=HPAGE_PMD_ORDER page sets the thp availability bit so that scenario should be OK. This wouldn't handle merging of free pages to form a large enough page, but that should be rare enough to be negligible.
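
As an illustration only, the flag handling described above can be modelled in user space roughly like this. It is a toy sketch, not the patch: the class and method names are made up for the example, the initial "available" state is an assumption, and HPAGE_PMD_ORDER is set to 9 on the assumption of x86-64 with 4 KiB base pages.

```python
HPAGE_PMD_ORDER = 9  # assumption: x86-64 with 4 KiB base pages (2 MiB THP)


class ThpAvailability:
    """Toy per-node 'a hugepage is probably allocatable' flag (hypothetical names)."""

    def __init__(self, nr_nodes):
        # Assumption: start optimistic, every node may supply a hugepage.
        self.avail = [True] * nr_nodes

    def should_try_fault_alloc(self, node):
        # Page faults skip the THP attempt while the flag is clear.
        return self.avail[node]

    def on_alloc_failed(self, node):
        # A failed attempt (khugepaged or page fault) clears the flag.
        self.avail[node] = False

    def on_page_freed(self, node, order):
        # Freeing a page of at least hugepage order sets the flag again.
        # Buddy merges that build such a page are deliberately not tracked.
        if order >= HPAGE_PMD_ORDER:
            self.avail[node] = True
```

In this model the "large process terminates and restarts" case only re-enables fault allocations if the exit frees at least one page of order >= HPAGE_PMD_ORDER directly; pages that reach that order only through merging leave the flag clear.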

None of its memory is faulted as thp and
depending on how large it is, khugepaged may fail to allocate hugepages
when it wakes back up so it never scans (the only reason why
thp_avail_nodes was clear before it terminated originally).

I'm not sure that approach can work unless the inference of whether a
hugepage can be allocated at a given time is a very good indicator of
whether a hugepage can be allocated alloc_sleep_millisecs later, and I'm
afraid that's not the case.

So does the explanation above solve the concern?

I'm very happy that you're looking at thp fault latency and the role that
khugepaged can play in accepting responsibility for defragmentation,
though. It's an area that has caused me some trouble lately and I'd like
to be able to improve.

Good.

We see an immediate benefit when experimenting with doing synchronous
memory compactions of all memory every 15s. That's done using a cronjob
rather than khugepaged, but the idea is the same.

What would your thoughts be about doing something radical like

- having khugepaged do synchronous memory compaction of all memory at
regular intervals,

I've also been thinking about something like this for some time, yeah. Also maybe not khugepaged, but a per-node "kcompactd" that handles just the compaction and not THP collapses.

- track how many pageblocks are free for thp memory to be allocated,

That should be easy to determine from free lists already? There are per-order counts AFAIK, you just have to sum up over all zones and orders between pageblock order and MAX_ORDER (which should be just 1 or 2 orders).
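
A rough user-space approximation of that sum, working from the per-order counts that /proc/buddyinfo exposes, might look like the sketch below. The function name is hypothetical, and a pageblock order of 9 is an assumption (x86-64 with hugepages enabled). A free page of a higher order is counted as 2^(order - pageblock_order) pageblocks.

```python
PAGEBLOCK_ORDER = 9  # assumption: x86-64, pageblock order == HPAGE_PMD_ORDER


def free_pageblocks(buddyinfo_text, pageblock_order=PAGEBLOCK_ORDER):
    """Return {node_id: free pageblock count} from /proc/buddyinfo-style text."""
    totals = {}
    for line in buddyinfo_text.splitlines():
        fields = line.split()
        # Expected shape: "Node 0, zone Normal c0 c1 ... cN"
        if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
            continue
        node = int(fields[1].rstrip(","))
        counts = [int(c) for c in fields[4:]]  # list index == page order
        blocks = 0
        for order in range(pageblock_order, len(counts)):
            # One free page of this order spans 2^(order - pageblock_order) pageblocks.
            blocks += counts[order] << (order - pageblock_order)
        # Sum over all zones of the node.
        totals[node] = totals.get(node, 0) + blocks
    return totals
```

On a live system the input would be the contents of /proc/buddyinfo; in the kernel the same numbers are available from the per-order free counts without any parsing.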

- terminate collapsing if free pageblocks are below a threshold,

Why not.

- trigger a khugepaged wakeup at page fault when that number of
pageblocks falls below a threshold,

- determine the next full sync memory compaction based on how many
pageblocks were defragmented on the last wakeup, and

- avoid memory compaction for all thp page faults.

Right. That should also reduce the amount of GFP_TRANSHUGE decisions done in the allocator right now...
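
To make the proposed policy a bit more concrete, here is a minimal sketch of the two decisions involved: whether a page fault should request a wakeup, and how the next full compaction could be scheduled from the result of the last one. Nothing in the proposal specifies an actual rule, so the function names, the halve/double back-off and all thresholds below are purely hypothetical.

```python
def should_wake_compactor(free_pageblocks, low_threshold):
    # Page-fault side: no direct compaction, only a wakeup request
    # when the number of free pageblocks drops below the threshold.
    return free_pageblocks < low_threshold


def next_interval_ms(prev_ms, defragmented, productive=16,
                     min_ms=1_000, max_ms=60_000):
    """Hypothetical rule: compact sooner after a productive pass, back off after a futile one."""
    if defragmented >= productive:
        # The last pass produced many pageblocks, so run again sooner.
        return max(min_ms, prev_ms // 2)
    if defragmented == 0:
        # Nothing was gained, so there is no point in burning CPU at the same rate.
        return min(max_ms, prev_ms * 2)
    return prev_ms
```

For example, starting from a 15 s interval, a pass that defragmented 32 pageblocks would shorten the interval to 7.5 s under this rule, and a pass that defragmented none would lengthen it to 30 s.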

I think there are more benefits possible when a thread is responsible for thorough defragmentation and its activity is tuned appropriately (and doesn't depend on the collapse scanning results, as is now the case for khugepaged - it won't compact anything on a node if there's nothing to collapse there).

- direct compaction could quickly skip a block of memory in the migrate scanner as soon as it finds a page that cannot be isolated. I had a patch for that [1], but dropped it due to longer-term fragmentation becoming worse.

- I think that direct compaction could also stop using the current free scanner and just get free pages from the free lists. In my current testing I see that the free scanner spends an awful lot of time finding those free pages when we are near the watermarks. I think this approach should work better, combined with implementing the previous point:
  - if the free page that came from the free list is within the order-aligned block that the migrate scanner is processing, then of course we don't use it as a migration target. We keep the page aside on a list so it can later merge with the pages freed by migration.
  - since getting pages from the free lists is done in increasing order starting from 0, it would also have some natural antifragmentation effects. Right now the free scanner can easily break an order-8 page to obtain one or a few pages as migration targets.
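
The target selection described in the two sub-points can be sketched as a small user-space model. This is only an illustration under simplifying assumptions: free lists are plain Python lists of start pfns indexed by order, `needed` counts base pages, and the function name is made up.

```python
def pick_migration_targets(free_lists, migrate_block_pfn, block_order, needed):
    """Take free pages from the lowest orders first.

    free_lists: list indexed by order, each entry a list of start pfns
                (modified in place as pages are taken).
    Returns (targets, set_aside), both lists of (pfn, order).
    """
    # Order-aligned block currently being processed by the migrate scanner.
    block_start = migrate_block_pfn & ~((1 << block_order) - 1)
    block_end = block_start + (1 << block_order)

    targets, set_aside = [], []
    got = 0  # base pages collected so far
    for order, pfns in enumerate(free_lists):
        while pfns and got < needed:
            pfn = pfns.pop(0)
            if block_start <= pfn < block_end:
                # Inside the block being compacted: not a target, keep it
                # aside so it can merge with pages freed by migration.
                set_aside.append((pfn, order))
            else:
                targets.append((pfn, order))
                got += 1 << order
        if got >= needed:
            break
    return targets, set_aside
```

Because the loop walks orders upwards from 0, a larger free page is only taken once all smaller ones are exhausted, which is where the antifragmentation effect would come from.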

Of course, after such modifications direct compaction is no longer truly "compaction". That's why it would need to be complemented with the traditional one, done by a dedicated thread, to avoid regressions in long-term fragmentation.

[1] http://www.spinics.net/lists/linux-mm/msg76307.html

(I'd ignore what is actually the responsibility of khugepaged and what is
done in task work at this time.)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxxx For more info on Linux MM,
see: http://www.linux-mm.org/ .

