Re: [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages

From: Ge Yang
Date: Mon Jan 13 2025 - 21:52:29 EST

On 2025/1/13 23:46, Johannes Weiner wrote:
> CC Vlastimil
>
> On Wed, Jan 08, 2025 at 07:30:54PM +0800, yangge1116@xxxxxxx wrote:
>> From: yangge <yangge1116@xxxxxxx>
>>
>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>> and starting a 32GB virtual machine with device passthrough is
>> extremely slow, taking almost an hour.
>>
>> During the start-up of the virtual machine, it will call
>> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
>> Long-term GUP cannot allocate memory from the CMA area, so a maximum
>> of 16GB of non-CMA memory on a NUMA node can be used as virtual
>> machine memory. There is 16GB of free CMA memory on a NUMA node,
>> which is sufficient to pass the order-0 watermark check, causing the
>> __compaction_suitable() function to consistently return true.
>> However, if there aren't enough migratable pages available,
>> performing memory compaction is also meaningless. Besides checking
>> whether the order-0 watermark is met, __compaction_suitable() also
>> needs to determine whether there are sufficient migratable pages
>> available for memory compaction.
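
For reference, the order-0 watermark test in question looks roughly like
this (a simplified sketch of the current __compaction_suitable() logic
in mm/compaction.c; ALLOC_CMA is why free CMA pages are enough to
satisfy it):

        watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
                        low_wmark_pages(zone) : min_wmark_pages(zone);
        /* compact_gap() adds headroom for migration targets */
        watermark += compact_gap(order);
        /* ALLOC_CMA: free CMA pages count toward wmark_target */
        return __zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
                                   ALLOC_CMA, wmark_target);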
>>
>> For costly allocations, because __compaction_suitable() always
>> returns true, __alloc_pages_slowpath() can't exit at the appropriate
>> place, resulting in excessively long virtual machine startup times.
>> Call trace:
>> __alloc_pages_slowpath
>>     if (compact_result == COMPACT_SKIPPED ||
>>         compact_result == COMPACT_DEFERRED)
>>             goto nopage; // should exit __alloc_pages_slowpath() from here
>>
>> When the 16GB of non-CMA memory on a single node is exhausted, we
>> will fall back to allocating memory on other nodes. In order to fall
>> back to remote nodes quickly, we should skip memory compaction when
>> migratable pages are insufficient. After this fix, it only takes a
>> few tens of seconds to start a 32GB virtual machine with device
>> passthrough functionality.
>>
>> Signed-off-by: yangge <yangge1116@xxxxxxx>
>> ---
>>
>> V3:
>> - fix build error
>>
>> V2:
>> - consider unevictable folios
>>
>>  mm/compaction.c | 20 ++++++++++++++++++++
>>  1 file changed, 20 insertions(+)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 07bd227..a9f1261 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -2383,7 +2383,27 @@ static bool __compaction_suitable(struct zone *zone, int order,
>>                                    int highest_zoneidx,
>>                                    unsigned long wmark_target)
>>  {
>> +        pg_data_t __maybe_unused *pgdat = zone->zone_pgdat;
>> +        unsigned long sum, nr_pinned;
>>          unsigned long watermark;
>> +
>> +        sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
>> +              node_page_state(pgdat, NR_INACTIVE_ANON) +
>> +              node_page_state(pgdat, NR_ACTIVE_FILE) +
>> +              node_page_state(pgdat, NR_ACTIVE_ANON) +
>> +              node_page_state(pgdat, NR_UNEVICTABLE);
>
> What about PAGE_MAPPING_MOVABLE pages that aren't on this list? For
> example, zsmalloc backend pages can be a large share of allocated
> memory, and they are compactable. You would give up on compaction
> prematurely and cause unnecessary allocation failures.

Yes, indeed there are pages that are not on the LRU lists but do
support migration; balloon, z3fold and zsmalloc currently use such
pages. I think we could add an item to node_stat_item to keep
statistics on these pages.
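
A rough sketch of that idea (hypothetical only: NR_NONLRU_MOVABLE is
not an existing node_stat_item, and the hook placement is purely
illustrative):

        /*
         * Hypothetical: balloon/z3fold/zsmalloc would call this when
         * (un)registering movable pages, so that compaction could
         * count them as migratable alongside the LRU pages.
         */
        static inline void mod_nonlru_movable(struct page *page, long nr)
        {
                mod_node_page_state(page_pgdat(page), NR_NONLRU_MOVABLE, nr);
        }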

> That scenario is way more common than the one you're trying to fix.
>
> I think trying to make this list complete, and maintaining it, is
> painstaking and error-prone. And errors are hard to detect: they will
> just manifest as spurious failures in higher-order requests that you'd
> need to catch with tracing enabled at the right moments.
>
> So I'm not a fan of this approach.
>
> Compaction is already skipped when previous runs were not successful.
> See defer_compaction() and compaction_deferred(). Why is this not
> helping here?
>
>         if (prio != COMPACT_PRIO_ASYNC && (status == COMPACT_COMPLETE ||
>                                 status == COMPACT_PARTIAL_SKIPPED))
>                 defer_compaction(zone, order);

When prio != COMPACT_PRIO_ASYNC, defer_compaction(zone, order) is
executed. In __alloc_pages_slowpath(), however, the first call to
__alloc_pages_direct_compact() uses prio == COMPACT_PRIO_ASYNC, so
defer_compaction(zone, order) is not executed; instead we eventually
proceed to the time-consuming __alloc_pages_direct_reclaim(). This
could be avoided in scenarios where memory compaction is not suitable.
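
Condensed from __alloc_pages_slowpath() in mm/page_alloc.c (details
omitted), the path for costly allocations looks like this; the first
attempt runs at INIT_COMPACT_PRIORITY, which equals COMPACT_PRIO_ASYNC,
so the defer logic above is never reached on that attempt:

        /* first attempt: async direct compaction */
        page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
                                            ac, INIT_COMPACT_PRIORITY,
                                            &compact_result);
        if (page)
                goto got_pg;

        if (costly_order && (gfp_mask & __GFP_NORETRY)) {
                /*
                 * Only COMPACT_SKIPPED or COMPACT_DEFERRED bail out
                 * here; since __compaction_suitable() keeps returning
                 * true, we fall through to the time-consuming
                 * __alloc_pages_direct_reclaim() instead.
                 */
                if (compact_result == COMPACT_SKIPPED ||
                    compact_result == COMPACT_DEFERRED)
                        goto nopage;
        }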

>> +        nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
>> +                    node_page_state(pgdat, NR_FOLL_PIN_RELEASED);
>
> Likewise, as Barry notes, not all pinned pages are necessarily LRU
> pages. remap_vmalloc_range() pages come to mind. You can't do subset
> math on potentially disjoint sets.

Indeed, some problem scenarios cannot be solved currently, but this
approach does resolve some of them. We haven't come up with a better
solution yet.

>> +        /*
>> +         * Gup-pinned pages are non-migratable. After subtracting these pages,
>> +         * we need to check if the remaining pages are sufficient for memory
>> +         * compaction.
>> +         */
>> +        if ((sum - nr_pinned) < (1 << order))
>> +                return false;
>> +