[PATCH 4/4] mm: page_alloc: fix non-movable reclaim storm in defrag_mode

From: Johannes Weiner

Date: Fri Jun 26 2026 - 14:23:31 EST


As we deployed defrag_mode into Meta production, pressure spikes and
excessive swapping were observed on some workloads. Tracing confirmed
that the cause is unmovable/reclaimable requests spinning in the
allocator and direct reclaim, which drives the excessive swapping.

The initial plan for defrag_mode was to rely on kswapd/kcompactd to
produce blocks, and if those are overwhelmed under high pressure, let
the allocator fall back (__rmqueue_steal()) after its retry loops.
However, that retrying results in more reclaim on some of these
workloads than we'd hoped, sometimes excessively so, spurred on by the
!costly order conditions in should_reclaim_retry().

The storms are dependent on the request type. Reclaim will inevitably
make room in existing movable blocks, since that's where the LRU pages
live. So if movable requests retry on reclaim, they make progress.

When non-movable requests spin in reclaim, however, that isn't
productive. They cannot use the individually freed pages, and the
process is unlikely to accidentally free whole blocks to meet the
ALLOC_NOFRAGMENT bar. They spin and overreclaim, which tanks
performance and triggers userspace guards like swap exhaustion or
pressure-based OOM.

To fix this, send non-movable requests, regardless of order, into
pageblock reclaim/compaction. This way, they help move things along to
meet the ALLOC_NOFRAGMENT bar. After this patch, the reclaim storms
and excess OOM rates are no longer observed in production.
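
As a rough illustration of the promotion rule described above, here is
a userspace sketch. It is not the kernel code: every sketch_ and
SKETCH_ name is hypothetical, the flag bit is a stand-in, and the
pageblock order of 9 is only an assumption (the real value depends on
the architecture and config).

```c
/* Stand-in values for illustration only; not the kernel's definitions. */
#define SKETCH_ALLOC_NOFRAGMENT	0x1u
#define SKETCH_PAGEBLOCK_ORDER	9	/* assumed; config dependent */

enum sketch_migratetype {
	SKETCH_UNMOVABLE,
	SKETCH_MOVABLE,
	SKETCH_RECLAIMABLE,
};

/*
 * Return the order to reclaim/compact for. Non-movable requests that
 * may not fragment are promoted to at least a whole pageblock;
 * everything else keeps its original order.
 */
static int sketch_promote_order(unsigned int alloc_flags,
				enum sketch_migratetype mt, int order)
{
	if ((alloc_flags & SKETCH_ALLOC_NOFRAGMENT) && mt != SKETCH_MOVABLE)
		return order > SKETCH_PAGEBLOCK_ORDER ?
			order : SKETCH_PAGEBLOCK_ORDER;
	return order;
}
```

Under these assumptions, an order-0 unmovable request with the
no-fragment flag set is promoted to order 9, while a movable request,
or any request without the flag, keeps its original order.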

The longer-term plan is still to have all requests, including the
movable ones, help make blocks to spread the cost of defragmenting
more evenly and fairly, combined with proper watermarking to reduce
allocation latencies in the common case. However, doing this naively
unearths scaling and concurrency limitations in compaction that need
to be addressed first. Promoting just non-movables for now is the
minimally viable bug fix for the above issue.

Fixes: f38356df6474 ("mm: page_alloc: introduce defrag_mode")
Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
---
mm/internal.h | 7 +++++++
mm/page_alloc.c | 36 +++++++++++++++++++++++++++++-------
2 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 181e79f1d6a2..1f636cfc859a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1060,6 +1060,13 @@ struct compact_control {
*/
struct capture_control {
struct compact_control *cc;
+ /*
+ * Allocation request order. May differ from the compaction
+ * order: defrag_mode promotes sub-block allocations to
+ * pageblock-order compaction; capture still matches at the
+ * original allocation order so prep_new_page() is consistent.
+ */
+ int order;
struct page *page;
};

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dee1c47e795..575a99a4c723 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -728,7 +728,7 @@ static inline bool
compaction_capture(struct capture_control *capc, struct page *page,
int order, int migratetype)
{
- if (!capc || order != capc->cc->order)
+ if (!capc || order != capc->order)
return false;

/* Do not accidentally pollute CMA or isolated regions*/
@@ -748,7 +748,7 @@ compaction_capture(struct capture_control *capc, struct page *page,
return false;

if (migratetype != capc->cc->migratetype)
- trace_mm_page_alloc_extfrag(page, capc->cc->order, order,
+ trace_mm_page_alloc_extfrag(page, capc->order, order,
capc->cc->migratetype, migratetype);

capc->page = page;
@@ -4147,10 +4147,27 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
unsigned long pflags;
unsigned int noreclaim_flag;
struct capture_control capc = {
+ .order = order,
.page = NULL,
};
+ int compact_order = order;

- if (!order)
+ /*
+ * If fallbacks are not permitted (defrag_mode), we either
+ * need to reclaim space in a block of matching type, or clear
+ * out an entire block to allow __rmqueue_claim() to convert.
+ *
+ * Reclaim by itself is primarily freeing space in movable
+ * blocks, since that's where the LRU pages live. So this
+ * works for movable requests, but not for others.
+ *
+ * For those, promote the order to help make blocks, instead
+ * of spinning in reclaim alone unproductively.
+ */
+ if ((alloc_flags & ALLOC_NOFRAGMENT) && ac->migratetype != MIGRATE_MOVABLE)
+ compact_order = max(order, pageblock_order);
+
+ if (!compact_order)
return NULL;

/*
@@ -4166,8 +4183,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
fs_reclaim_acquire(gfp_mask);
noreclaim_flag = memalloc_noreclaim_save();

- *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
- prio, &capc);
+ *compact_result = try_to_compact_pages(gfp_mask, compact_order,
+ alloc_flags, ac, prio, &capc);

memalloc_noreclaim_restore(noreclaim_flag);
fs_reclaim_release(gfp_mask);
@@ -4203,7 +4220,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct zone *zone = page_zone(page);

zone->compact_blockskip_flush = false;
- compaction_defer_reset(zone, order, true);
+ compaction_defer_reset(zone, compact_order, true);
count_vm_event(COMPACTSUCCESS);
return page;
}
@@ -4443,9 +4460,14 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
struct page *page = NULL;
unsigned long pflags;
bool drained = false;
+ int reclaim_order = order;
+
+ /* Match the slowpath compaction promotion in __alloc_pages_direct_compact */
+ if ((alloc_flags & ALLOC_NOFRAGMENT) && ac->migratetype != MIGRATE_MOVABLE)
+ reclaim_order = max(order, pageblock_order);

psi_memstall_enter(&pflags);
- *did_some_progress = __perform_reclaim(gfp_mask, order, ac);
+ *did_some_progress = __perform_reclaim(gfp_mask, reclaim_order, ac);
if (unlikely(!(*did_some_progress)))
goto out;

--
2.54.0
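
For readers following the capture_control change in this patch, the
simplified userspace sketch below shows why the capture check compares
against the original allocation order once the compaction order can be
promoted. All sketch_ names are hypothetical, and the CMA, isolation
and migratetype checks of the real compaction_capture() are omitted.

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_compact_control {
	int order;	/* order compaction runs at (possibly promoted) */
};

struct sketch_capture_control {
	struct sketch_compact_control *cc;
	int order;	/* original allocation request order */
};

/* Pre-patch style check: match against the compaction order. */
static bool sketch_capture_old(const struct sketch_capture_control *capc,
			       int freed_order)
{
	return capc && freed_order == capc->cc->order;
}

/* Post-patch style check: match against the request order. */
static bool sketch_capture_new(const struct sketch_capture_control *capc,
			       int freed_order)
{
	return capc && freed_order == capc->order;
}
```

With a request order of 0 and a compaction order promoted to 9, the
old-style check would never match a freed order-0 page, while the
new-style check does.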