Re: [PATCH v2] mm/page_alloc: fix defrag_mode for non-reclaimable allocations

From: Johannes Weiner

Date: Tue May 26 2026 - 13:52:08 EST

On Tue, May 26, 2026 at 03:13:09PM +0200, Vlastimil Babka (SUSE) wrote:
> On 5/22/26 3:05 PM, Dmitry Ilvokhin wrote:
> > On Thu, May 21, 2026 at 04:59:10PM -0700, Andrew Morton wrote:
> >> On Wed, 20 May 2026 12:22:28 +0000 Dmitry Ilvokhin <d@xxxxxxxxxxxx> wrote:
> >>
> >>> When defrag_mode is enabled, ALLOC_NOFRAGMENT is enforced to prevent
> >>> migratetype fallbacks and keep pageblocks clean. The allocator relies on
> >>> reclaim and compaction to free pages of the correct type before allowing
> >>> fallback as a last resort.
> >>>
> >>> However, non-reclaimable allocations such as GFP_ATOMIC cannot invoke
> >>> direct reclaim or compaction. With defrag_mode=1, these allocations hit
> >>> the !can_direct_reclaim bailout in __alloc_pages_slowpath() with
> >>> ALLOC_NOFRAGMENT still set, and fail without ever attempting a fallback.
> >>>
> >>> This causes a large number of SLUB allocation failures for
> >>> skbuff_head_cache under network-heavy workloads, despite free memory
> >>> being available in other migratetype freelists.
> >>
> >> That sounds painful.
> >>
> >>> Clear ALLOC_NOFRAGMENT and retry for allocations that request kswapd
> >>> reclaim but cannot do direct reclaim themselves (GFP_ATOMIC). Purely
> >>> speculative allocations like GFP_TRANSHUGE_LIGHT that don't set
> >>> __GFP_KSWAPD_RECLAIM are left to fail, since they have reasonable
> >>> fallbacks and should not cause fragmentation.
> >>
> >> How serious is this to our users when running real-world workloads?
> >
> > We observed it on a few of the Meta workloads that adopted
> > defrag_mode=1.
>
> Do you (or Johannes) have some observations to share about what
> motivated those to adopt it, what kind of workloads benefit and how?

As you may remember, it was developed to help with higher order / THP
success rates under pressure.

The impetus for actually deploying it was that we saw issues with
avalanches of large page cache folios vacuuming up the higher-order
chunks; this (ironically) also led to failures on the network side.

It's kind of a structural problem. We have real preproduction buffers
for order-0 pages through the watermarks. But for higher orders we
only ensure there is at least one free block of a suitable size. That
easily fails under even mild competition.
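
To make the asymmetry concrete, here is a minimal userspace sketch of
that check, loosely modeled on the logic of __zone_watermark_ok(). The
struct, the constants and the toy_* names are simplified stand-ins made
up for illustration, not the kernel's definitions; migratetypes, lowmem
reserves and highatomic handling are ignored.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_ORDERS 11	/* assumption: orders 0..10 */

/* Hypothetical, stripped-down stand-in for a zone's free lists. */
struct toy_zone {
	unsigned long nr_free[NR_ORDERS];	/* free blocks per order */
	unsigned long watermark;		/* in base pages */
};

/* Total free memory in base pages, summed over all orders. */
static unsigned long toy_free_pages(const struct toy_zone *z)
{
	unsigned long pages = 0;

	for (int o = 0; o < NR_ORDERS; o++)
		pages += z->nr_free[o] << o;
	return pages;
}

static bool toy_watermark_ok(const struct toy_zone *z, unsigned int order)
{
	assert(order < NR_ORDERS);

	/* The real buffer: total free pages must exceed the watermark. */
	if (toy_free_pages(z) <= z->watermark)
		return false;

	/* For order-0 that is all there is to check. */
	if (order == 0)
		return true;

	/* Higher orders: a single free block of sufficient size is enough. */
	for (unsigned int o = order; o < NR_ORDERS; o++)
		if (z->nr_free[o])
			return true;

	return false;
}
```

With 4096 free order-0 pages, a single free order-3 block and a
watermark of 1024 pages, toy_watermark_ok() passes for order 3. Once one
competing allocation takes that block it fails, even though the order-0
watermark is still comfortably met.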

Since we wanted to roll out defrag_mode for THP in multi-tenant systems
anyway, we figured we might as well take the plunge now and battle
test the feature this way.

defrag_mode fixes *that* issue, by preproducing watermark buffers in
contiguous pageblocks - making everything up to that order more
readily available. I'm still hoping to make it the default eventually,
which was the plan with the original huge page allocator series. As we
keep leaning into higher order requests more and more, and especially
grow the non-optional ones, we kind of need non-optional preproduction
guarantees for higher orders as well.

But there are bugs like this one, and we're still figuring out some
overreclaim issues with it in production as well. So I'm glad it's
optional for the time being ;-)
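
For anyone following along, the condition described in the quoted
changelog boils down to roughly the sketch below. The TOY_* flag values
and the helper name are invented for illustration (the real __GFP_* and
ALLOC_* bits differ), and this is not the actual patch hunk. It only
models the non-reclaim bailout, not the rest of the slowpath.

```c
#include <stdbool.h>

/* Illustrative bit values only; not the kernel's. */
#define TOY_GFP_DIRECT_RECLAIM	0x1u
#define TOY_GFP_KSWAPD_RECLAIM	0x2u
#define TOY_ALLOC_NOFRAGMENT	0x100u

/*
 * Hypothetical helper: should an allocation that cannot direct
 * reclaim drop NOFRAGMENT and retry with fallbacks allowed?
 */
static bool toy_should_drop_nofragment(unsigned int gfp,
				       unsigned int alloc_flags)
{
	/* Nothing to drop if fallbacks were never restricted. */
	if (!(alloc_flags & TOY_ALLOC_NOFRAGMENT))
		return false;

	/* Direct reclaimers get reclaim and compaction first. */
	if (gfp & TOY_GFP_DIRECT_RECLAIM)
		return false;

	/*
	 * GFP_ATOMIC-style requests wake kswapd and get the retry;
	 * purely speculative ones (neither reclaim bit) just fail.
	 */
	return (gfp & TOY_GFP_KSWAPD_RECLAIM) != 0;
}
```

A GFP_ATOMIC-like request (kswapd bit only) gets the retry, while a
GFP_TRANSHUGE_LIGHT-like request (neither reclaim bit) is left to fail.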