Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation

From: Matthew Wilcox

Date: Sun May 03 2026 - 07:56:03 EST

On Sun, May 03, 2026 at 11:22:10AM +0530, Ritesh Harjani wrote:
> Here is what I believe could be the cause of the memory fragmentation
> with this workload:
> In Linux, each PTE page table occupies 4KB (assuming a 4KB system
> PAGE_SIZE). When your workload forks a
> child process for each new connection, the child gets its own copy of the
> page tables that map the shared buffer.
> Since each PTE table is a single 4KB page, hundreds of connections
> spawning means hundreds of thousands of single-page allocations for page
> tables. So it looks like the major source of your memory fragmentation
> problem is these many order-0 allocations for PTE page table
> pages.

While memory is fragmented, the _problem_ is that we try too hard to
defragment. From the original post:

: When memory is fragmented, each failed allocation triggers
: compaction and drain_all_pages() via __alloc_pages_slowpath()

We really should only try compaction once. If it didn't make useful
progress last time, it won't this time either.
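
To illustrate the "only try once" idea, here is a minimal userspace sketch,
not kernel code. The struct zone_model, try_compaction() and
alloc_high_order() names are hypothetical and exist only for this example;
the model assumes a single flag is enough to remember that the last
compaction attempt made no useful progress.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one zone's compaction bookkeeping (not a kernel struct). */
struct zone_model {
	bool can_defragment;       /* whether compaction is able to help at all */
	bool compaction_failed;    /* last attempt made no useful progress */
	unsigned int compact_runs; /* how many times compaction actually ran */
};

/* Stand-in for an expensive compaction pass; reports whether it helped. */
static bool try_compaction(struct zone_model *z)
{
	z->compact_runs++;
	return z->can_defragment;
}

/*
 * High-order allocation with a fallback available to the caller.
 * Returns true on success. Once compaction has failed, later calls
 * fail fast instead of compacting again.
 */
static bool alloc_high_order(struct zone_model *z)
{
	assert(z != NULL);

	if (z->compaction_failed)
		return false; /* it didn't help last time; don't retry */

	if (try_compaction(z))
		return true;

	z->compaction_failed = true;
	return false;
}
```

In this model, a hundred failed high-order allocations against a zone that
cannot be defragmented cost one compaction pass instead of a hundred. A real
implementation would also need some way to clear the flag once memory
conditions change, which the sketch leaves out.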

> > | Patch | Run 1 | Run 2 | Run 3 | Average | % vs Baseline |
> > |----------------------|-----------:|-----------:|-----------:|------------:|:-------------:|
> > | Baseline | 107,064.61 | 97,043.86 | 101,830.78 | 101,979.75 | — |
> > | Proposed patch | 146,012.23 | 136,392.36 | 141,178.00 | 141,194.20 | +38.45% |
> > | Ritesh's suggestion | 147,481.50 | 133,069.03 | 137,051.30 | 139,200.61 | +36.50% |
> > | Matthew's suggestion | 145,653.91 | 144,169.24 | 141,768.31 | 143,863.82 | +41.07% |
>
>
> The main reason I proposed the patch below is that it only
> affects costly-order allocations (i.e. order > PAGE_ALLOC_COSTLY_ORDER)
> by skipping direct reclaim for those orders, while keeping the
> behaviour the same for the others.
>
> So, for smaller orders (order > min_order and <=
> PAGE_ALLOC_COSTLY_ORDER), the allocator will still attempt direct
> reclaim and compaction (which I guess is required to avoid OOM too?). Also,
> this looks like a change that could be backported easily :)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4e636647100c..f2343c26dd63 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2007,8 +2007,13 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
>  			gfp_t alloc_gfp = gfp;
>
>  			err = -ENOMEM;
> -			if (order > min_order)
> -				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
> +			if (order > min_order) {
> +				alloc_gfp |= __GFP_NOWARN;
> +				if (order > PAGE_ALLOC_COSTLY_ORDER)
> +					alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> +				else
> +					alloc_gfp |= __GFP_NORETRY;
> +			}
>
>
> But of course let's hear from others on their suggestions / thoughts.
> Maybe filemap is not the right place to fix this, as Matthew, Andrew
> and others were pointing out. Any other suggestions on how to approach this,
> please?

filemap.c REALLY shouldn't know about PAGE_ALLOC_COSTLY_ORDER.
That's an internal detail of the memory allocator.

Either we want an API to say "allocate me a folio between orders A and B"
or we need more understandable GFP flags. Or the page allocator could
use the __GFP_NORETRY flag to say "oh well, this allocation has a fallback,
I'll kick kcompactd to try to compact some more memory, but I'll fail
the allocation".
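
As a rough illustration of the first option, an "allocate between orders A
and B" interface could look something like the userspace sketch below. None
of this is an existing kernel API: struct buddy_model, MODEL_MAX_ORDER and
alloc_order_range() are hypothetical names, and the free-list model is
deliberately simplified (it does not split larger blocks, reclaim, or
compact).

```c
#include <assert.h>

#define MODEL_MAX_ORDER 10

/* Hypothetical free-list model: free_blocks[o] counts free blocks of order o. */
struct buddy_model {
	unsigned int free_blocks[MODEL_MAX_ORDER + 1];
};

/*
 * Try each order from max_order down to min_order and take the first
 * one that has a free block. Returns the order that was allocated,
 * or -1 if nothing in the range is available.
 *
 * The caller never sees allocator-internal thresholds; it only states
 * the range of orders it is willing to accept.
 */
static int alloc_order_range(struct buddy_model *b,
			     unsigned int min_order, unsigned int max_order)
{
	assert(min_order <= max_order && max_order <= MODEL_MAX_ORDER);

	for (unsigned int order = max_order; ; order--) {
		if (b->free_blocks[order] > 0) {
			b->free_blocks[order]--;
			return (int)order;
		}
		if (order == min_order)
			break; /* checked here so an unsigned order never wraps below 0 */
	}
	return -1;
}
```

With an interface of this shape, the decision about how hard to try at each
order (cheap attempts above the minimum, reclaim and compaction only at the
minimum) would live inside the allocator, and the caller would not have to
encode it in GFP flags.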