Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation

From: Vlastimil Babka

Date: Tue Apr 21 2026 - 05:06:10 EST


On 4/6/26 00:43, Dave Chinner wrote:
> On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
>> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
>> introduced high-order folio allocations in the buffered write
>> path. When memory is fragmented, each failed allocation triggers
>> compaction and drain_all_pages() via __alloc_pages_slowpath(),
>> causing a 0.75x throughput drop on pgbench (simple-update) with
>> 1024 clients on a 96-vCPU arm64 system.
>>
>> Strip __GFP_DIRECT_RECLAIM from folio allocations in
>> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
>> making them purely opportunistic.
>>
>> Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
>> Cc: stable@xxxxxxxxxxxxxxx
>> Signed-off-by: Salvatore Dipietro <dipiets@xxxxxxxxx>

BTW, backporting perf regression fixes to 6.6 when the regression is only
reported around the time 7.0 is released might be too risky. There will likely
be a different workload that regresses as a result, no matter what we do.

>> ---
>> fs/iomap/buffered-io.c | 15 ++++++++++++++-
>> 1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index 92a831cf4bf1..cb843d54b4d9 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -715,6 +715,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
>>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>>  {
>>  	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
>> +	gfp_t gfp;
>>
>>  	if (iter->flags & IOMAP_NOWAIT)
>>  		fgp |= FGP_NOWAIT;
>> @@ -722,8 +723,20 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>>  		fgp |= FGP_DONTCACHE;
>>  	fgp |= fgf_set_order(len);
>>
>> +	gfp = mapping_gfp_mask(iter->inode->i_mapping);
>> +
>> +	/*
>> +	 * If the folio order hint exceeds PAGE_ALLOC_COSTLY_ORDER,
>> +	 * strip __GFP_DIRECT_RECLAIM to make the allocation purely
>> +	 * opportunistic. This avoids compaction + drain_all_pages()
>> +	 * in __alloc_pages_slowpath() that devastate throughput
>> +	 * on large systems during buffered writes.
>> +	 */
>> +	if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER)
>> +		gfp &= ~__GFP_DIRECT_RECLAIM;
>
> Adding these "gfp &= ~__GFP_DIRECT_RECLAIM" hacks everywhere
> we need to do high order folio allocation is getting out of hand.
>
> Compaction improves long term system performance, so we don't really
> just want to turn it off whenever we have demand for high order
> folios.
>
> What we should be doing is getting compaction out of the direct
> reclaim path - it is -clearly- way too costly for hot paths that use
> large allocations, especially those with fallbacks to smaller
> allocations or vmalloc.
>
> Instead, memory reclaim should kick background compaction and let it
> do the work. If the allocation path really, really needs high order
> allocation to succeed, then it can direct the allocation to retry
> until it succeeds and the allocator itself can wait for background
> compaction to make progress.
>
> For code that has fallbacks to smaller allocations, then there is no
> need to wait for compaction - we can attempt fast smaller allocations
> and continue that way until an allocation succeeds....

So, should we do an LSF/MM session on this?

But I think in any case, the page allocator needs to know which allocations
have a fallback. __GFP_NORETRY exists for exactly this. This version didn't
try it at all; v2 [1] did, but not on its own. I'd start with __GFP_NORETRY
alone, and then we can look at tweaking what it does if it turns out to be
insufficient.

We could have a helper that encapsulates "turn this into a lightweight
allocation that can fall back", which would add __GFP_NORETRY. Something like
that probably already exists somewhere, just not in gfp.h. But I'm not sure we
can simply change GFP_KERNEL to start failing more often for non-costly orders.
We've discussed that a lot in the past :)

[1] https://lore.kernel.org/all/20260420161404.642-1-dipiets@xxxxxxxxx/

> -Dave.