Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation

From: Karim Manaouil

Date: Sun May 31 2026 - 19:30:05 EST

On Wed, May 27, 2026 at 04:24:10PM +0000, Salvatore Dipietro wrote:
>
> Thanks Ritesh and Matthew for the continued feedback and guidance on this thread.
> I'd like to summarize where we stand and ask for your input on the best path forward.
>
> Summary of approaches tested:
> We've now benchmarked all proposed variations (pgbench simple-update, 1024 clients,
> 96-vCPU arm64, huge_pages=off, PREEMPT_NONE applied [1]):
>
> | Patch | Change Location | Avg TPS | % vs Baseline |
> |--------------------------------|-----------------------|-----------:|:-------------:|
> | Baseline (no patch) | — | 101,979.75 | — |
> | v1 (original, iomap caller) | fs/iomap/buffered-io.c| 141,194.20 | +38.45% |
> | Ritesh's suggestion | mm/filemap.c | 139,200.61 | +36.50% |
> | Matthew's suggestion | mm/filemap.c | 143,863.82 | +41.07% |
> | kcompactd background | mm/page_alloc.c | 134,278.47 | +31.67% |
>
>
> All approaches recover significant throughput. The kcompactd approach (background
> compaction and returning nopage for costly orders with __GFP_NORETRY) aligns with the
> architectural direction Dave and Christoph proposed, keeping compaction out of the direct
> reclaim path, and lives entirely in the page allocator.
>
> Based on the discussion, I see two possible directions and would appreciate your guidance:
>
> 1. Page allocator fix (mm/page_alloc.c): The kcompactd background approach addresses
> Matthew's concern that filemap.c shouldn't know about PAGE_ALLOC_COSTLY_ORDER, and aligns
> with Dave's vision of removing compaction from the direct reclaim path.
>
> 2. filemap fix (mm/filemap.c): Both Ritesh's and Matthew's suggestions are minimal,
> backportable, and preserve lightweight reclaim for non-costly orders.
> Ritesh's variant differentiates between costly and non-costly orders, while Matthew's
> is simpler and performs best.

I am not very familiar with THPs in the page cache, but for anonymous
memory we have /sys/kernel/mm/transparent_hugepage/defrag, which
decides what to do when a THP allocation fails: enter synchronous
compaction or wake up kcompactd.

Check vma_thp_gfp_mask(). Maybe you should adopt something similar called
file_thp_gfp_mask().
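
To make the idea concrete, here is a rough user-space model of the
policy table, not kernel code. Everything named model_* or MODEL_* is
made up for illustration, and "hinted" is a hypothetical stand-in for
whatever per-file opt-in would play the role that madvise() plays for
anonymous memory. The mapping from defrag mode to reclaim behaviour
follows my reading of vma_thp_gfp_mask() and should be checked against
the tree before anyone relies on it.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in bits only; the real gfp flags are defined by the kernel. */
#define MODEL_DIRECT_RECLAIM (1u << 0) /* caller may reclaim/compact synchronously */
#define MODEL_KSWAPD_RECLAIM (1u << 1) /* caller may wake background reclaim/compaction */
#define MODEL_NORETRY        (1u << 2) /* give up early instead of retrying hard */

/* Mirrors the five values accepted by the anonymous THP defrag knob. */
enum model_defrag {
	MODEL_DEFRAG_ALWAYS,
	MODEL_DEFRAG_DEFER,
	MODEL_DEFRAG_DEFER_MADVISE,
	MODEL_DEFRAG_MADVISE,
	MODEL_DEFRAG_NEVER,
};

/*
 * Hypothetical file_thp_gfp_mask() analogue: pick how hard a large folio
 * allocation may try, based on a policy knob and a per-file hint.
 */
unsigned int model_file_thp_gfp_mask(enum model_defrag policy, bool hinted)
{
	switch (policy) {
	case MODEL_DEFRAG_ALWAYS:
		/* Always stall, but only retry hard when explicitly hinted. */
		return MODEL_DIRECT_RECLAIM | (hinted ? 0 : MODEL_NORETRY);
	case MODEL_DEFRAG_DEFER:
		/* Never stall: wake the background threads and fall back. */
		return MODEL_KSWAPD_RECLAIM;
	case MODEL_DEFRAG_DEFER_MADVISE:
		/* Stall only when hinted, otherwise defer to the background. */
		return hinted ? MODEL_DIRECT_RECLAIM : MODEL_KSWAPD_RECLAIM;
	case MODEL_DEFRAG_MADVISE:
		/* Stall only when hinted, otherwise fail fast. */
		return hinted ? MODEL_DIRECT_RECLAIM : 0;
	case MODEL_DEFRAG_NEVER:
	default:
		/* Fail fast with no reclaim or compaction at all. */
		return 0;
	}
}

/* Small self-check of the table above. */
void model_file_thp_self_check(void)
{
	assert(model_file_thp_gfp_mask(MODEL_DEFRAG_DEFER, false) ==
	       MODEL_KSWAPD_RECLAIM);
	assert(model_file_thp_gfp_mask(MODEL_DEFRAG_NEVER, true) == 0);
}
```

With something like this, the "defer" mode would give the behaviour of
the kcompactd variant (wake the background thread, fall back to a
smaller order), while "always" would keep today's synchronous attempt
for users who want it.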

The problem with always falling back is that your application is never
going to get a THP, and the resulting TLB pressure might end up slowing
you down in the long run.

Also, compaction is only really tried if it makes sense, that is, if
enough free memory is available to actually perform the compaction and
have a chance of creating a large enough huge page. So compaction is
never performed under acute memory pressure, which means your system
has enough free pages, but somehow the compaction is slow and
inefficient.
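
For reference, the "does it make sense" test can be modelled roughly as
below. This is a simplified user-space sketch based on my recollection
of the suitability check in mm/compaction.c (watermark plus a gap of
twice the requested size). The model_* names are invented, and the real
check also looks at things this model ignores, such as the
fragmentation index for costly orders.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Headroom wanted on top of the watermark, in pages: twice the size of
 * the requested allocation, so there is room for migration targets.
 */
unsigned long model_compact_gap(unsigned int order)
{
	return 2UL << order;
}

/*
 * Simplified suitability test: compaction is only worth attempting when
 * the zone's free pages exceed the watermark plus the gap above.
 */
bool model_compaction_worth_trying(unsigned long free_pages,
				   unsigned long watermark_pages,
				   unsigned int order)
{
	return free_pages > watermark_pages + model_compact_gap(order);
}

/* Small self-check: an order-9 request needs 1024 pages of headroom. */
void model_compaction_self_check(void)
{
	assert(model_compact_gap(9) == 1024UL);
	assert(!model_compaction_worth_trying(5000, 4000, 9));
	assert(model_compaction_worth_trying(6000, 4000, 9));
}
```

If this check passes on your system and direct compaction still stalls
for long periods, that points at compaction struggling to make progress
rather than at a lack of free memory.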

I am just thinking out loud here and trying to address the root cause.
The real problem is fragmentation due to unmovable pages, in your case
probably the page tables. We should work more on reducing pageblock
type mixing. Page tables could also be made movable, so that
compaction can treat them like any other movable page.

>
> Would either of these directions be acceptable for a v3, or would you prefer a different approach?
>
> I'm happy to test any additional variations or direction to move this forward
>
> Salvatore
>
>
> [1] https://lore.kernel.org/all/20260403191942.21410-1-dipiets@xxxxxxxxx/T/#m8baeeaf48aa7ae5342c8c2db8f4e1c27e03c1368

--
~karim