[PATCH 0/5] mm: reliable huge page allocator

From: Johannes Weiner
Date: Thu Mar 13 2025 - 17:07:06 EST


This series changes the page allocator and the reclaim/compaction code
to try harder to avoid fragmentation. As a result, huge page
allocations become cheaper, more reliable and more sustainable.

It's a subset of the huge page allocator RFC initially proposed here:

https://lore.kernel.org/lkml/20230418191313.268131-1-hannes@xxxxxxxxxxx/

The following results are from a kernel build test, with additional
concurrent bursts of THP allocations on a memory-constrained system.
Comparing before and after the changes over 15 runs:

                                       before                 after
Hugealloc Time mean               52739.45 (  +0.00%)    28904.00 ( -45.19%)
Hugealloc Time stddev             56541.26 (  +0.00%)    33464.37 ( -40.81%)
Kbuild Real time                    197.47 (  +0.00%)      196.59 (  -0.44%)
Kbuild User time                   1240.49 (  +0.00%)     1231.67 (  -0.71%)
Kbuild System time                   70.08 (  +0.00%)       59.10 ( -15.45%)
THP fault alloc                   46727.07 (  +0.00%)    63223.67 ( +35.30%)
THP fault fallback                21910.60 (  +0.00%)     5412.47 ( -75.29%)
Direct compact fail                 195.80 (  +0.00%)       59.07 ( -69.48%)
Direct compact success                7.93 (  +0.00%)        2.80 ( -57.46%)
Direct compact success rate %         3.51 (  +0.00%)        3.99 ( +10.49%)
Compact daemon scanned migrate  3369601.27 (  +0.00%)  2267500.33 ( -32.71%)
Compact daemon scanned free     5075474.47 (  +0.00%)  2339773.00 ( -53.90%)
Compact direct scanned migrate   161787.27 (  +0.00%)    47659.93 ( -70.54%)
Compact direct scanned free      163467.53 (  +0.00%)    40729.67 ( -75.08%)
Compact total migrate scanned   3531388.53 (  +0.00%)  2315160.27 ( -34.44%)
Compact total free scanned      5238942.00 (  +0.00%)  2380502.67 ( -54.56%)
Alloc stall                        2371.07 (  +0.00%)      638.87 ( -73.02%)
Pages kswapd scanned            2160926.73 (  +0.00%)  4002186.33 ( +85.21%)
Pages kswapd reclaimed           533191.07 (  +0.00%)   718577.80 ( +34.77%)
Pages direct scanned             400450.33 (  +0.00%)   355172.73 ( -11.31%)
Pages direct reclaimed            94441.73 (  +0.00%)    31162.80 ( -67.00%)
Pages total scanned             2561377.07 (  +0.00%)  4357359.07 ( +70.12%)
Pages total reclaimed            627632.80 (  +0.00%)   749740.60 ( +19.46%)
Swap out                          47959.53 (  +0.00%)   110084.33 (+129.53%)
Swap in                            7276.00 (  +0.00%)    24457.00 (+236.10%)
File refaults                    138043.00 (  +0.00%)   188226.93 ( +36.35%)

THP latencies are cut in half, and failure rates are cut by 75%. These
gains also hold up over time, whereas the vanilla kernel shows a steady
downward trend in success rates with each subsequent run, owing to the
cumulative effects of fragmentation.

A more detailed discussion of results is in the patch changelogs.

The patches first introduce a vm.defrag_mode sysctl, which enforces
the existing ALLOC_NOFRAGMENT alloc flag until after reclaim and
compaction have run. They then change kswapd and kcompactd to target
pageblocks, which boosts success in the ALLOC_NOFRAGMENT hotpaths.
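
As a rough illustration of the gating described above (a sketch under
assumptions, not the actual patch: the helper name, its parameters and
the flag value are placeholders; ALLOC_NOFRAGMENT itself is the
existing allocator-internal flag):

#include <stdbool.h>

/* Placeholder value for the example; the real flag is defined in mm/internal.h. */
#define ALLOC_NOFRAGMENT	0x100

/*
 * Sketch: with vm.defrag_mode enabled, keep insisting on allocations
 * from same-migratetype pageblocks, and only permit fragmenting
 * fallbacks once reclaim and compaction have had a chance to run.
 */
static unsigned int defrag_mode_alloc_flags(bool defrag_mode,
					    bool tried_reclaim_compaction)
{
	unsigned int alloc_flags = 0;

	if (defrag_mode && !tried_reclaim_compaction)
		alloc_flags |= ALLOC_NOFRAGMENT;

	return alloc_flags;
}

The knob itself would be flipped at runtime, e.g. with something like
"sysctl vm.defrag_mode=1"; see the sysctl documentation added by this
series for the accepted values.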

Main differences to the RFC:

- The freelist hygiene patches have since been upstreamed separately.

- The RFC version would prohibit fallbacks entirely and make
pageblock reclaim and compaction mandatory for all allocation
contexts. This opened up a large dependency graph: on compaction, on
any remaining sources of block pollution, and on the handling of
low-memory situations, OOMs and deadlocks.

This version uses only kswapd & kcompactd to pre-produce pageblocks,
while still allowing last-ditch fallbacks to avoid memory deadlocks.

The long-term goal remains converging on the version proposed in the
RFC and its ~100% THP success rate. But this is reserved for future
iterations that can build on the changes proposed here.

- The RFC version proposed a new MIGRATE_FREE type as well as
per-migratetype counters. This allowed making compaction more
efficient, and the pre-compaction gap checks more precise, but again
at the cost of complex changes in an already invasive series.

This series simply uses a new vmstat counter that tracks the number of
free pages in whole blocks, and bases the reclaim/compaction goals on
it; see the sketch after this list.

- The behavior is opt-in and can be toggled at runtime. The risk of
regressions with any allocator change is sizable, and while many
users care about huge pages, obviously not all do. A runtime knob is
warranted to make the behavior optional and provide an escape hatch.
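
To make the vmstat-counter point above concrete, here is a minimal
sketch of basing the background goal on whole-block free memory (the
struct, names and block size are illustrative assumptions, not the
exact counters or helpers used in the patches):

#include <stdbool.h>

/* e.g. 2MB pageblocks with 4kB base pages */
#define PAGEBLOCK_NR_PAGES	512

struct zone_free_state {
	unsigned long free_pages;		/* all free pages in the zone */
	unsigned long free_pages_in_blocks;	/* free pages in fully free pageblocks */
};

/*
 * Should kswapd/kcompactd keep working toward the huge page goal?
 * Only whole-block free memory counts; scattered free pages inside
 * polluted blocks don't help huge page allocations.
 */
static bool below_block_goal(const struct zone_free_state *z,
			     unsigned long wanted_blocks)
{
	return z->free_pages_in_blocks / PAGEBLOCK_NR_PAGES < wanted_blocks;
}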

Based on today's akpm/mm-unstable.

Patches #1 and #2 are somewhat unrelated cleanups, but they touch the
same code and are included here to avoid conflicts from re-ordering.

 Documentation/admin-guide/sysctl/vm.rst |  9 ++++
 include/linux/compaction.h              |  5 +-
 include/linux/mmzone.h                  |  1 +
 mm/compaction.c                         | 87 ++++++++++++++++++++-----------
 mm/internal.h                           |  1 +
 mm/page_alloc.c                         | 72 +++++++++++++++++++++----
 mm/vmscan.c                             | 41 ++++++++++-----
 mm/vmstat.c                             |  1 +
 8 files changed, 161 insertions(+), 56 deletions(-)