[RFC PATCH 37/45] mm/slub: kvmalloc — add __GFP_NORETRY to large-kmalloc attempt

From: Rik van Riel

Date: Thu Apr 30 2026 - 16:26:06 EST


From: Rik van Riel <riel@xxxxxxxx>

kvmalloc's contract is "try physically contiguous memory first; fall
back to vmalloc on failure." For size > PAGE_SIZE, kmalloc_gfp_adjust
already adds __GFP_NOWARN and, unless __GFP_RETRY_MAYFAIL is set,
strips __GFP_DIRECT_RECLAIM to make the kmalloc attempt
non-disruptive. But without __GFP_DIRECT_RECLAIM the request takes the
page allocator's atomic-allocation retry chain in
get_page_from_freelist, which progressively relaxes ALLOC_NOFRAGMENT
(first adding ALLOC_NOFRAG_TAINTED_OK, then dropping ALLOC_NOFRAGMENT
entirely) because atomic allocations have no slowpath escape and need
every chance to succeed.

For large kvmalloc requests this is the wrong trade-off: there is a
slowpath escape (the vmalloc fallback). Tainting a previously clean
superpageblock (SPB) to satisfy the kmalloc attempt costs more than
letting it fail and calling vmalloc, because the SPB stays tainted for
the rest of the workload's lifetime and blocks 1 GiB hugepage
allocation from that region.

Add __GFP_NORETRY in the same conditional that strips __GFP_DIRECT_RECLAIM.
The page allocator's NORETRY-skip exit (mm/page_alloc.c) treats this
as the documented "caller has a fallback" signal and returns NULL
immediately instead of relaxing ALLOC_NOFRAGMENT. kvmalloc then runs
its existing vmalloc fallback as designed.
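
The interaction described above can be modelled in user space. The
sketch below is not the code from mm/page_alloc.c: the MODEL_* bit
values, struct model_zone and the outcome names are hypothetical, and
it assumes the NORETRY exit skips both relaxation steps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values; the kernel's real definitions differ. */
#define MODEL_ALLOC_NOFRAGMENT        0x1u
#define MODEL_ALLOC_NOFRAG_TAINTED_OK 0x2u

/* Toy view of where free memory is available for this request. */
struct model_zone {
	bool compatible_free;   /* usable without fragmenting anything */
	bool tainted_spb_free;  /* free space in an already tainted SPB */
	bool clean_spb_free;    /* free space only in a clean SPB */
};

enum model_outcome {
	MODEL_FAIL,             /* attempt returns NULL */
	MODEL_COMPATIBLE,       /* satisfied without fragmentation */
	MODEL_USED_TAINTED_SPB, /* reused an already tainted SPB */
	MODEL_TAINTED_CLEAN_SPB /* tainted a previously clean SPB */
};

/* One pass over the toy zone with the given allocation flags. */
static enum model_outcome model_try_once(const struct model_zone *z,
					 unsigned int alloc_flags)
{
	bool nofrag = alloc_flags & MODEL_ALLOC_NOFRAGMENT;
	bool tainted_ok = alloc_flags & MODEL_ALLOC_NOFRAG_TAINTED_OK;

	assert(z != NULL);

	if (z->compatible_free)
		return MODEL_COMPATIBLE;
	if ((!nofrag || tainted_ok) && z->tainted_spb_free)
		return MODEL_USED_TAINTED_SPB;
	if (!nofrag && z->clean_spb_free)
		return MODEL_TAINTED_CLEAN_SPB;
	return MODEL_FAIL;
}

/*
 * Retry chain for a request without direct reclaim. With noretry set,
 * the model gives up after the strict attempt so that the caller's own
 * fallback (vmalloc, for kvmalloc) can run instead.
 */
static enum model_outcome model_alloc(const struct model_zone *z, bool noretry)
{
	unsigned int alloc_flags = MODEL_ALLOC_NOFRAGMENT;
	enum model_outcome out = model_try_once(z, alloc_flags);

	if (out != MODEL_FAIL)
		return out;

	if (noretry)
		return MODEL_FAIL; /* caller has a fallback */

	/* First relaxation: allow already tainted SPBs. */
	alloc_flags |= MODEL_ALLOC_NOFRAG_TAINTED_OK;
	out = model_try_once(z, alloc_flags);
	if (out != MODEL_FAIL)
		return out;

	/* Second relaxation: drop the no-fragmentation constraint. */
	alloc_flags &= ~(MODEL_ALLOC_NOFRAGMENT | MODEL_ALLOC_NOFRAG_TAINTED_OK);
	return model_try_once(z, alloc_flags);
}
```

In this model, a zone whose only free space is in a clean SPB yields
MODEL_TAINTED_CLEAN_SPB without noretry and MODEL_FAIL with it, which
is the behavioural difference this patch relies on.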

kvmalloc's documented contract already disallows callers passing
__GFP_NORETRY directly (see the comment block above
__kvmalloc_node_noprof), so adding it internally cannot surprise
any existing caller.
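
As an illustration, the flag adjustment made by this patch can be
mirrored in a small user-space model. The M_GFP_* values, M_PAGE_SIZE
and model_gfp_t are stand-ins, not the kernel's definitions, and the
assert on the caller's flags is a model-only check of the documented
contract.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in flag bits for illustration; not the kernel's values. */
#define M_GFP_DIRECT_RECLAIM 0x01u
#define M_GFP_KSWAPD_RECLAIM 0x02u
#define M_GFP_NOWARN         0x04u
#define M_GFP_RETRY_MAYFAIL  0x08u
#define M_GFP_NORETRY        0x10u
#define M_GFP_NOFAIL         0x20u
#define M_PAGE_SIZE          4096u

typedef unsigned int model_gfp_t;

/* Mirrors the size > PAGE_SIZE branch of kmalloc_gfp_adjust after the patch. */
static model_gfp_t model_kmalloc_gfp_adjust(model_gfp_t flags, size_t size)
{
	/* Model-only: callers are documented not to pass NORETRY. */
	assert(!(flags & M_GFP_NORETRY));

	if (size > M_PAGE_SIZE) {
		flags |= M_GFP_NOWARN;

		if (!(flags & M_GFP_RETRY_MAYFAIL)) {
			/* Best-effort attempt: no direct reclaim, no relaxation. */
			flags &= ~M_GFP_DIRECT_RECLAIM;
			flags |= M_GFP_NORETRY;
		}

		/* nofail semantic is implemented by the vmalloc fallback */
		flags &= ~M_GFP_NOFAIL;
	}

	return flags;
}
```

Requests with __GFP_RETRY_MAYFAIL keep direct reclaim and do not gain
__GFP_NORETRY in this model, and requests of PAGE_SIZE or less pass
through unchanged.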

Observed on a 247 GB devvm running the page-superblock v18
series: a `below` process reading a /proc/sys file via
kvmalloc(buf, GFP_USER) tainted a fresh clean SPB about 47 minutes
after boot via __kmalloc_large_node → alloc_pages_mpol. A
tls-cert-validator did the same a minute later. Both were "best
effort" allocations with vmalloc as their existing fallback, so
neither should have tainted a clean SPB.

Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
 mm/slub.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 2b2d33cc735c..fa422d245a53 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -6703,13 +6703,24 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
 	 * However make sure that larger requests are not too disruptive - i.e.
 	 * do not direct reclaim unless physically continuous memory is preferred
 	 * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to
-	 * start working in the background
+	 * start working in the background.
+	 *
+	 * Also signal __GFP_NORETRY: the vmalloc fallback IS our retry path,
+	 * so the page allocator should not go to extreme lengths (e.g.
+	 * tainting a previously-clean superpageblock from the page-superblock
+	 * series) just to satisfy the kmalloc attempt. The atomic-allocation
+	 * relaxation logic in get_page_from_freelist treats __GFP_NORETRY as
+	 * "caller has a fallback" and returns NULL early instead of dropping
+	 * ALLOC_NOFRAGMENT. kvmalloc's documented contract already disallows
+	 * callers passing __GFP_NORETRY directly, so adding it here is safe.
 	 */
 	if (size > PAGE_SIZE) {
 		flags |= __GFP_NOWARN;
 
-		if (!(flags & __GFP_RETRY_MAYFAIL))
+		if (!(flags & __GFP_RETRY_MAYFAIL)) {
 			flags &= ~__GFP_DIRECT_RECLAIM;
+			flags |= __GFP_NORETRY;
+		}
 
 		/* nofail semantic is implemented by the vmalloc fallback */
 		flags &= ~__GFP_NOFAIL;
--
2.52.0