Re: [PATCH v2 19/22] mm/page_alloc: implement __GFP_UNMAPPED allocations
From: Brendan Jackman
Date: Thu Jun 11 2026 - 10:47:04 EST
On Mon Jun 1, 2026 at 8:50 AM UTC, Vlastimil Babka (SUSE) wrote:
> On 5/29/26 17:02, Brendan Jackman wrote:
>> On Fri May 15, 2026 at 4:46 PM UTC, Brendan Jackman wrote:
>>> On Wed May 13, 2026 at 3:43 PM UTC, Vlastimil Babka (SUSE) wrote:
>> [...]
>>>> Uhh, speaking of compaction and reclaim... we rely on finding a whole free
>>>> pageblock in order to flip it. If that doesn't exist, the whole
>>>> get_page_from_freelist() will fail, and we might enter the
>>>> reclaim/compaction cycle in __alloc_pages_slowpath(). But since we might
>>>> ultimately want an order-0 allocation, there won't be any compaction
>>>> attempted, because that code won't know we failed to flip a pageblock. And
>>>> the watermarks might look good and prevent reclaim as well I think? We
>>>> should somehow indicate this, and handle accordingly. Might not be trivial.
>>>> Or maybe reuse pageblock isolation code to do the migrations directly in
>>>> __rmqueue_direct_map?
>>>
>>> Ah, thanks, I suspect you are right.
>>>
>>> I did fear there would be some sort of case where this "not-quite
>>> reclaim" interacted badly with the actual reclaim, and I tried to test
>>> it by running some stuff in parallel with stress-ng (allocating
>>> __GFP_UNMAPPED via secretmem), and I didn't see a difference in the
>>> effective availability of memory. However, I suspect testing this is
>>> quite a deep art and my "run these two commands that I copy pasted from an
>>> LLM suggestion" test was just crap.
>>>
>>> Do you have any workloads you can suggest for evaluating this kinda
>>> thing? We would definitely see it in Google prod (I think we see this
>>> kind of issue with our shrinker-based internal version of ASI distorting
>>> reclaim behaviour in ways even more subtle than this) but that is not a
>>> very practical experimental cycle...
>>
>> I slop-coded a benchmark:
>>
>> https://github.com/bjackman/kernel-benchmarks-nix/tree/master/packages/benchmarks/secretmem-vs-frag
>>
>> It does some mmap/munmap patterns to try and generate fragmentation,
>> then spams secretmem allocations until it gets OOM-killed.
>>
>> With this series, I see the OOM-kills happening noticeably sooner on a
>> 1GiB VM:
>>
>> metric: secretmem_allocated_bytes (B) | test: secretmem-vs-frag
>> +---------------------------------------------+---------+-------------+-------------+-----------------+-------------+-------+
>> | kernel_release | samples | mean | min | histogram | max | Δμ |
>> +---------------------------------------------+---------+-------------+-------------+-----------------+-------------+-------+
>> | 7.0.0-rc4-next-20260319 | 4 | 683,147,264 | 643,825,664 | █ | 715,128,832 | |
>> | 7.0.0-rc4-next-20260319-00028-gf00246eb72cd | 3 | 623,553,195 | 551,550,976 | ███ | 692,060,160 | -8.7% |
>> +---------------------------------------------+---------+-------------+-------------+-----------------+-------------+-------+
>>
>> So... I think maybe I've reproduced the issue you pointed out? I will
>> try and fix it and see if this degradation goes away.
>
> Since I assume the fragmenting allocations are movable allocations, it
> might be the case, yeah.
Alright, so I tried splitting NR_FREE_PAGES_BLOCKS into two counters to
track mapped vs unmapped blocks. Then I gave
compaction_suit_allocation_order() an 'unmapped' flag:
@@ -2510,19 +2510,39 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
static enum compact_result
compaction_suit_allocation_order(struct zone *zone, unsigned int order,
int highest_zoneidx, unsigned int alloc_flags,
- bool async, bool kcompactd)
+ bool unmapped, bool async, bool kcompactd)
{
unsigned long free_pages;
unsigned long watermark;
- if (kcompactd && defrag_mode)
+ /*
+ * Might need to generate a whole free block regardless of the actual
+ * allocation order:
+ *
+ * - When allocating an unmapped page, because the allocator only unmaps
+ * whole blocks at a time.
+ *
+ * Why doesn't this apply to the other way around too? (Mightn't we
+ * need to _map_ a whole block?) This is a temporary simplification:
+ * currently, unmapped blocks don't contain movable pages, so
+ * compaction isn't going to free up one of those.
+ *
+ * - In defrag_mode, because the allocator is unwilling to "steal" pages
+ * from the "wrong" block.
+ *
+ * Why is this only under kcompactd?
+ *
+ * Temporary simplification: unmapped pageblocks are currently
+ * nonmovable. So if the compactor is trying to service a
+ */
+ if (unmapped)
+ free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS_MAPPED);
+ else if (kcompactd && defrag_mode)
free_pages = zone_free_pages_blocks(zone);
else
free_pages = zone_page_state(zone, NR_FREE_PAGES);
... Then, I changed __alloc_pages_direct_compact() to try to compact
for a whole block whenever we are trying to allocate an unmapped page
(note I think there's an orthogonal bug here where it leaks memory when
there's a "captured" compaction):
index 4f04e897c5374..7eed22f3b26eb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -824,6 +824,9 @@ compaction_capture(struct capture_control *capc, struct page *page,
capc_mt != MIGRATE_MOVABLE)
return false;
+ if (freetype_flags(freetype) != freetype_flags(capc->cc->freetype))
+ return false;
+
if (migratetype != capc_mt)
trace_mm_page_alloc_extfrag(page, capc->cc->order, order,
capc_mt, migratetype);
@@ -4469,20 +4472,27 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct page *page = NULL;
unsigned long pflags;
unsigned int noreclaim_flag;
+ unsigned int compact_order = order;
- if (!order)
+ // TODO: Is it OK to always run compaction like this?
+ /*
+ * Unmapped allocations benefit from compaction even at order 0, because the
+ * allocator will actually grab a whole block.
+ */
+ if (freetype_flags(ac->freetype) & FREETYPE_UNMAPPED)
+ compact_order = pageblock_order;
+
+ if (!compact_order)
return NULL;
psi_memstall_enter(&pflags);
delayacct_compact_start();
noreclaim_flag = memalloc_noreclaim_save();
- *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
- prio, &page);
+ // TODO: deal with captured page, if we changed the order it will have the
+ // wrong order. Also check it respects the freetype flags.
+ *compact_result = try_to_compact_pages(gfp_mask, compact_order,
+ alloc_flags, ac, prio, &page);
memalloc_noreclaim_restore(noreclaim_flag);
psi_memstall_leave(&pflags);
Full code:
https://github.com/bjackman/linux/tree/page_alloc-unmapped-2026-06-11
This makes the regression above (faster OOMs) go away, but it seems like
a pretty blunt approach. But then I'm realising I don't really know why
the bluntness matters? The main cost is presumably that we are more
likely to pointlessly attempt compaction, or to compact more than we need. But in that
case, aren't we already in a desperately slow path? Does a little bit of
extra work in __alloc_pages_direct_compact() really matter? I couldn't
measure it in a benchmark (kernel compilation alongside stress-ng
--secretmem).
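
To restate the two decisions from the diffs above outside of kernel
context, here is a minimal userspace sketch. Everything in it is a
hypothetical stand-in: the struct, the constant values and the helper
names are made up for illustration, and I am assuming that
zone_free_pages_blocks() amounts to the sum of the two new counters. It
only mirrors the branching, not the real allocator:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values only, not the kernel's definitions. */
#define PAGEBLOCK_ORDER   9u
#define FREETYPE_UNMAPPED 0x1u

/* Hypothetical per-zone counters, in pages. */
struct zone_counters {
	unsigned long free_pages;                 /* all free pages */
	unsigned long free_pages_blocks_mapped;   /* pages in free mapped blocks */
	unsigned long free_pages_blocks_unmapped; /* pages in free unmapped blocks */
};

/*
 * Order that direct compaction targets. Unmapped allocations always ask
 * for a whole pageblock, because the allocator only unmaps whole blocks.
 */
static unsigned int compact_order_for(unsigned int order,
				      unsigned int freetype_flags)
{
	if (freetype_flags & FREETYPE_UNMAPPED)
		return PAGEBLOCK_ORDER;
	return order;
}

/* Which free-page count the suitability check compares to the watermark. */
static unsigned long suitable_free_pages(const struct zone_counters *z,
					 bool unmapped, bool kcompactd,
					 bool defrag_mode)
{
	if (unmapped)
		return z->free_pages_blocks_mapped;
	if (kcompactd && defrag_mode) /* assumed: sum of both block counters */
		return z->free_pages_blocks_mapped +
		       z->free_pages_blocks_unmapped;
	return z->free_pages;
}

/* Small usage example: order-0 unmapped requests still trigger compaction. */
static void sketch_self_check(void)
{
	struct zone_counters z = { 2000, 512, 1024 };

	assert(compact_order_for(0, FREETYPE_UNMAPPED) == PAGEBLOCK_ORDER);
	assert(compact_order_for(0, 0) == 0);
	assert(suitable_free_pages(&z, true, false, false) == 512);
}
```

The point of the sketch is just that an order-0 unmapped request never
hits the "!compact_order" early return, which is the blunt part: every
such request in the slowpath is treated as a pageblock-order compaction
target.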