Re: [PATCH v2 19/22] mm/page_alloc: implement __GFP_UNMAPPED allocations

From: Vlastimil Babka (SUSE)

Date: Mon Jun 01 2026 - 05:02:50 EST

On 5/15/26 18:46, Brendan Jackman wrote:
> On Wed May 13, 2026 at 3:43 PM UTC, Vlastimil Babka (SUSE) wrote:
>> On 3/20/26 19:23, Brendan Jackman wrote:
>>> Currently __GFP_UNMAPPED allocs will always fail because, although the
>>> lists exist to hold them, there is no way to actually create an unmapped
>>> page block. This commit adds one, and also the logic to map it back
>>> again when that's needed.
>>>
>>> Doing this at pageblock granularity ensures that the pageblock flags can
>>> be used to infer which freetype a page belongs to. It also provides nice
>>> batching of TLB flushes, and also avoids creating too much unnecessary
>>> TLB fragmentation in the physmap.
>>>
>>> There are some functional requirements for flipping a block:
>>>
>>> - Unmapping requires a TLB shootdown, meaning IRQs must be enabled.
>>>
>>> - Because the main usecase of this feature is to protect against CPU
>>> exploits, when a block is mapped it needs to be zeroed to ensure no
>>> residual data is available to attackers. Zeroing a block with a
>>> spinlock held seems undesirable.
>>
>> Did I overlook something, or does this patch not do the whole-block zeroing?
>> Or is it handled by set_direct_map_valid_noflush() itself?
>
> Oops. At some point I was planning to defer the zeroing to another
> series. I changed my mind about that, but apparently I forgot to
> actually add the code back.
>
> The code I deleted was in __rmqueue_direct_map() like this:
>
> if (want_mapped) {
> <zero the block>
> } else {
> unsigned long start = (unsigned long)page_address(page);
> unsigned long end = start + (nr_pageblocks << (pageblock_order + PAGE_SHIFT));
>
> flush_tlb_kernel_range(start, end);
> }
>
> But actually I'm not sure that's what we want: At the moment, there's
> actually a race condition when allocating __GFP_UNMAPPED|__GFP_ZERO:
>
> 1. Take page off freelist
> 2. Mermap it
> 3. Zero it
> 4. Mer-unmap it
>
> I don't know, but some sort of CPU attack might support exploiting the
> gap between 2 and 3 to leak any data left behind from a prior
> allocation. (Like, maybe you can get the data into a uarch buffer during
> the race window, then leak that data afterwards at leisure).

I can't imagine how it would work, but then novel CPU attacks might be
beyond my imagination :) But I think we can ignore hypothetical CPU attacks
for now?

> To mitigate that, we might want to effectively enforce
> want_init_on_free() for unmapped blocks. And, if we do that, we
> don't actually need to zero the block when flipping it back to mapped,
> since there shouldn't be any user data in there.
>
> Any thoughts on that? I have not tried to implement it yet, I might be
> missing something that makes it impractical. Also I haven't read that
> series that's doing zeroing through user addresses either, this might
> have an interesting interaction with that.

I think let's try the simplest way first.

>>> - Updating the pagetables might require allocating a pagetable to break
>>> down a huge page. This would deadlock if the zone lock was held.
>>>
>>> This makes allocations that need to change sensitivity _somewhat_
>>> similar to those that need to fallback to a different migratetype. But,
>>> the locking requirements mean that this can't just be squashed into the
>>> existing "fallback" allocator logic, instead a new allocator path just
>>> for this purpose is needed.
>>>
>>> The new path is assumed to be much cheaper than the really heavyweight
>>> stuff like compaction and reclaim. But at present it is treated as less
>>
>> Uhh, speaking of compaction and reclaim... we rely on finding a whole free
>> pageblock in order to flip it. If that doesn't exist, the whole
>> get_page_from_freelist() will fail, and we might enter the
>> reclaim/compaction cycle in __alloc_pages_slowpath(). But since we might
>> ultimately want an order-0 allocation, there won't be any compaction
>> attempted, because that code won't know we failed to flip a pageblock. And
>> the watermarks might look good and prevent reclaim as well I think? We
>> should somehow indicate this, and handle accordingly. Might not be trivial.
>> Or maybe reuse pageblock isolation code to do the migrations directly in
>> __rmqueue_direct_map?
>
> Ah, thanks, I suspect you are right.
>
> I did fear there would be some sort of case where this "not-quite
> reclaim" interacted badly with the actual reclaim, and I tried to test
> it by running some stuff in parallel with stress-ng (allocating
> __GFP_UNMAPPED via secretmem), and I didn't see a difference in the
> effective availability of memory. However, I suspect testing this is
> quite a deep art, and my "run these two commands that I copy pasted from
> an LLM suggestion" test was just crap.
>
> Do you have any workloads you can suggest for evaluating this kinda
> thing? We would definitely see it in Google prod (I think we see this
> kind of issue with our shrinker-based internal version of ASI distorting
> reclaim behaviour in ways even more subtle than this) but that is not a
> very practical experimental cycle...

Your test seems a good way to start.
I realized afterwards that the solution might be something similar to how we
handle ALLOC_NOFRAGMENT.

>>>
>>> +#ifdef CONFIG_PAGE_ALLOC_UNMAPPED
>>> +/* Try to allocate a page by mapping/unmapping a block from the direct map. */
>>> +static inline struct page *
>>> +__rmqueue_direct_map(struct zone *zone, unsigned int request_order,
>>> + unsigned int alloc_flags, freetype_t freetype)
>>> +{
>>> + unsigned int ft_flags_other = freetype_flags(freetype) ^ FREETYPE_UNMAPPED;
>>> + freetype_t ft_other = migrate_to_freetype(free_to_migratetype(freetype),
>>> + ft_flags_other);
>>> + bool want_mapped = !(freetype_flags(freetype) & FREETYPE_UNMAPPED);
>>> + enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
>>
>> Why not RMQUEUE_CLAIM? We want to change the migratetype to ours as well,
>> not just the unmapped flag?
>
> Oh right, actually I think we need to do RMQUEUE_CLAIM _and_
> RMQUEUE_NORMAL (or, some variant of RMQUEUE_CLAIM that also supports
> allocating from blocks that already have the requested migratetype).
>
> If we just switch it over to just RMQUEUE_CLAIM right now, while only
> one migratetype supports FREETYPE_UNMAPPED, I think that would actually
> be broken: When allocating an unmapped block, (want_mapped=true) we
> would always hit the freetype_idx<0 case in find_suitable_fallback().

Right.

> But yeah we do need to do RMQUEUE_CLAIM too otherwise we'll miss
> opportunities to allocate from other unmapped freetypes once those
> exist.
>
>>> + unsigned long irq_flags;
>>> + int nr_pageblocks;
>>> + struct page *page;
>>> + int alloc_order;
>>> + int err;
>>> +
>>> + if (freetype_idx(ft_other) < 0)
>>> + return NULL;
>>> +
>>> + /*
>>> + * Might need a TLB shootdown. Even if IRQs are on this isn't
>>> + * safe if the caller holds a lock (in case the other CPUs need that
>>> + * lock to handle the shootdown IPI).
>>> + */
>>> + if (alloc_flags & ALLOC_NOBLOCK)
>>> + return NULL;
>>> +
>>> + if (!can_set_direct_map())
>>> + return NULL;
>>> +
>>> + lockdep_assert(!irqs_disabled() || unlikely(early_boot_irqs_disabled));
>>> +
>>> + /*
>>> + * Need to [un]map a whole pageblock (otherwise it might require
>>> + * allocating pagetables). First allocate it.
>>> + */
>>> + alloc_order = max(request_order, pageblock_order);
>>> + nr_pageblocks = 1 << (alloc_order - pageblock_order);
>>> + zone_lock_irqsave(zone, irq_flags);
>>> + page = __rmqueue(zone, alloc_order, ft_other, alloc_flags, &rmqm);
>>> + zone_unlock_irqrestore(zone, irq_flags);
>>> + if (!page)
>>> + return NULL;
>>> +
>>> + /*
>>> + * Now that IRQs are on it's safe to do a TLB shootdown, and now that we
>>> + * released the zone lock it's possible to allocate a pagetable if
>>> + * needed to split up a huge page.
>>> + *
>>> + * Note that modifying the direct map may need to allocate pagetables.
>>> + * What about unbounded recursion? Here are the assumptions that make it
>>> + * safe:
>>> + *
>>> + * - The direct map starts out fully mapped at boot. (This is not really
>>> + * an "assumption" as it's in direct control of page_alloc.c).
>>> + *
>>> + * - Once pages in the direct map are broken down, they are not
>>> + * re-aggregated into larger pages again.
>>> + *
>>> + * - Pagetables are never allocated with __GFP_UNMAPPED.
>>> + *
>>> + * Under these assumptions, a pagetable might need to be allocated while
>>> + * _unmapping_ stuff from the direct map during a __GFP_UNMAPPED
>>> + * allocation. But, the allocation of that pagetable never requires
>>> + * allocating a further pagetable.
>>> + */
>>> + err = set_direct_map_valid_noflush(page,
>>> + nr_pageblocks << pageblock_order, want_mapped);
>>> + if (err == -ENOMEM || WARN_ONCE(err, "err=%d\n", err)) {
>>> + zone_lock_irqsave(zone, irq_flags);
>>> + __free_one_page(page, page_to_pfn(page), zone,
>>> + alloc_order, freetype, FPI_SKIP_REPORT_NOTIFY);
>>> + zone_unlock_irqrestore(zone, irq_flags);
>>> + return NULL;
>>> + }
>>> +
>>> + if (!want_mapped) {
>>> + unsigned long start = (unsigned long)page_address(page);
>>> + unsigned long end = start + (nr_pageblocks << (pageblock_order + PAGE_SHIFT));
>>> +
>>> + flush_tlb_kernel_range(start, end);
>>> + }
>>> +
>>> + for (int i = 0; i < nr_pageblocks; i++) {
>>> + struct page *block_page = page + (pageblock_nr_pages * i);
>>> +
>>> + set_pageblock_freetype_flags(block_page, freetype_flags(freetype));
>>> + }
>>> +
>>> + if (request_order >= alloc_order)
>>> + return page;
>>> +
>>> + /* Free any remaining pages in the block. */
>>> + zone_lock_irqsave(zone, irq_flags);
>>> + for (unsigned int i = request_order; i < alloc_order; i++) {
>>> + struct page *page_to_free = page + (1 << i);
>>> +
>>> + __free_one_page(page_to_free, page_to_pfn(page_to_free), zone,
>>> + i, freetype, FPI_SKIP_REPORT_NOTIFY);
>>> + }
>>
>> Could expand() be used here?
>
> Hm, good point. It should probably look like what try_to_claim_block()
> does...
>
> Instead of figuring that out right now I'll just say this: if that works
> I'll do it, if I find a reason why it doesn't I will add a comment
> explaining it in the next version.

Sounds good.
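
For reference while comparing against expand(), here is a minimal userspace
sketch (not kernel code) of the split that the open-coded loop above performs:
for each order i from request_order up to alloc_order - 1, one chunk of order i
starting at page offset 1 << i goes back to the free lists. The names
struct piece, split_remainder() and tiles_block() are hypothetical and exist
only for this illustration; the check merely confirms that the chunks plus the
returned 2^request_order pages cover the allocated block exactly.

```c
#include <assert.h>

#define MAX_PIECES 32	/* hypothetical bound, plenty for any buddy order */

/* One freed chunk: page offset from the block start, and its order. */
struct piece {
	unsigned long offset;
	unsigned int order;
};

/*
 * Model of the loop in the patch: for i in [request_order, alloc_order),
 * the chunk at offset (1 << i) is freed at order i. Returns the number of
 * chunks written to @out.
 */
static unsigned int split_remainder(unsigned int request_order,
				    unsigned int alloc_order,
				    struct piece *out)
{
	unsigned int n = 0;

	assert(alloc_order < MAX_PIECES);
	for (unsigned int i = request_order; i < alloc_order; i++) {
		out[n].offset = 1UL << i;
		out[n].order = i;
		n++;
	}
	return n;
}

/*
 * Returns 1 if the first 2^request_order pages followed by the @n chunks
 * tile [0, 2^alloc_order) with no gaps or overlaps, and every chunk is
 * naturally aligned to its own order. Returns 0 otherwise.
 */
static int tiles_block(unsigned int request_order, unsigned int alloc_order,
		       const struct piece *pieces, unsigned int n)
{
	unsigned long next = 1UL << request_order;

	for (unsigned int i = 0; i < n; i++) {
		if (pieces[i].offset != next)
			return 0;
		if (pieces[i].offset & ((1UL << pieces[i].order) - 1))
			return 0;
		next += 1UL << pieces[i].order;
	}
	return next == (1UL << alloc_order);
}
```

With request_order = 0 and alloc_order = 9 this yields nine chunks of orders
0 through 8, which is the same ascending layout a buddy split leaves behind,
so the question is only whether the real expand() bookkeeping fits the locking
and freetype handling here.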

> BTW my thinking is that clarity is the only important factor here, I am
> confident that any speedup from this would disappear in the noise of the
> TLB flushing etc. But, if it works then yeah I think it would actually
> be clearer.

Sure, clarity is important, but if we have multiple functions doing a
similar thing instead of sharing code, there's also a risk that a future
change to the more common code will miss the new one, etc.

> Thanks very much for this review, I really appreciate it!