[PATCH RFC 00/19] mm: Add __GFP_UNMAPPED
From: Brendan Jackman
Date: Wed Feb 25 2026 - 11:40:29 EST
.:: What? Why?
This series adds support for efficiently allocating pages that are not
present in the direct map. This is instrumental to two different
immediate goals:
1. This supports the effort to remove guest_memfd memory from the direct
map [0]. One of the challenges faced in that effort has been
efficiently eliminating TLB entries; this series offers a solution to
that problem.
2. Address Space Isolation (ASI) [1] also needs an efficient way to
allocate pages that are missing from the direct map. Although for ASI
the needs are slightly different (in that case, the pages need only
be removed from ASI's special pagetables), the most interesting mm
challenges are basically the same.
So, __GFP_UNMAPPED serves as a Trojan horse to get the page allocator
into a state where adding ASI's features "Should Be Easy".
This series _also_ serves as a Trojan horse for the "mermap" (details
below) which is also a key building block for making ASI efficient.
Longer term, there are a wide range of security techniques unlocked by
being able to efficiently remove pages from the kernel's address space.
There may also be non-security usecases for this feature, for example
at LPC Sumit Garg presented an issue with memory-firewalled client
devices that could be remediated by __GFP_UNMAPPED [2].
.:: Design
The key design elements introduced here are just repurposed from
previous attempts to directly introduce ASI's needs to the page
allocator [3]. The only real difference is that now these support
totally unmapping stuff from the direct map, instead of only unmapping
it from ASI's special pagetables.
.:::: Design: Introducing "freetypes"
The biggest challenge for efficiently getting stuff out of the direct
map is TLB flushing. Pushing this problem into the page allocator turns
out to enable amortising that flush cost into almost nothing. The core
idea is to have pools of already-unmapped pages. We'd like those pages
to be physically contiguous so they don't unduly fragment the pagetables
around them, and we'd like to be able to efficiently look up these
already-unmapped pages during allocation. The page allocator already has
deeply-ingrained functionality for physically grouping pages by a
certain attribute, and then indexing free pages by that attribute. This
mechanism is migratetypes.
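To make the amortisation argument concrete, here is a minimal userspace
model (not kernel code). It assumes that unmapping a whole pageblock
costs a single TLB flush and that a pageblock holds 512 pages; both
numbers, and every name below, are illustrative assumptions rather than
anything taken from the patches.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: one pageblock = 512 base pages (2MiB of 4KiB pages). */
#define PAGES_PER_BLOCK 512

/* Hypothetical model of a pool of already-unmapped pages. */
struct unmapped_pool {
	size_t free_pages;   /* unmapped pages ready to be handed out */
	size_t tlb_flushes;  /* simulated flushes paid for so far */
};

/* Naive scheme: unmap at allocation time, one flush per page. */
static size_t naive_alloc_flushes(size_t nr_pages)
{
	return nr_pages;
}

/* Pooled scheme: only flush when a whole new block has to be unmapped. */
static void pool_alloc_page(struct unmapped_pool *pool)
{
	if (pool->free_pages == 0) {
		pool->free_pages = PAGES_PER_BLOCK;
		pool->tlb_flushes++;
	}
	pool->free_pages--;
}

static size_t pooled_alloc_flushes(size_t nr_pages)
{
	struct unmapped_pool pool = { 0, 0 };

	for (size_t i = 0; i < nr_pages; i++)
		pool_alloc_page(&pool);
	return pool.tlb_flushes;
}
```

Under these assumptions, allocating 1024 pages costs 1024 flushes in the
naive scheme and 2 in the pooled one. The model ignores frees; pages
that return to the pool while still unmapped would reduce the flush
count further.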
So basically, this series extends the concepts of migratetypes in the
allocator so that as well as just representing mobility, they can
represent other properties of the page too. (Actually, migratetypes are
already sort of overloaded, but the main extension is to be able to
represent _orthogonal_ properties). In order to avoid further
overloading the concept of a migratetype, this extension is done by
adding a new concept on top of migratetype: the _freetype_. A freetype
is basically just a migratetype plus some flags, and it replaces
migratetypes wherever the latter is currently used to index free
pages.
The first freetype flag is then added, which marks the pages it indexes
as being absent from the direct map. This is then used to implement the
new __GFP_UNMAPPED flag, which allocates pages from pageblocks that have
the new flag, or unmaps pages if no existing ones are already available.
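As a rough illustration of "a migratetype plus some flags", the sketch
below wraps both in a struct and derives a free-list index from the
pair. Only the name freetype_t comes from the series; the enum values,
the flag name, the helpers and the index layout are hypothetical and
much simpler than the real definitions in the patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, cut-down set of migratetypes. */
enum migratetype { MT_UNMOVABLE, MT_MOVABLE, MT_RECLAIMABLE, MT_NR };

/* Hypothetical flag: pages of this freetype are absent from the direct map. */
#define FT_FLAG_UNMAPPED  (1u << 0)
#define FT_NR_FLAG_COMBOS 2u
#define NR_FREE_LISTS     (MT_NR * FT_NR_FLAG_COMBOS)

/*
 * Wrapping the pair in a struct means a bare migratetype cannot be
 * passed where a freetype is expected without a compile error.
 */
typedef struct {
	enum migratetype mt;
	unsigned int flags;
} freetype_t;

static freetype_t make_freetype(enum migratetype mt, unsigned int flags)
{
	freetype_t ft = { mt, flags };

	return ft;
}

static bool freetype_unmapped(freetype_t ft)
{
	return ft.flags & FT_FLAG_UNMAPPED;
}

/* Each (flags, migratetype) combination gets its own free list. */
static unsigned int freetype_index(freetype_t ft)
{
	return ft.flags * MT_NR + ft.mt;
}
```

In this toy layout, make_freetype(MT_MOVABLE, 0) and
make_freetype(MT_MOVABLE, FT_FLAG_UNMAPPED) index different free lists,
so an allocation that wants unmapped pages never has to search through
mapped ones.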
.:::: Design: Introducing the "mermap"
Sharp readers might by now be asking how __GFP_UNMAPPED interacts with
__GFP_ZERO. If pages aren't in the direct map, how can the page
allocator zero them? The solution is the "mermap", short for "epheMERal
mapping". The mermap provides an efficient way to temporarily map pages
into the local address space, and the allocator uses these mappings to
zero pages.
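The map/zero/unmap pattern can be shown with a loose userspace analogy.
This is not how the mermap is implemented: here the "unmapped page" is
simply a page of a temporary file that is not mapped anywhere, and the
"ephemeral mapping" is a short-lived mmap(). The function names are
made up for this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map one page of fd just long enough to zero it, then drop the mapping. */
static int zero_page_via_temp_mapping(int fd, off_t offset, size_t page_size)
{
	void *va = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
			MAP_SHARED, fd, offset);

	if (va == MAP_FAILED)
		return -1;
	memset(va, 0, page_size);
	return munmap(va, page_size);
}

/* Returns 0 if only the second of two pages ends up zeroed. */
static int demo_zero_second_page(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	const unsigned char pattern = 0xaa;
	unsigned char first = 0, second = 0xff;
	FILE *backing = tmpfile();
	int fd, ret = -1;

	if (!backing)
		return -1;
	assert(page_size > 0);
	fd = fileno(backing);

	/* Two backing pages, each starting with a non-zero byte. */
	if (ftruncate(fd, 2 * page_size) != 0 ||
	    pwrite(fd, &pattern, 1, 0) != 1 ||
	    pwrite(fd, &pattern, 1, page_size) != 1)
		goto out;

	/* Zero only the second page through a short-lived mapping. */
	if (zero_page_via_temp_mapping(fd, page_size, (size_t)page_size) != 0)
		goto out;

	if (pread(fd, &first, 1, 0) != 1 ||
	    pread(fd, &second, 1, page_size) != 1)
		goto out;

	/* The first page is untouched; the second now reads as zero. */
	ret = (first == pattern && second == 0) ? 0 : -1;
out:
	fclose(backing);
	return ret;
}
```

The analogy only covers the lifetime of the mapping. It says nothing
about the properties that matter for the real mermap, such as the
mapping being local to one mm or how TLB entries for it are handled.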
Using the mermap securely requires some knowledge about the usage of the
pages. One slightly awkward part of this design is that the page
allocator's usage of the mermap then "leaks" out so that callers who
allocate with __GFP_UNMAPPED|__GFP_ZERO need to be aware of the mermap's
security implications. For the guest_memfd unmapping usecase, that means
when guest_memfd.c makes these special allocations, it is only safe
because the pages will belong to the current process. In other words,
the use of the mermap potentially allows that process to leak the pages
via CPU sidechannels (unless more holistic/expensive mitigations are
enabled).
Since this cover letter is already too long I won't describe most
details of the mermap here; please see the patch that introduces it.
But one key detail is that it requires a kernel-space but mm-local
virtual address region. So... this series adds that too (for x86). This
is called the mm-local region and is implemented by "just" extending and
generalising the LDT remap area.
.:: Outline of the patchset
- Patches 1 -> 2 introduce the mm-local region for x86
- Patches 3 -> 5 introduce the mermap
- Patches 6 -> 14 introduce freetypes
- Patch 8 in particular is the big annoying switch-over which changes
a whole bunch of code from "migratetype" to "freetype". In order to
try and have the compiler help out with catching bugs, this is done
with an annoying typedef. I'm sorry that this patch is so annoying,
but I think if we do want to extend the allocator along these lines
then a typedef + big annoying patch is probably the safest way.
- Patches 15 -> 19 introduce __GFP_UNMAPPED
.:: Why [RFC]?
I really wanted to stop sending RFC and start sending PATCHes but
getting this series out has taken months longer than I expected, so it's
time to get something on the list. The known issues here are:
1. __GFP_UNMAPPED isn't useful until guest_memfd unmapping support
[0] gets merged.
2. Apparently while implementing the mm-local region, I totally forgot
that KPTI existed on 32-bit systems. I expect the 0-day bot to fire a
failure on that patch.
There is also one really nasty hack in mermap.c, namely
set_unmapped_pte(). This is basically a symptom of the problem I
propose to discuss at LSF/MM/BPF [3], i.e. the fact that there are
lots of pagetable libraries yet none of them are flexible enough to do
anything new (in this case the "new thing" is pre-allocating pagetables
then subsequently populating them in a separate context). Whether this
particular hack should block merging the mermap is not clear to me; I'd
be interested to hear opinions.
.:: Performance
In [4] is a branch containing:
1. This series.
2. All the key kernel patches from the Firecracker team's "secret-free"
effort, which includes guest_memfd unmapping ([0]).
3. Some prototype patches to switch guest_memfd over from an ad-hoc
unmapping logic to use of __GFP_UNMAPPED (plus direct use of the
mermap to implement write()).
I benchmarked this using Firecracker's own performance tests [4], which
measure the time required to populate the VM guest's memory. This
population happens via write() so it exercises the mermap. I ran this on
a Sapphire Rapids machine [5]. The baseline here is just the secret-free
patches on their own. "gfp_unmapped" is the branch described above.
"skip-flush" provides a reference against an implementation that just
skips flushing the TLB when unmapping guest_memfd pages, which serves as
an upper-bound on performance.
metric: populate_latency (ms) | test: firecracker-perf-tests-wrapped
+---------------+---------+----------+----------+------------------------+----------+--------+
| nixos_variant | samples | mean     | min      | histogram              | max      | Δμ     |
+---------------+---------+----------+----------+------------------------+----------+--------+
|               | 30      | 1.04s    | 1.02s    | █                      | 1.10s    |        |
| gfp_unmapped  | 30      | 313.02ms | 299.48ms | █                      | 343.25ms | -70.0% |
| skip-flush    | 30      | 325.80ms | 307.91ms | █                      | 333.30ms | -68.8% |
+---------------+---------+----------+----------+------------------------+----------+--------+
Conclusion: it's close to the best-case performance for this particular
workload. (Note that in the sample above the gfp_unmapped mean is
actually faster than the skip-flush mean; that's noise, this isn't a
consistent observation.)
[0] [PATCH v10 00/15] Direct Map Removal Support for guest_memfd
https://lore.kernel.org/all/20260126164445.11867-1-kalyazin@xxxxxxxxxx/
[1] https://linuxasi.dev/
[2] https://lpc.events/event/19/contributions/2095/
[3] https://lore.kernel.org/all/20260219175113.618562-1-jackmanb@xxxxxxxxxx/
[4] https://github.com/bjackman/kernel-benchmarks-nix/blob/fd56c93344760927b71161368230a15741a5869f/packages/benchmarks/firecracker-perf-tests/firecracker-perf-tests.sh
[5] https://github.com/bjackman/aethelred/blob/eb0dd0e99ee08fa0534733113e93b89499affe91
Cc: linux-mm@xxxxxxxxx
Cc: linux-kernel@xxxxxxxxxxxxxxx
Cc: x86@xxxxxxxxxx
Cc: rppt@xxxxxxxxxx
Cc: Sumit Garg <sumit.garg@xxxxxxxxxxxxxxxx>
To: Borislav Petkov <bp@xxxxxxxxx>
To: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
To: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
To: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
To: David Hildenbrand <david@xxxxxxxxxx>
To: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
To: Vlastimil Babka <vbabka@xxxxxxxxxx>
To: Mike Rapoport <rppt@xxxxxxxxxx>
To: Wei Xu <weixugc@xxxxxxxxxx>
To: Johannes Weiner <hannes@xxxxxxxxxxx>
To: Zi Yan <ziy@xxxxxxxxxx>
Cc: yosryahmed@xxxxxxxxxx
Cc: derkling@xxxxxxxxxx
Cc: reijiw@xxxxxxxxxx
Cc: Will Deacon <will@xxxxxxxxxx>
Cc: rientjes@xxxxxxxxxx
Cc: "Kalyazin, Nikita" <kalyazin@xxxxxxxxxxxx>
Cc: patrick.roy@xxxxxxxxx
Cc: "Itazuri, Takahiro" <itazur@xxxxxxxxxxxx>
Cc: Andy Lutomirski <luto@xxxxxxxxxx>
Cc: David Kaplan <david.kaplan@xxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxx>
Signed-off-by: Brendan Jackman <jackmanb@xxxxxxxxxx>
---
Brendan Jackman (19):
x86/mm: split out preallocate_sub_pgd()
x86/mm: Generalize LDT remap into "mm-local region"
x86/tlb: Expose some flush function declarations to modules
x86/mm: introduce the mermap
mm: KUnit tests for the mermap
mm: introduce for_each_free_list()
mm/page_alloc: don't overload migratetype in find_suitable_fallback()
mm: introduce freetype_t
mm: move migratetype definitions to freetype.h
mm: add definitions for allocating unmapped pages
mm: rejig pageblock mask definitions
mm: encode freetype flags in pageblock flags
mm/page_alloc: remove ifdefs from pindex helpers
mm/page_alloc: separate pcplists by freetype flags
mm/page_alloc: rename ALLOC_NON_BLOCK back to _HARDER
mm/page_alloc: introduce ALLOC_NOBLOCK
mm/page_alloc: implement __GFP_UNMAPPED allocations
mm/page_alloc: implement __GFP_UNMAPPED|__GFP_ZERO allocations
mm: Minimal KUnit tests for some new page_alloc logic
Documentation/arch/x86/x86_64/mm.rst | 4 +-
arch/x86/Kconfig | 3 +
arch/x86/include/asm/mermap.h | 23 +
arch/x86/include/asm/mmu_context.h | 71 ++-
arch/x86/include/asm/pgalloc.h | 33 ++
arch/x86/include/asm/pgtable_64_types.h | 19 +-
arch/x86/include/asm/pgtable_types.h | 2 +
arch/x86/include/asm/tlbflush.h | 43 +-
arch/x86/kernel/ldt.c | 137 ++----
arch/x86/mm/init_64.c | 44 +-
arch/x86/mm/pgtable.c | 3 +
include/linux/freetype.h | 147 ++++++
include/linux/gfp.h | 25 +-
include/linux/gfp_types.h | 26 ++
include/linux/mermap.h | 63 +++
include/linux/mermap_types.h | 43 ++
include/linux/mm.h | 13 +
include/linux/mm_types.h | 6 +
include/linux/mmzone.h | 84 ++--
include/linux/pageblock-flags.h | 16 +-
include/trace/events/mmflags.h | 9 +-
kernel/fork.c | 6 +
kernel/panic.c | 2 +
kernel/power/snapshot.c | 8 +-
mm/Kconfig | 41 ++
mm/Makefile | 3 +
mm/compaction.c | 36 +-
mm/init-mm.c | 3 +
mm/internal.h | 43 +-
mm/mermap.c | 323 +++++++++++++
mm/mm_init.c | 11 +-
mm/page_alloc.c | 782 +++++++++++++++++++++++---------
mm/page_isolation.c | 2 +-
mm/page_owner.c | 7 +-
mm/page_reporting.c | 4 +-
mm/pgalloc-track.h | 6 +
mm/show_mem.c | 4 +-
mm/tests/mermap_kunit.c | 231 ++++++++++
mm/tests/page_alloc_kunit.c | 250 ++++++++++
39 files changed, 2099 insertions(+), 477 deletions(-)
---
base-commit: 44982d352c33767cd8d19f8044e7e1161a587ff7
change-id: 20260112-page_alloc-unmapped-944fe5d7b55c
Best regards,
--
Brendan Jackman <jackmanb@xxxxxxxxxx>