Re: [PATCH RFC 04/19] x86/mm: introduce the mermap
From: Brendan Jackman
Date: Fri Feb 27 2026 - 05:55:50 EST
Relaying some code review from an AI that I wasn't able to run before
sending...
(This isn't the AI's verbatim output; I'm filtering and rephrasing it.)
On Wed Feb 25, 2026 at 4:34 PM UTC, Brendan Jackman wrote:
> +
> +/* Call with migration disabled. */
> +static inline struct mermap_alloc *mermap_alloc(struct mm_struct *mm,
> + unsigned long size, bool use_reserve)
> +{
> + int cpu = raw_smp_processor_id();
> + struct mermap_cpu *mc = this_cpu_ptr(mm->mermap.cpu);
> + unsigned long cpu_end = mermap_cpu_end(cpu);
> + struct mermap_alloc *alloc = NULL;
> +
> + /*
> + * This is an extremely stupid allocator: there can only ever be a small
> + * number of allocations, so everything just works by linear search.
> + *
> + * Allocations are "in order", i.e. if the whole region is free it
> + * allocates from the beginning. If there are any existing allocations
> + * it allocates from right after the last (highest address) one. Any
> + * free space before that goes unused.
> + *
> + * Once an allocation has been freed, the space it occupied must be flushed
> + * from the TLB before it can be reused.
> + *
> + * Visual example of how this is supposed to behave (A for allocated, T for
> + * TLB-flush-pending):
> + *
> + * _______________ Start with everything free.
> + * AaaA___________ Allocate something.
> + * TttT___________ Free it. (Region needs a TLB flush now).
> + * TttTAaaaaaaaA__ Allocate something else.
> + * TttTAaaaaaaaAAA Allocate the remaining space.
> + * TttTTtttttttTAA Free the allocation before last.
> + * ^^^^^^^^^^^^^ This could all be reused now but for simplicity it
> + * isn't. Another allocation at this point will fail.
> + * TttTTtttttttTTT Free the last allocation.
> + * _______________ Next time we allocate, first flush the TLB
> + * AA_____________ Now we're back at the beginning.
> + */
> +
> + if (use_reserve) {
> + if (WARN_ON_ONCE(size != PAGE_SIZE))
> + return NULL;
> + lockdep_assert_preemption_disabled();
> + } else {
> + cpu_end -= PAGE_SIZE;
> + }
> +
> + if (WARN_ON_ONCE(!in_task()))
> + return NULL;
> + guard(preempt)();
> +
> + /* Out of already-available space? */
> + if (mc->next_addr + size > cpu_end) {
> + unsigned long new_next = mermap_cpu_base(cpu);
> +
> + /* Would we have space after a TLB flush? */
> + for (int i = 0; i < ARRAY_SIZE(mc->allocs); i++) {
> + struct mermap_alloc *alloc = &mc->allocs[i];
> +
> + /*
> + * The space between the uppermost allocated alloc->end
> + * (or the base of the CPU's region if there are no
> + * current allocations) and mc->next_addr has been
> + * unmapped in the pagetables, but not flushed from the
> + * TLB. Set new_next to point to the beginning of that
> + * space.
> + */
> + if (READ_ONCE(alloc->in_use))
> + new_next = max(new_next, alloc->end);
> + }
> + if (size > cpu_end - new_next)
> + return NULL;
> +
> + mermap_flush_tlb(cpu, mc);
> + mc->next_addr = new_next;
> + }
> +
> + /* Find an alloc-tracking structure to use */
> + for (int i = 0; i < ARRAY_SIZE(mc->allocs); i++) {
> + if (!READ_ONCE(mc->allocs[i].in_use)) {
> + alloc = &mc->allocs[i];
> + break;
> + }
> + }
> + if (!alloc)
> + return NULL;
Oops, I forgot to account for @use_reserve here. The alloc-tracking
structures should have a reservation like the virtual address space does;
otherwise allocations can fail where they aren't supposed to.
> + alloc->in_use = true;
> + alloc->base = mc->next_addr;
> + alloc->end = alloc->base + size;
> + mc->next_addr += size;
> +
> + return alloc;
> +}
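Since the allocation policy in that comment takes a minute to digest, here's
a standalone userspace model of it (names like region_alloc()/region_free()
are made up for illustration; the reserve handling and per-CPU base offsets
are omitted):

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace model of the mermap allocation policy: a bump allocator over
 * a fixed region. Freed space is left "TLB-flush-pending" and is only
 * reused after the whole region is exhausted, when a (mocked) TLB flush
 * resets the bump pointer below the highest still-live allocation.
 */
#define REGION_SIZE	15
#define MAX_ALLOCS	4

struct model_alloc {
	bool in_use;
	unsigned long base, end;
};

static struct model_alloc allocs[MAX_ALLOCS];
static unsigned long next_addr;		/* bump pointer */
static int tlb_flushes;			/* counts mocked flushes */

static struct model_alloc *region_alloc(unsigned long size)
{
	struct model_alloc *a = NULL;

	/* Out of already-available space? */
	if (next_addr + size > REGION_SIZE) {
		unsigned long new_next = 0;

		/* Would we have space after a TLB flush? */
		for (int i = 0; i < MAX_ALLOCS; i++)
			if (allocs[i].in_use && allocs[i].end > new_next)
				new_next = allocs[i].end;
		if (size > REGION_SIZE - new_next)
			return NULL;
		tlb_flushes++;		/* stands in for mermap_flush_tlb() */
		next_addr = new_next;
	}

	/* Find an alloc-tracking structure to use */
	for (int i = 0; i < MAX_ALLOCS; i++)
		if (!allocs[i].in_use) {
			a = &allocs[i];
			break;
		}
	if (!a)
		return NULL;

	a->in_use = true;
	a->base = next_addr;
	a->end = next_addr + size;
	next_addr += size;
	return a;
}

static void region_free(struct model_alloc *a)
{
	a->in_use = false;	/* space is now TLB-flush-pending */
}
```

Walking the visual example through this model reproduces the states in the
comment, including the allocation that fails while flush-pending space
exists below a live allocation, and the single flush on wraparound.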
> +
> +struct set_pte_ctx {
> + pgprot_t prot;
> + unsigned long next_pfn;
> +};
> +
> +static inline int do_set_pte(pte_t *pte, unsigned long addr, void *data)
> +{
> + struct set_pte_ctx *ctx = data;
> +
> + set_pte(pte, pfn_pte(ctx->next_pfn, ctx->prot));
> + ctx->next_pfn++;
> +
> + return 0;
> +}
> +
> +static struct mermap_alloc *
> +__mermap_get(struct mm_struct *mm, struct page *page,
> + unsigned long size, pgprot_t prot, bool use_reserve)
> +{
> + struct mermap_alloc *alloc = NULL;
> + struct set_pte_ctx ctx;
> + int err;
> +
> + if (size > MERMAP_CPU_REGION_SIZE || WARN_ON_ONCE(!mm || !mm->mermap.cpu))
> + return NULL;
> + if (WARN_ON_ONCE(!arch_mermap_pgprot_allowed(prot)))
> + return NULL;
> +
> + size = PAGE_ALIGN(size);
> +
> + migrate_disable();
> +
> + alloc = mermap_alloc(mm, size, use_reserve);
> + if (!alloc) {
> + migrate_enable();
> + return NULL;
> + }
> +
> + /* This probably wants to be optimised. */
> + ctx.prot = prot;
> + ctx.next_pfn = page_to_pfn(page);
> + err = apply_to_existing_page_range(mm, alloc->base, size, do_set_pte, &ctx);
This takes the PTE lock, and we may have preemption disabled here, so this
may be broken on PREEMPT_RT? Haven't checked. (Maybe I can just test this
by running it with PREEMPT_RT and lockdep enabled?)
If that's indeed broken, it's yet another point to discuss about
requirements for a pagetable library [0]. The lock is not needed at all
here - we need a way to modify pagetables that lets you take advantage
of higher-level synchronization.
[0] https://lore.kernel.org/all/20260219175113.618562-1-jackmanb@xxxxxxxxxx/
> + if (err) {
> + WRITE_ONCE(alloc->in_use, false);
> + return NULL;
Forgot migrate_enable().
(Is there a way to prevent this with the lovely new
__attribute__((cleanup)) magic?)
> + }
> +
> + return alloc;
> +}
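Re the cleanup question above: I think so, at least for the error paths.
Here's a userspace sketch of the pattern with a mocked
migrate_disable()/migrate_enable() pair (not the kernel API; in the kernel
this would be a DEFINE_LOCK_GUARD_0() wrapper, if a guard(migrate) doesn't
exist already):

```c
/*
 * Userspace model of __attribute__((cleanup)): hang the re-enable off
 * scope exit so every return path runs it. Requires GCC or Clang.
 */
static int migrate_disabled;	/* mock of the per-task disable count */

static void mock_migrate_disable(void)
{
	migrate_disabled++;
}

static void mock_migrate_enable(void)
{
	migrate_disabled--;
}

static void migrate_cleanup(int *unused)
{
	(void)unused;
	mock_migrate_enable();
}

/* Rough equivalent of what guard(migrate)() would expand to. */
#define scoped_migrate_disable() \
	__attribute__((cleanup(migrate_cleanup))) \
	int __mg = (mock_migrate_disable(), 0)

static int do_work(int fail)
{
	scoped_migrate_disable();

	if (fail)
		return -1;	/* early return: cleanup still runs */
	return 0;
}
```

The wrinkle for __mermap_get() is that migration must *stay* disabled on
success, so the real version would need the retain-on-success dance from
cleanup.h (no_free_ptr()/return_ptr() style) rather than an unconditional
scope guard.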
> +
> +/*
> + * Allocate a region of virtual memory, and map the page into it. This tries
> + * pretty hard to be fast but doesn't try very hard at all to actually succeed.
> + *
> + * The returned region is physically local to the current mm. It is _logically_
> + * local to the current CPU but this is not enforced by hardware so it can be
> + * exploited to mitigate CPU vulns. This means the caller must not map memory
> + * here that doesn't belong to the current process. The caller must also perform
> + * a full TLB flush of the region before freeing the pages that have been mapped
> + * here.
> + *
> + * This may only be called from process context, and the caller must arrange to
> + * first call mermap_mm_prepare(). (It would be possible to support this in IRQ,
> + * but it seems unlikely there's a valid usecase given the TLB flushing
> + * requirements). If it succeeds, it disables migration until you call
> + * mermap_put().
> + *
> + * This is guaranteed not to allocate.
This one isn't from the AI, but I just realised that's a pretty confusing
thing for an allocator to say. It should say it's guaranteed not to call
the page allocator. (This is important because I want to use it from the
page allocator.)
> + *
> + * Use mermap_addr() to get the actual address of the mapped region.
> + */
> +struct mermap_alloc *mermap_get(struct page *page, unsigned long size, pgprot_t prot)
> +{
> + return __mermap_get(current->mm, page, size, prot, false);
> +}
> +EXPORT_SYMBOL(mermap_get);
> +
> +/*
> + * Allocate a single PAGE_SIZE page via mermap_get(), requiring preemption to be
> + * off until it is freed. This always succeeds.
> + */
> +void *mermap_get_reserved(struct page *page, pgprot_t prot)
Oops, that should return struct mermap_alloc *.
> +/* Clean up mermap stuff on mm teardown. */
> +void mermap_mm_teardown(struct mm_struct *mm)
> +{
> + int cpu;
> +
> + if (!mm->mermap.cpu)
> + return;
> +
> + for_each_possible_cpu(cpu) {
> + struct mermap_cpu *mc = this_cpu_ptr(mm->mermap.cpu);
Oops, this should be per_cpu_ptr(mm->mermap.cpu, cpu).