Re: [PATCH V3 2/2] arm64/mm: Reject memory removal that splits a kernel leaf mapping

From: David Hildenbrand (Arm)

Date: Mon Mar 02 2026 - 10:55:16 EST


On 2/24/26 07:24, Anshuman Khandual wrote:
> Linear and vmemmap mappings that get teared down during a memory hot remove

s/teared/torn/ ?

> operation might contain leaf level entries on any page table level. If the
> requested memory range's linear or vmemmap mappings falls within such leaf
> entries, new mappings need to be created for the remaining memory mapped on
> the leaf entry earlier, following standard break before make aka BBM rules.
> But kernel cannot tolerate BBM and hence remapping to fine grained leaves
> would not be possible on systems without BBML2_NOABORT.
>
> Currently memory hot remove operation does not perform such restructuring,
> and so removing memory ranges that could split a kernel leaf level mapping
> need to be rejected.
>
> While memory_hotplug.c does appear to permit hot removing arbitrary ranges
> of memory, the higher layers that drive memory_hotplug (e.g. ACPI, virtio,
> ...) all appear to treat memory as fixed size devices. So it is impossible
> to hot unplug a different amount than was previously hot plugged, and hence
> we should never see a rejection in practice, but adding the check makes us
> robust against a future change.

Is this then really a fix that warrants:

Closes: https://lore.kernel.org/all/aWZYXhrT6D2M-7-N@willie-the-truck/
Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove")

... if nothing is broken?

I can understand the desire for this check, but I am confused about Fixes:


>
> Cc: Catalin Marinas <catalin.marinas@xxxxxxx>
> Cc: Will Deacon <will@xxxxxxxxxx>
> Cc: linux-arm-kernel@xxxxxxxxxxxxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> Closes: https://lore.kernel.org/all/aWZYXhrT6D2M-7-N@willie-the-truck/
> Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove")
> Reviewed-by: Ryan Roberts <ryan.roberts@xxxxxxx>
> Suggested-by: Ryan Roberts <ryan.roberts@xxxxxxx>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@xxxxxxx>
> ---
> arch/arm64/mm/mmu.c | 155 ++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 149 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index dfb61d218579..ce0b966f42d1 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -2063,6 +2063,142 @@ void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
> __remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
> }
>
> +
> +static bool addr_splits_kernel_leaf(unsigned long addr)
> +{
> + pgd_t *pgdp, pgd;
> + p4d_t *p4dp, p4d;
> + pud_t *pudp, pud;
> + pmd_t *pmdp, pmd;
> + pte_t *ptep, pte;
> +
> + /*
> + * PGD level:

I consider all these "XXX level:" comments unhelpful. It's mostly there in
the code already: PGDIR_SIZE, P4D_SIZE, PUD_SIZE ... :)

> + *
> + * If addr is PGD_SIZE aligned - already on a leaf boundary

Similarly, repeating that is not particularly helpful.

You could have a single comment at the very top that says

"If the given address points at the start address of a possible leaf, we
certainly won't split. Otherwise, check whether we would actually split a
leaf by traversing the page tables further."

> + */
> + if (ALIGN_DOWN(addr, PGDIR_SIZE) == addr)

if (IS_ALIGNED(addr, PGDIR_SIZE))
return false;

?

> + return false;
> +
> + pgdp = pgd_offset_k(addr);
> + pgd = pgdp_get(pgdp);
> + if (!pgd_present(pgd))
> + return false;

How could we end up with non-present areas in a range we hotplugged earlier?

> +
> + /*
> + * P4D level:
> + *
> + * If addr is P4D_SIZE aligned - already on a leaf boundary
> + */
> + if (ALIGN_DOWN(addr, P4D_SIZE) == addr)
> + return false;
> +
> + p4dp = p4d_offset(pgdp, addr);
> + p4d = p4dp_get(p4dp);
> + if (!p4d_present(p4d))
> + return false;
> +
> + /*
> + * PUD level:
> + *
> + * If addr is PUD_SIZE aligned - already on a leaf boundary
> + */
> + if (ALIGN_DOWN(addr, PUD_SIZE) == addr)
> + return false;
> +
> + pudp = pud_offset(p4dp, addr);
> + pud = pudp_get(pudp);
> + if (!pud_present(pud))
> + return false;
> +
> + if (pud_leaf(pud))
> + return true;
> +
> + /*
> + * CONT_PMD level:
> + *
> + * If addr is CONT_PMD_SIZE aligned - already on a leaf boundary
> + */
> + if (ALIGN_DOWN(addr, CONT_PMD_SIZE) == addr)
> + return false;
> +
> + pmdp = pmd_offset(pudp, addr);
> + pmd = pmdp_get(pmdp);
> + if (!pmd_present(pmd))
> + return false;
> +
> + if (pmd_cont(pmd))
> + return true;
> +
> + /*
> + * PMD level:
> + *
> + * If addr is PMD_SIZE aligned - already on a leaf boundary
> + */
> + if (ALIGN_DOWN(addr, PMD_SIZE) == addr)
> + return false;
> +
> + if (pmd_leaf(pmd))
> + return true;
> +
> + /*
> + * CONT_PTE level:
> + *
> + * If addr is CONT_PTE_SIZE aligned - already on a leaf boundary
> + */
> + if (ALIGN_DOWN(addr, CONT_PTE_SIZE) == addr)
> + return false;
> +
> + ptep = pte_offset_kernel(pmdp, addr);
> + pte = __ptep_get(ptep);
> + if (!pte_present(pte))
> + return false;
> +
> + if (pte_cont(pte))
> + return true;
> +
> + /*
> + * PTE level:
> + *
> + * If addr is PAGE_SIZE aligned - already on a leaf boundary
> + */
> + if (ALIGN_DOWN(addr, PAGE_SIZE) == addr)
> + return false;
> + return true;

return !IS_ALIGNED(addr, PAGE_SIZE);

> +}
> +
> +static bool can_unmap_without_split(unsigned long pfn, unsigned long nr_pages)
> +{
> + unsigned long phys_start, phys_end, size, start, end;
> +
> + phys_start = PFN_PHYS(pfn);
> + phys_end = phys_start + nr_pages * PAGE_SIZE;
> +
> + /*
> + * PFN range's linear map edges are leaf entry aligned
> + */
> + start = __phys_to_virt(phys_start);
> + end = __phys_to_virt(phys_end);
> + if (addr_splits_kernel_leaf(start) || addr_splits_kernel_leaf(end)) {
> + pr_warn("[%lx %lx] splits a leaf entry in linear map\n",
> + phys_start, phys_end);
> + return false;
> + }
> +
> + /*
> + * PFN range's vmemmap edges are leaf entry aligned
> + */
> + size = nr_pages * sizeof(struct page);
> + start = (unsigned long)pfn_to_page(pfn);
> + end = start + size;

As arm64 cannot be used with plain CONFIG_SPARSEMEM (only with
CONFIG_SPARSEMEM_VMEMMAP, as it selects SPARSEMEM_VMEMMAP_ENABLE),
I think you can just do

start = (unsigned long)pfn_to_page(pfn);
end = (unsigned long)pfn_to_page(pfn + nr_pages);

As pfn_to_page() is just a simple calculation, it works even if "pfn+nr_pages" does not exist.

> + if (addr_splits_kernel_leaf(start) || addr_splits_kernel_leaf(end)) {
> + pr_warn("[%lx %lx] splits a leaf entry in vmemmap\n",
> + phys_start, phys_end);
> + return false;
> + }
> + return true;
> +}
> +

--
Cheers,

David