Re: [PATCH V3 2/2] arm64/mm: Reject memory removal that splits a kernel leaf mapping
From: Anshuman Khandual
Date: Mon Mar 02 2026 - 23:02:54 EST
On 02/03/26 9:21 PM, David Hildenbrand (Arm) wrote:
> On 2/24/26 07:24, Anshuman Khandual wrote:
>> Linear and vmemmap mappings that get teared down during a memory hot remove
>
> s/teared/torn/ ?
Done!
>
>> operation might contain leaf level entries at any page table level. If the
>> requested memory range's linear or vmemmap mappings fall within such leaf
>> entries, new mappings need to be created for the remaining memory mapped by
>> the leaf entry earlier, following standard break before make aka BBM rules.
>> But the kernel cannot tolerate BBM and hence remapping to fine grained
>> leaves would not be possible on systems without BBML2_NOABORT.
>>
>> Currently the memory hot remove operation does not perform such
>> restructuring, and so removing memory ranges that could split a kernel
>> leaf level mapping needs to be rejected.
>>
>> While memory_hotplug.c does appear to permit hot removing arbitrary ranges
>> of memory, the higher layers that drive memory_hotplug (e.g. ACPI, virtio,
>> ...) all appear to treat memory as fixed size devices. So it is impossible
>> to hot unplug a different amount than was previously hot plugged, and hence
>> we should never see a rejection in practice, but adding the check makes us
>> robust against a future change.
>
> Is this then really a fix that warrants ...
>
> Closes: https://lore.kernel.org/all/aWZYXhrT6D2M-7-N@willie-the-truck/
> Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove")
>
> ... if nothing is broken?
>
> I can understand the desire for this check, but I am confused about Fixes:
Probably nothing is broken right now, but the original memory hot remove
patch should have taken care of this scenario. I don't have strong opinions
either way, though. We could drop both the "Fixes" and "Closes" tags here
if that is preferred.
>
>
>>
>> Cc: Catalin Marinas <catalin.marinas@xxxxxxx>
>> Cc: Will Deacon <will@xxxxxxxxxx>
>> Cc: linux-arm-kernel@xxxxxxxxxxxxxxxxxxx
>> Cc: linux-kernel@xxxxxxxxxxxxxxx
>> Closes: https://lore.kernel.org/all/aWZYXhrT6D2M-7-N@willie-the-truck/
>> Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove")
>> Reviewed-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>> Suggested-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>> Signed-off-by: Anshuman Khandual <anshuman.khandual@xxxxxxx>
>> ---
>> arch/arm64/mm/mmu.c | 155 ++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 149 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index dfb61d218579..ce0b966f42d1 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -2063,6 +2063,142 @@ void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
>> __remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
>> }
>>
>> +
>> +static bool addr_splits_kernel_leaf(unsigned long addr)
>> +{
>> + pgd_t *pgdp, pgd;
>> + p4d_t *p4dp, p4d;
>> + pud_t *pudp, pud;
>> + pmd_t *pmdp, pmd;
>> + pte_t *ptep, pte;
>> +
>> + /*
>> + * PGD level:
>
> I consider all these "XXX level:" comments unhelpful. It's mostly there, in
> the code already: PGDIR_SIZE, P4D_SIZE, PUD_SIZE ... :)
Got it.
>
>> + *
>> + * If addr is PGD_SIZE aligned - already on a leaf boundary
>
> Similarly, repeating that is not particularly helpful.
>
> You could have a single comment at the very top that says
>
> "If the given address points at the start address of a possible leaf, we
> certainly won't split. Otherwise, check if we would actually split a leaf by
> traversing the page tables further."
Will drop all the existing comments for each level and instead add the above
unified comment as suggested.
>
>> + */
>> + if (ALIGN_DOWN(addr, PGDIR_SIZE) == addr)
>
> if (IS_ALIGNED(addr, PGDIR_SIZE))
> return false;
>
> ?
Looks cleaner, will convert all ALIGN_DOWN() level checks into IS_ALIGNED().
>
>> + return false;
>> +
>> + pgdp = pgd_offset_k(addr);
>> + pgd = pgdp_get(pgdp);
>> + if (!pgd_present(pgd))
>> + return false;
>
> How could we end up with non-present areas in a range we hotplugged earlier?
We might not in reality, but this is just an additional safeguard to be sure.
>
>> +
>> + /*
>> + * P4D level:
>> + *
>> + * If addr is P4D_SIZE aligned - already on a leaf boundary
>> + */
>> + if (ALIGN_DOWN(addr, P4D_SIZE) == addr)
>> + return false;
>> +
>> + p4dp = p4d_offset(pgdp, addr);
>> + p4d = p4dp_get(p4dp);
>> + if (!p4d_present(p4d))
>> + return false;
>> +
>> + /*
>> + * PUD level:
>> + *
>> + * If addr is PUD_SIZE aligned - already on a leaf boundary
>> + */
>> + if (ALIGN_DOWN(addr, PUD_SIZE) == addr)
>> + return false;
>> +
>> + pudp = pud_offset(p4dp, addr);
>> + pud = pudp_get(pudp);
>> + if (!pud_present(pud))
>> + return false;
>> +
>> + if (pud_leaf(pud))
>> + return true;
>> +
>> + /*
>> + * CONT_PMD level:
>> + *
>> + * If addr is CONT_PMD_SIZE aligned - already on a leaf boundary
>> + */
>> + if (ALIGN_DOWN(addr, CONT_PMD_SIZE) == addr)
>> + return false;
>> +
>> + pmdp = pmd_offset(pudp, addr);
>> + pmd = pmdp_get(pmdp);
>> + if (!pmd_present(pmd))
>> + return false;
>> +
>> + if (pmd_cont(pmd))
>> + return true;
>> +
>> + /*
>> + * PMD level:
>> + *
>> + * If addr is PMD_SIZE aligned - already on a leaf boundary
>> + */
>> + if (ALIGN_DOWN(addr, PMD_SIZE) == addr)
>> + return false;
>> +
>> + if (pmd_leaf(pmd))
>> + return true;
>> +
>> + /*
>> + * CONT_PTE level:
>> + *
>> + * If addr is CONT_PTE_SIZE aligned - already on a leaf boundary
>> + */
>> + if (ALIGN_DOWN(addr, CONT_PTE_SIZE) == addr)
>> + return false;
>> +
>> + ptep = pte_offset_kernel(pmdp, addr);
>> + pte = __ptep_get(ptep);
>> + if (!pte_present(pte))
>> + return false;
>> +
>> + if (pte_cont(pte))
>> + return true;
>> +
>> + /*
>> + * PTE level:
>> + *
>> + * If addr is PAGE_SIZE aligned - already on a leaf boundary
>> + */
>> + if (ALIGN_DOWN(addr, PAGE_SIZE) == addr)
>> + return false;
>> + return true;
>
> return !IS_ALIGNED(addr, PAGE_SIZE);
>
>> +}
>> +
>> +static bool can_unmap_without_split(unsigned long pfn, unsigned long nr_pages)
>> +{
>> + unsigned long phys_start, phys_end, size, start, end;
>> +
>> + phys_start = PFN_PHYS(pfn);
>> + phys_end = phys_start + nr_pages * PAGE_SIZE;
>> +
>> + /*
>> + * PFN range's linear map edges are leaf entry aligned
>> + */
>> + start = __phys_to_virt(phys_start);
>> + end = __phys_to_virt(phys_end);
>> + if (addr_splits_kernel_leaf(start) || addr_splits_kernel_leaf(end)) {
>> + pr_warn("[%lx %lx] splits a leaf entry in linear map\n",
>> + phys_start, phys_end);
>> + return false;
>> + }
>> +
>> + /*
>> + * PFN range's vmemmap edges are leaf entry aligned
>> + */
>> + size = nr_pages * sizeof(struct page);
>> + start = (unsigned long)pfn_to_page(pfn);
>> + end = start + size;
>
> As arm64 cannot be used with CONFIG_SPARSEMEM (only with
> CONFIG_SPARSEMEM_VMEMMAP, as it sets SPARSEMEM_VMEMMAP_ENABLE),
> I think you can just do
>
> start = (unsigned long)pfn_to_page(pfn);
> end = (unsigned long)pfn_to_page(pfn + nr_pages);
>
> As pfn_to_page() is just a simple calculation that works even if "pfn + nr_pages" does not exist.
Sure, will drop the local 'size' and derive 'end' via pfn_to_page() as suggested above.
>
>> + if (addr_splits_kernel_leaf(start) || addr_splits_kernel_leaf(end)) {
>> + pr_warn("[%lx %lx] splits a leaf entry in vmemmap\n",
>> + phys_start, phys_end);
>> + return false;
>> + }
>> + return true;
>> +}
>> +
>