Re: [v5 PATCH] arm64: mm: show direct mapping use in /proc/meminfo

From: Yang Shi

Date: Tue Jan 27 2026 - 19:50:46 EST

On 1/27/26 12:57 AM, Ryan Roberts wrote:

On 26/01/2026 20:50, Yang Shi wrote:

On 1/26/26 10:58 AM, Will Deacon wrote:

On Mon, Jan 26, 2026 at 09:55:06AM -0800, Yang Shi wrote:

On 1/26/26 6:14 AM, Will Deacon wrote:

On Thu, Jan 22, 2026 at 01:59:54PM -0800, Yang Shi wrote:

On 1/22/26 6:43 AM, Ryan Roberts wrote:

On 21/01/2026 22:44, Yang Shi wrote:

On 1/21/26 9:23 AM, Ryan Roberts wrote:

But it looks like all the higher level users will only ever unplug in the same
granularity that was plugged in (I might be wrong but that's the sense I get).

arm64 adds the constraint that it won't unplug any memory that was present at
boot - see prevent_bootmem_remove_notifier().

So in practice this is probably safe, though perhaps brittle.

Some options:

    - leave it as is and worry about it if/when something shifts and hits the
      problem.

Seems like the simplest way :-)

    - Enhance prevent_bootmem_remove_notifier() to reject unplugging memory
      blocks whose boundaries are within leaf mappings.

I don't quite get why we should enhance prevent_bootmem_remove_notifier().
If I read the code correctly, it simply rejects offlining boot memory.
Offlining a single memory block is fine. If you check the boundaries there,
will it prevent offlining a single memory block?

I think you need to enhance try_remove_memory(). But the kernel may unmap the
linear mapping one memory block at a time if altmap is used. So you would need
an extra page table walk over the start and size of the unplugged DIMM before
removing the memory, to tell whether the boundaries are within leaf mappings or
not, IIUC. Can it be done in arch_remove_memory()? It seems not, because
arch_remove_memory() may be called at memory block granularity if altmap is
used.

    - For non-bbml2_noabort systems, map hotplug memory with a new flag to
      ensure that leaf mappings are always <= memory_block_size_bytes(). For
      bbml2_noabort, split at the block boundaries before doing the unmapping.

The linear mapping will be at most 128M (4K page size), which sounds
suboptimal IMHO.

Given I don't think this can happen in practice, probably the middle option is
the best? There is no runtime impact and it will give us a warning if it ever
does happen in future.

What do you think?

I agree it can't happen in practice, so why not just take option #1 given
the complexity added by option #2?

It still looks broken in the case that a region that was mapped with the
contiguous bit is then unmapped. The sequence seems to iterate over
each contiguous PTE, zapping the entry and doing the TLBI while the
other entries in the contiguous range remain intact. I don't think
that's sufficient to guarantee that you don't have stale TLB entries
once you've finished processing the whole range.

For example, imagine you have an L1 TLB that only supports 4k entries
and an L2 TLB that supports 64k entries. Let's say that the contiguous
range is mapped by pte0 ... pte15 and we've zapped and invalidated
pte0 ... pte14. At that point, I think the hardware is permitted to use
the last remaining contiguous pte (pte15) to allocate a 64k entry in the
L2 TLB covering the whole range. A (speculative) walk via one of the
virtual addresses translated by pte0 ... pte14 could then hit that entry
and fill a 4k entry into the L1 TLB. So at the end of the sequence, you
could presumably still access the first 60k of the range thanks to stale
entries in the L1 TLB?

It is a little hard for me to understand how a (speculative) walk could
happen by the time we reach here.

Before we reach here, IIUC the kernel has:

  * offlined all the page blocks. That means they are freed and isolated from
    the buddy allocator, so even a pfn walk (for example, compaction) should
    not reach them at all.
  * eliminated the vmemmap, so no struct page is available.

From the kernel's point of view, they are unreachable now. Did I miss and/or
misunderstand something?

I'm talking about hardware speculation. It's mapped as normal memory so
the CPU can speculate from it. We can't really reason about the bounds
of that, especially in a world with branch predictors and history-based
prefetchers.

OK. If it could happen, I think the suggestions from you and Ryan should work IIUC:

Clear all the entries in the cont range, then invalidate TLB for the whole range.

I can come up with a patch, or would Ryan like to take it?

Hi,

There are 2 separate issues that have been raised here and I think we are
conflating them a bit...

1: The contiguous range teardown + tlbi issue that Will raised. That is
definitely a problem and needs to be fixed. (though I think prior to the BBML2
dynamic linear block mapping support it would be rare in practice; probably it
would only affect cont-pmd mappings for 16K and 64K base page configs. With
BBML2 dynamic linear block mapping support, this can happen for contiguous
mappings at all levels with all base page sizes).

I roughed out a patch to hoist out the tlbis and issue them as a single range
invalidation after clearing all the pgtable entries. I think this will be MUCH
faster and will solve the contiguous issue too. The one catch is that this only
works for the linear map, and the same helpers are used for the vmemmap. For the
latter we also free the memory, so the tlbis need to happen before the freeing.
But vmemmap doesn't use contiguous mappings, so I've added a warning that checks
for that, and I use a different scheme based on whether we are freeing or not.
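
To make the ordering difference concrete, here is a minimal user-space sketch.
It is not the patch and not kernel code: every name in it (toy_cont_range,
toy_tlbi, and so on) is hypothetical, and the "TLBI" only records whether it was
issued while a sibling entry of the contiguous range was still valid, which is
the window Will describes above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy user-space model; every name below is hypothetical, not a kernel API. */
#define TOY_CONT_PTES 16 /* 16 x 4K entries = one 64K contiguous range */

struct toy_cont_range {
	uint64_t pte[TOY_CONT_PTES]; /* non-zero means the entry is still valid */
	unsigned int tlbi_ops;       /* invalidation operations issued so far */
	bool tlbi_while_live;        /* a TLBI ran while a sibling was still valid */
};

static void toy_init(struct toy_cont_range *r)
{
	assert(r != NULL);
	for (size_t i = 0; i < TOY_CONT_PTES; i++)
		r->pte[i] = 1;
	r->tlbi_ops = 0;
	r->tlbi_while_live = false;
}

static bool toy_any_valid(const struct toy_cont_range *r)
{
	for (size_t i = 0; i < TOY_CONT_PTES; i++)
		if (r->pte[i])
			return true;
	return false;
}

/* Stand-in for a TLB invalidation: it only records when it was issued. */
static void toy_tlbi(struct toy_cont_range *r)
{
	r->tlbi_ops++;
	if (toy_any_valid(r))
		r->tlbi_while_live = true;
}

/* Per-entry scheme: zap one entry, invalidate, move to the next entry. */
static void toy_teardown_per_entry(struct toy_cont_range *r)
{
	for (size_t i = 0; i < TOY_CONT_PTES; i++) {
		r->pte[i] = 0;
		toy_tlbi(r);
	}
}

/* Hoisted scheme: clear every entry first, then invalidate the range once. */
static void toy_teardown_batched(struct toy_cont_range *r)
{
	for (size_t i = 0; i < TOY_CONT_PTES; i++)
		r->pte[i] = 0;
	toy_tlbi(r);
}
```

In this model the per-entry scheme issues 16 invalidations, 15 of them while
pte15 (or an earlier sibling) is still valid, whereas the hoisted scheme issues
one invalidation after the whole range is clear.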

Anshuman has kindly agreed to knock the patch into shape and do the testing.
Hopefully he can post shortly.

2: hot-unplugging a range that starts or terminates in the middle of a large
leaf mapping. The low level hot-unplug implementation allows unplugging any
range of memory as long as it is section size aligned (128M). So theoretically
you could have a 1G PUD leaf mapping and try to unplug 128M from the middle of
it. In practice this doesn't happen because all the users of the hot-unplug code
group memory into devices. If you add a range, you can only remove that same
range. When adding, we will guarantee that the leaf mappings exactly map the
range, so the same guarantee can be given for hot-remove.

BUT, that feels fragile to me. I'd like to add a check in
prevent_bootmem_remove_notifier() to ensure that the proposed unplug range is
exactly covered by leaf mappings, and if it isn't, warn and reject. This will
allow us to fail safe for a tiny amount of overhead (which will be made up for
many, many times over by hoisting the tlbis and batching the barriers in 1.).
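
As a rough illustration of the kind of check meant here, the sketch below
models leaf mappings as a flat array and tests whether either end of a proposed
unplug range falls strictly inside a leaf. It is an assumption-laden toy, not
the proposed notifier change: toy_leaf and toy_unplug_range_ok are hypothetical
names, the real check would walk the page tables, and this version only looks at
the two boundaries (it does not verify that the range is fully mapped).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one leaf mapping: [start, start + size). */
struct toy_leaf {
	uint64_t start;
	uint64_t size;
};

/* True if addr lies strictly inside the leaf, i.e. it would split it. */
static bool toy_leaf_straddles(const struct toy_leaf *l, uint64_t addr)
{
	return addr > l->start && addr < l->start + l->size;
}

/*
 * Accept the unplug range [start, end) only if neither boundary cuts
 * through a leaf mapping. A caller would warn and reject otherwise.
 */
static bool toy_unplug_range_ok(const struct toy_leaf *leaves, size_t n,
				uint64_t start, uint64_t end)
{
	assert(start < end);
	for (size_t i = 0; i < n; i++) {
		if (toy_leaf_straddles(&leaves[i], start) ||
		    toy_leaf_straddles(&leaves[i], end))
			return false;
	}
	return true;
}
```

With a single 1G leaf at 0x40000000, a request to unplug 128M from the middle
(0x48000000 to 0x50000000) is rejected, while a request for the whole 1G is
accepted.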

Anshuman has also kindly agreed to put a patch together for that.

Thanks for the update. Look forward to seeing the patches from Anshuman soon.

Thanks,
Yang

Thanks,
Ryan

Thanks,
Yang

Will