Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)

From: David Hildenbrand (Arm)

Date: Wed Feb 18 2026 - 05:46:39 EST


On 2/18/26 11:38, Dev Jain wrote:

On 18/02/26 3:36 pm, Pedro Falcato wrote:
On Wed, Feb 18, 2026 at 10:31:19AM +0530, Dev Jain wrote:
Thanks for working on this. Some comments -

1. Rejecting batching with pte_batch_hint() means that we also don't batch 16K and 32K large
folios on arm64, since the contiguous bit is only used starting at 64K. Not sure how important this is.
I don't understand what you mean. Is arm64 doing large folio optimization,
even when there's no special MMU support for it (the aforementioned 16K and
32K cases)? If so, perhaps it's time for an ARCH_SUPPORTS_PTE_BATCHING flag.
Though if you could provide numbers for that case, it would be much appreciated.

There are two things at play here:

1. All arches are expected to benefit from PTE batching on large folios, because
similar operations get done together in one shot. For code paths other than mprotect
and mremap, that benefit is far clearer due to:

a) batching across atomic operations etc. For example, see copy_present_ptes -> folio_ref_add:
instead of bumping the reference count by 1 nr times, we bump it by nr in one shot.

b) vm_normal_folio was already being invoked. So, all in all, the only new overhead
we introduce is that of folio_pte_batch(_flags). In fact, since we already have the
folio, I recall that we even special-case the large folio path separately from
the small folio path. Thus 4K folio processing has no overhead.

2. Due to the requirements of contpte, ptep_get() on arm64 needs to gather the
access/dirty bits across a whole cont block. Thus, each ptep_get() performs 16 PTE
accesses. To avoid this, it becomes critical to batch on arm64.



2. Did you measure whether there is an improvement due to just the first commit ("prefetch the next pte")?
Yes, I could measure a sizeable improvement (perhaps some 5%). I tested on
Zen 5 (which is a pretty beefy uarch), and the loop is so full of ~~crap~~
features that the hardware prefetcher seems to be doing a poor job, at least
per my results.

Nice.


I actually had prefetch in mind - is it possible to do some kind of prefetch(pfn_to_page(pte_pfn(pte)))
to optimize the call to vm_normal_folio()?
Certainly possible, but I suspect it doesn't make too much sense. You want to
avoid bringing in the cacheline if possible. In the PTE's case, I know we're
probably going to look at it and modify it, and if I'm wrong it's just one
cacheline we mis-prefetched (though I had some parallel conversations, and it
might be that we need a branch there to avoid prefetching past the end of the
PTE table). We would like to avoid bringing in the folio cacheline at all,
rather than merely hiding the stall through some fancy prefetching or sheer
CPU magic.

I dunno, need other opinions.

Let me repeat my question: what, besides the micro-benchmark in some cases with all small folios, are we trying to optimize here? No hand-waving ("Android does this or that"), please.

--
Cheers,

David