Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)

From: Pedro Falcato

Date: Wed Feb 18 2026 - 06:58:28 EST

On Wed, Feb 18, 2026 at 11:46:29AM +0100, David Hildenbrand (Arm) wrote:
> On 2/18/26 11:38, Dev Jain wrote:
> >
> > On 18/02/26 3:36 pm, Pedro Falcato wrote:
> > > On Wed, Feb 18, 2026 at 10:31:19AM +0530, Dev Jain wrote:
> > > > Thanks for working on this. Some comments -
> > > >
> > > > 1. Rejecting batching with pte_batch_hint() means that we also don't batch 16K and 32K large
> > > > folios on arm64, since the cont bit is on starting only at 64K. Not sure how imp this is.
> > > I don't understand what you mean. Is ARM64 doing large folio optimization,
> > > even when there's no special MMU support for it (the aforementioned 16K and
> > > 32K cases)? If so, perhaps it's time for a ARCH_SUPPORTS_PTE_BATCHING flag.
> > > Though if you could provide numbers in that case it would be much appreciated.
> >
> > There are two things at play here:
> >
> > 1. All arches are expected to benefit from pte batching on large folios, because
> > of doing similar operations together in one shot. For code paths except mprotect
> > and mremap, that benefit is far more clear due to:
> >
> > a) batching across atomic operations etc. For example, see copy_present_ptes -> folio_ref_add.
> > Instead of bumping the reference by 1 nr times, we bump it by nr in one shot.
> >
> > b) vm_normal_folio was already being invoked. So, all in all the only new overhead
> > we introduce is of folio_pte_batch(_flags). In fact, since we already have the
> > folio, I recall that we even just special case the large folio case, out from
> > the small folio case. Thus 4K folio processing will have no overhead.
> >
> > 2. Due to the requirements of contpte, ptep_get() on arm64 needs to fetch a/d bits
> > across a cont block. Thus, for each ptep_get, it does 16 pte accesses. To avoid this,
> > it becomes critical to batch on arm64.
> >
> >
> > >
> > > > 2. Did you measure if there is an optimization due to just the first commit ("prefetch the next pte")?
> > > Yes, I could measure a sizeable improvement (perhaps some 5%). I tested on
> > > zen5 (which is a pretty beefy uarch) and the loop is so full of ~~crap~~
> > > features that the prefetcher seems to be doing a poor job, at least per my
> > > results.
> >
> > Nice.
> >
> > >
> > > > I actually had prefetch in mind - is it possible to do some kind of prefetch(pfn_to_page(pte_pfn(pte)))
> > > > to optimize the call to vm_normal_folio()?
> > > Certainly possible, but I suspect it doesn't make too much sense. You want to
> > > avoid bringing in the cacheline if possible. In the pte's case, I know we're
> > > probably going to look at it and modify it, and if I'm wrong it's just one
> > > cacheline we misprefetched (though I had some parallel convos and it might
> > > be that we need a branch there to avoid prefetching out of the PTE table).
> > > We would like to avoid bringing in the folio cacheline at all, even if we
> > > don't stall through some fancy prefetching or sheer CPU magic.
> >
> > I dunno, need other opinions.
>
> Let's repeat my question: what, besides the micro-benchmark in some cases
> with all small-folios, are we trying to optimize here. No hand waving
> (Androids does this or that) please.

I don't understand what you're looking for. An mprotect-based workload? Those
obviously don't really exist, apart from something like a JIT engine cranking
out a lot of mprotect() calls in an aggressive fashion. Or perhaps some of the
mprotect usage that our DB friends sometimes like (discussed in
$OTHER_CONTEXTS), though those generally involve hugepages.

I don't see how this can justify large performance regressions in a system
call, for something every-architecture-not-named-arm64 does not have.
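
For concreteness, here is a minimal userspace sketch of the kind of
micro-benchmark being argued about. It is an illustration only, not the
reproducer from the original report: the helper name time_mprotect_toggle(),
the iteration count and the choice of an anonymous private mapping are all
assumptions. It faults in a region, flips its protection back and forth, and
reports the average cost of one mprotect() call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

/*
 * Hypothetical helper: average nanoseconds per mprotect() call over a
 * region of `len` bytes, or -1.0 if mmap()/mprotect() fails.
 */
static double time_mprotect_toggle(size_t len, int iters)
{
	struct timespec t0, t1;
	char *buf;
	double ns;

	assert(iters > 0);

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1.0;

	/* Touch every page so the PTEs are present before timing. */
	memset(buf, 1, len);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < iters; i++) {
		/* Each iteration walks the whole PTE range twice. */
		if (mprotect(buf, len, PROT_READ) ||
		    mprotect(buf, len, PROT_READ | PROT_WRITE)) {
			munmap(buf, len);
			return -1.0;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	munmap(buf, len);

	ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	return ns / (2.0 * iters);
}

#ifdef MPROTECT_BENCH_MAIN
int main(void)
{
	/* 400KiB is the region size named in the subject line. */
	double ns = time_mprotect_toggle(400 * 1024, 100000);

	if (ns < 0) {
		perror("time_mprotect_toggle");
		return 1;
	}
	printf("%.1f ns per mprotect() call\n", ns);
	return 0;
}
#endif
```

Build with -DMPROTECT_BENCH_MAIN to get a standalone binary. Whether the
region ends up backed by small or large folios depends on the system's
THP/mTHP settings, so any numbers from this are only comparable between
kernels with the same configuration.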

--
Pedro