Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)

From: David Hildenbrand (Arm)

Date: Wed Feb 18 2026 - 07:24:41 EST


On 2/18/26 12:58, Pedro Falcato wrote:
On Wed, Feb 18, 2026 at 11:46:29AM +0100, David Hildenbrand (Arm) wrote:
On 2/18/26 11:38, Dev Jain wrote:


There are two things at play here:

1. All arches are expected to benefit from PTE batching on large folios, because
similar operations are done together in one shot. For code paths other than mprotect
and mremap, that benefit is far clearer, due to:

a) batching across atomic operations etc. For example, see copy_present_ptes -> folio_ref_add:
instead of bumping the refcount by 1, nr times, we bump it by nr in one shot.

b) vm_normal_folio() was already being invoked, so all in all the only new overhead
we introduce is that of folio_pte_batch(_flags). In fact, since we already have the
folio, I recall that we even special-case the large folio path separately from
the small folio path, so 4K folio processing has no overhead.

2. Due to the requirements of contpte, ptep_get() on arm64 needs to fetch the accessed/dirty
(a/d) bits across a whole cont block; thus each ptep_get() does 16 PTE accesses. To avoid
this, batching becomes critical on arm64.



Nice.


I dunno, need other opinions.

Let's repeat my question: what, besides the micro-benchmark in some cases
with all small folios, are we trying to optimize here? No hand waving
(Android does this or that), please.

I don't understand what you're looking for. An mprotect()-based workload? Those
obviously don't really exist, apart from something like a JIT engine cranking
out a lot of mprotect() calls in an aggressive fashion. Or perhaps some of the
mprotect() usage that our DB friends like to use sometimes (discussed in
$OTHER_CONTEXTS), though those generally involve hugepages.


Anything besides a homemade micro-benchmark that highlights why we should care about this exact fast and repeated sequence of events?

I'm surprised that such a "large regression" does not show up in any other non-homemade benchmark that people/bots are running. That's really what I am questioning.

That said, I'm all for optimizing it if there is a real problem there.

I don't see how this can justify large performance regressions in a system
call, for a problem that every architecture not named arm64 does not have.
Take a look at the reported performance improvements on AMD with large folios.

The issue really is that small folios don't perform well, on any architecture. But to detect large vs. small folios we need the ... folio.

So once we optimize for small folios (== don't try to detect large folios) we'll degrade large folios.


For fork() and unmap() we were able to avoid most of the performance regressions for small folios by special-casing the implementation on two variants: nr_pages == 1 (incl. small folios) vs. nr_pages != 1 (large folios).

We cannot avoid the vm_normal_folio(). Maybe the function-call overhead could be avoided by providing an inlined variant -- if that is the real problem.

But likely it's also just access to the folio when we really don't need it in some cases.

--
Cheers,

David