Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
From: Pedro Falcato
Date: Mon Feb 16 2026 - 09:56:25 EST
On Mon, Feb 16, 2026 at 03:42:08PM +0530, Dev Jain wrote:
>
> On 13/02/26 10:56 pm, David Hildenbrand (Arm) wrote:
> > On 2/13/26 18:16, Suren Baghdasaryan wrote:
> >> On Fri, Feb 13, 2026 at 4:24 PM Pedro Falcato <pfalcato@xxxxxxx> wrote:
> >>>
> >>> On Fri, Feb 13, 2026 at 04:47:29PM +0100, David Hildenbrand (Arm) wrote:
> >>>>
> >>>> Hi!
> >>>>
> >>>>
> >>>> Micro-benchmark results are nice. But what is the real-world impact?
> >>>> IOW, why should we care?
> >>>
> >>> Well, mprotect is widely used in thread spawning, code JITting,
> >>> and even process startup. And we don't want to pay for a feature we can't
> >>> even use (on x86).
> >>
> >> I agree. When I straced Android's zygote a while ago, mprotect() came
> >> up #30 in the list of most frequently used syscalls and one of the
> >> most used mm-related syscalls due to its use during process creation.
> >> However, I don't know how often it's used on VMAs of size >=400KiB.
> >
> > See my point? :) If this is apparently so widespread then finding a real
> > reproducer is likely not a problem. Otherwise it's just speculation.
> >
> > It would also be interesting to know whether the reproducer ran with any
> > sort of mTHP enabled or not.
>
> Yes. Luke, can you experiment with the following microbenchmark:
>
> https://pastebin.com/3hNtYirT
>
> and see if there is an optimization for pte-mapped 2M folios, before and
> after the commit?
>
> (set transparent_hugepage/enabled=always, hugepages-2048kB/enabled=always)
>
>
> >
> >>
> >>>
> >>> In any case, I think I see the problem. Namely, that we now need to call
> >>> vm_normal_folio() for every single PTE (this seems similar to the mremap
> >>> problem caught in 0b5be138ce00f421bd7cc5a226061bd62c4ab850). I'll try to
> >>> draft up a patch over the weekend if I can.
> >
> > I think we excessively discussed that during review and fixups of the
> > commit in question. You might want to dig through that because I could
> > have sworn we might already have discussed how to optimize this.
>
> I have written a patch to call vm_normal_folio only when required, and use
> pte_batch_hint instead of vm_normal_folio + folio_pte_batch. The results,
> testing with https://pastebin.com/3hNtYirT on Apple M3:
>
> without-thp (small 4K folio case): patched beats vanilla by 6.89% (patched
> avoids vm_normal_folio overhead)
>
>
For what it's worth, I tried to avoid vm_normal_page() as much as possible
and realized that the code is extremely timing-sensitive (perhaps because it
sits in a hot loop), so even a small attempt at writing something that
doesn't offend the eyes (and the soul) makes it much slower.

My benchmark was something along these lines:

	int i = 0;

	buf = mmap(400MiB, MAP_POPULATE);
	while (do_benchmark()) {
		if (i & 1)
			mprotect(buf, size, PROT_NONE);
		else
			mprotect(buf, size, PROT_READ | PROT_WRITE);
		i++;
	}

It is probably worth chucking in a few "do not THP" calls, which I totally
forgot about, though somehow that didn't seem to be relevant in my testing.
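
For anyone who wants to reproduce this, a self-contained version of that loop
could look roughly like the sketch below. The function name bench_mprotect()
and its parameters are made up for illustration, and the timing method
(clock_gettime around the whole loop) is an assumption, not necessarily what
the original benchmark did. It also swaps MAP_POPULATE for
madvise(MADV_NOHUGEPAGE) followed by memset(), on the assumption that the
"do not THP" hint has to be in place before the pages are faulted in.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

/*
 * Hypothetical helper: map `size` bytes of anonymous memory, fault it in,
 * then call mprotect() on the whole range `iters` times, alternating
 * between PROT_READ | PROT_WRITE and PROT_NONE as in the loop above.
 *
 * Returns the mean time per mprotect() call in nanoseconds, or -1.0 on
 * invalid arguments or if mmap()/mprotect() fails.
 */
double bench_mprotect(size_t size, int iters)
{
	struct timespec start, end;
	double ns = -1.0;
	char *buf;
	int i;

	if (size == 0 || iters <= 0)
		return -1.0;

	buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1.0;

#ifdef MADV_NOHUGEPAGE
	/* Best-effort "do not THP" hint; the return value is ignored. */
	(void)madvise(buf, size, MADV_NOHUGEPAGE);
#endif
	/* Touch every page so the PTEs exist before timing starts. */
	memset(buf, 1, size);

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iters; i++) {
		int prot = (i & 1) ? PROT_NONE : (PROT_READ | PROT_WRITE);

		if (mprotect(buf, size, prot) != 0)
			goto out;
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	ns = ((double)(end.tv_sec - start.tv_sec) * 1e9 +
	      (double)(end.tv_nsec - start.tv_nsec)) / iters;
out:
	munmap(buf, size);
	return ns;
}
```

Calling bench_mprotect() with a fixed size and iteration count from a small
main() and printing the result on a kernel with and without the commit should
be enough to compare the two; the absolute numbers will of course vary with
the machine and the THP settings.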
--
Pedro