Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)

From: Dev Jain

Date: Mon Feb 16 2026 - 05:12:44 EST

On 13/02/26 10:56 pm, David Hildenbrand (Arm) wrote:
> On 2/13/26 18:16, Suren Baghdasaryan wrote:
>> On Fri, Feb 13, 2026 at 4:24 PM Pedro Falcato <pfalcato@xxxxxxx> wrote:
>>>
>>> On Fri, Feb 13, 2026 at 04:47:29PM +0100, David Hildenbrand (Arm) wrote:
>>>>
>>>> Hi!
>>>>
>>>>
>>>> Micro-benchmark results are nice. But what is the real-world impact?
>>>> IOW, why should we care?
>>>
>>> Well, mprotect is widely used in thread spawning, code JITting,
>>> and even process startup. And we don't want to pay for a feature we can't
>>> even use (on x86).
>>
>> I agree. When I straced Android's zygote a while ago, mprotect() came
>> up #30 in the list of most frequently used syscalls and one of the
>> most used mm-related syscalls due to its use during process creation.
>> However, I don't know how often it's used on VMAs of size >=400KiB.
>
> See my point? :) If this is apparently so widespread then finding a real
> reproducer is likely not a problem. Otherwise it's just speculation.
>
> It would also be interesting to know whether the reproducer ran with any
> sort of mTHP enabled or not.

Yes. Luke, can you experiment with the following microbenchmark:

https://pastebin.com/3hNtYirT

and see if there is an optimization for pte-mapped 2M folios, before and
after the commit?

(set transparent_hugepage/enabled=always, hugepages-2048kB/enabled=always
under /sys/kernel/mm/transparent_hugepage/)
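
For anyone who cannot reach the pastebin, a measurement of this kind could
look roughly like the sketch below. It is not the linked benchmark: the
helper name time_mprotect_toggle(), the choice of toggling between
PROT_READ and PROT_READ|PROT_WRITE, and the use of a plain anonymous
mapping are all assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

/*
 * Hypothetical helper (not from the pastebin): average nanoseconds per
 * mprotect() call when toggling an anonymous region of 'len' bytes
 * between PROT_READ and PROT_READ|PROT_WRITE, 'iters' times each way.
 * Returns a negative value if mmap() or mprotect() fails.
 */
double time_mprotect_toggle(size_t len, int iters)
{
	struct timespec t0, t1;
	char *buf;
	int i;

	assert(len > 0 && iters > 0);

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1.0;

	/* Touch every page so the PTEs are populated before timing. */
	memset(buf, 1, len);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++) {
		if (mprotect(buf, len, PROT_READ) ||
		    mprotect(buf, len, PROT_READ | PROT_WRITE)) {
			munmap(buf, len);
			return -1.0;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	munmap(buf, len);

	/* Two mprotect() calls per iteration. */
	return ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / (2.0 * iters);
}
```

Calling time_mprotect_toggle(400 * 1024, 10000) and
time_mprotect_toggle(2 * 1024 * 1024, 10000) on kernels before and after
the commit, with the THP settings above, would give comparable per-call
numbers for a small and a 2M-sized region.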

>
>>
>>>
>>> In any case, I think I see the problem. Namely, that we now need to call
>>> vm_normal_folio() for every single PTE (this seems similar to the mremap
>>> problem caught in 0b5be138ce00f421bd7cc5a226061bd62c4ab850). I'll try to
>>> draft up a patch over the weekend if I can.
>
> I think we excessively discussed that during review and fixups of the
> commit in question. You might want to dig through that because I could
> have sworn we might already have discussed how to optimize this.

I have written a patch to call vm_normal_folio only when required, and use
pte_batch_hint instead of vm_normal_folio + folio_pte_batch. The results,
testing with https://pastebin.com/3hNtYirT on Apple M3:

without-thp (small 4K folio case): patched beats vanilla by 6.89% (patched
avoids vm_normal_folio overhead)

64k-thp: no diff

pte-mapped-2M thp: vanilla beats patched by 10.71% (vanilla batches over
2M, patched batches over 64K)

Interestingly, I don't see an obvious reason why the last case should have
a win.

Batching over 16 ptes or 512 ptes in this code path, AFAIU is *not* going
to batch over TLB flushes, atomic ops etc (the tlb_flush_pte_range in
prot_commit_flush_ptes is an mmu-gather extension and not a TLB flush). So,
the fact that similar operations are now getting batched should imply
better memory access locality, fewer function calls etc.
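
To make the three cases concrete, here is a toy userspace model of the two
batching strategies. It is only a sketch, not kernel code: struct
walk_stats and both walk_*() helpers are made up for illustration, the
"patched" model assumes no folio lookup is ever needed (the real patch
still does one when required), and the hint value of 16 assumes 64K
contiguous ranges of 4K pages.

```c
#include <assert.h>
#include <stddef.h>

/* Counters for one walk over a range of PTEs (illustrative only). */
struct walk_stats {
	size_t batches;		/* iterations of the PTE loop */
	size_t folio_lookups;	/* modelled vm_normal_folio() calls */
};

/*
 * Model of vanilla: one folio lookup per batch, then batch over the
 * whole folio ('folio_ptes' PTEs per folio).
 */
struct walk_stats walk_folio_batched(size_t nr_ptes, size_t folio_ptes)
{
	struct walk_stats s = { 0, 0 };
	size_t i = 0;

	assert(folio_ptes > 0);
	while (i < nr_ptes) {
		size_t left = nr_ptes - i;

		s.folio_lookups++;
		s.batches++;
		i += folio_ptes < left ? folio_ptes : left;
	}
	return s;
}

/*
 * Model of patched: batch by a per-PTE hint ('hint' PTEs at a time)
 * without looking up the folio.
 */
struct walk_stats walk_hint_batched(size_t nr_ptes, size_t hint)
{
	struct walk_stats s = { 0, 0 };
	size_t i = 0;

	assert(hint > 0);
	while (i < nr_ptes) {
		size_t left = nr_ptes - i;

		s.batches++;
		i += hint < left ? hint : left;
	}
	return s;
}
```

For a 2M range (512 ptes) the model gives: small 4K folios, 512 batches
either way but 512 vs 0 lookups; 64K folios, 32 batches either way; a
pte-mapped 2M folio, 1 batch for vanilla against 32 for patched. So in the
last case the only thing the model says vanilla saves is loop iterations,
which is the "fewer function calls" effect mentioned above.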

>
> When going from none -> writable we always did a vm_normal_folio() with
> anonymous folios. For the other direction not.
>