Re: [REGRESSION] mm/mprotect: 2x+ slowdown for >=400KiB regions since PTE batching (cac1db8c3aad)
From: Dev Jain
Date: Mon Feb 16 2026 - 05:12:44 EST
On 13/02/26 10:56 pm, David Hildenbrand (Arm) wrote:
> On 2/13/26 18:16, Suren Baghdasaryan wrote:
>> On Fri, Feb 13, 2026 at 4:24 PM Pedro Falcato <pfalcato@xxxxxxx> wrote:
>>>
>>> On Fri, Feb 13, 2026 at 04:47:29PM +0100, David Hildenbrand (Arm) wrote:
>>>>
>>>> Hi!
>>>>
>>>>
>>>> Micro-benchmark results are nice. But what is the real world impact?
>>>> IOW, why should we care?
>>>
>>> Well, mprotect is widely used in thread spawning, code JITting,
>>> and even process startup. And we don't want to pay for a feature we can't
>>> even use (on x86).
>>
>> I agree. When I straced Android's zygote a while ago, mprotect() came
>> up #30 in the list of most frequently used syscalls and one of the
>> most used mm-related syscalls due to its use during process creation.
>> However, I don't know how often it's used on VMAs of size >=400KiB.
>
> See my point? :) If this is apparently so widespread then finding a real
> reproducer is likely not a problem. Otherwise it's just speculation.
>
> It would also be interesting to know whether the reproducer ran with any
> sort of mTHP enabled or not.
Yes. Luke, can you experiment with the following microbenchmark:
https://pastebin.com/3hNtYirT
and see whether pte-mapped 2M folios get faster, comparing before and
after the commit?
(set transparent_hugepage/enabled=always, hugepages-2048kB/enabled=always)
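
For readers who only want the shape of such a measurement, here is a
minimal sketch. It is not the pastebin program; the helper names
now_ns() and bench_mprotect(), the region size and the iteration count
are all assumptions made for illustration. It maps an anonymous region,
faults it in, and times repeated mprotect() calls that toggle write
permission.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

/* Hypothetical helper: monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Hypothetical helper: map `len` bytes of anonymous memory, touch every
 * byte so the PTEs are populated, then flip the protection between
 * read-only and read-write `iters` times.
 * Returns the total elapsed nanoseconds, or 0 on failure.
 */
static uint64_t bench_mprotect(size_t len, unsigned int iters)
{
	assert(len > 0 && iters > 0);

	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 0;

	memset(buf, 1, len);	/* fault in the whole range */

	uint64_t start = now_ns();
	for (unsigned int i = 0; i < iters; i++) {
		/* even iterations drop write permission, odd ones restore it */
		int prot = (i & 1) ? (PROT_READ | PROT_WRITE) : PROT_READ;

		if (mprotect(buf, len, prot) != 0) {
			munmap(buf, len);
			return 0;
		}
	}
	uint64_t elapsed = now_ns() - start;

	munmap(buf, len);
	return elapsed;
}
```

Calling bench_mprotect((size_t)4 << 20, 1000) on kernels before and
after the commit, with the THP settings above, gives two totals to
compare. Whether the range is really backed by pte-mapped 2M folios
depends on the system and is not checked by this sketch.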
>
>>
>>>
>>> In any case, I think I see the problem. Namely, that we now need to call
>>> vm_normal_folio() for every single PTE (this seems similar to the mremap
>>> problem caught in 0b5be138ce00f421bd7cc5a226061bd62c4ab850). I'll try to
>>> draft up a patch over the weekend if I can.
>
> I think we excessively discussed that during review and fixups of the
> commit in question. You might want to dig through that because I could
> have sworn we might already have discussed how to optimize this.
I have written a patch to call vm_normal_folio() only when required, and
use pte_batch_hint() instead of vm_normal_folio() + folio_pte_batch().
The results, testing with https://pastebin.com/3hNtYirT on Apple M3:

without-thp (small 4K folio case): patched beats vanilla by 6.89%
  (patched avoids vm_normal_folio() overhead)
64k-thp: no diff
pte-mapped-2M thp: vanilla beats patched by 10.71%
  (vanilla batches over 2M, patched batches over 64K)

Interestingly, I don't see an obvious reason why the last case should
have a win. Batching over 16 ptes or 512 ptes in this code path, AFAIU,
is *not* going to batch over TLB flushes, atomic ops etc (the
tlb_flush_pte_range() in prot_commit_flush_ptes() is an mmu-gather
extension and not a TLB flush). So, the fact that similar operations are
now getting batched should imply better memory access locality, fewer
function calls etc.
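
To make the three cases concrete, here is a toy user-space model of the
two batching strategies. It is not kernel code: toy_walk_folio(),
toy_walk_hint() and the TOY_* constants are made-up names. It assumes
4K base pages, naturally aligned folios and a 16-entry contiguous hint,
and it ignores the cases where the patched code still needs the folio.
It only counts loop steps and folio lookups over 512 ptes.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_PTES	512	/* 2M worth of 4K pages */
#define TOY_HINT_PTES	16	/* assumed hint span: 64K */

struct toy_stats {
	size_t iterations;	/* loop steps taken over the range */
	size_t folio_lookups;	/* stand-in for vm_normal_folio() calls */
};

/* Folio-based model: one lookup per step, batch to the end of the folio. */
static struct toy_stats toy_walk_folio(size_t folio_pages)
{
	struct toy_stats s = { 0, 0 };

	assert(folio_pages > 0 && TOY_NR_PTES % folio_pages == 0);
	for (size_t i = 0; i < TOY_NR_PTES; ) {
		s.folio_lookups++;
		i += folio_pages - (i % folio_pages);
		s.iterations++;
	}
	return s;
}

/* Hint-based model: no lookup, batch at most to the end of a 16-pte block. */
static struct toy_stats toy_walk_hint(size_t folio_pages)
{
	struct toy_stats s = { 0, 0 };

	assert(folio_pages > 0 && TOY_NR_PTES % folio_pages == 0);
	for (size_t i = 0; i < TOY_NR_PTES; ) {
		size_t batch = 1;	/* small folios carry no hint */

		if (folio_pages >= TOY_HINT_PTES)
			batch = TOY_HINT_PTES - (i % TOY_HINT_PTES);
		i += batch;
		s.iterations++;
	}
	return s;
}
```

In this model, 4K folios (folio_pages = 1) take 512 steps either way,
but only the folio-based walk pays 512 lookups. 64K folios take 32 steps
in both. A pte-mapped 2M folio takes 1 step in the folio-based walk and
32 in the hint-based one, which matches the direction of the numbers
above without explaining their size.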
>
> When going from none -> writable we always did a vm_normal_folio() with
> anonymous folios. For the other direction not.
>