Re: [PATCH v2 0/2] mm/mprotect: micro-optimization work

From: Luke Yang

Date: Mon Mar 30 2026 - 16:01:56 EST


Hi Pedro,

Thanks for working on this. I just wanted to share that we've created a
test kernel with your patches and tested on the following CPUs:

--- aarch64 ---
Ampere Altra
Ampere Altra Max

--- x86_64 ---
AMD EPYC 7713
AMD EPYC 7351
AMD EPYC 7542
AMD EPYC 7573X
AMD EPYC 7702
AMD EPYC 9754
Intel Xeon Gold 6126
Intel Xeon Gold 6330
Intel Xeon Gold 6530
Intel Xeon Platinum 8351N
Intel Core i7-6820HQ

--- ppc64le ---
IBM Power 10

On average, we see improvements ranging from a minimum of 5% to a
maximum of 55%, with most runs showing around a 25% speedup in
the libmicro/mprot_tw4m micro-benchmark.

Thanks,
Luke

On Tue, Mar 24, 2026 at 11:44 AM Pedro Falcato <pfalcato@xxxxxxx> wrote:
>
> Micro-optimize the change_protection() functionality and the
> change_pte_range() routine. These functions run in an extremely tight
> loop, and even small inefficiencies become glaring when spun hundreds,
> thousands, or hundreds of thousands of times.
>
> I tried to preserve the batching functionality as much as possible; it
> accounts for some of the slowness, but not all of it. Removing it
> for !arm64 architectures would speed mprotect() up even further, but could
> easily pessimize cases where large folios are mapped (which is not as rare
> as it seems, particularly when it comes to the page cache these days).
>
> The micro-benchmark used for the tests was [0] (buildable with
> google/benchmark and g++ -O2 -lbenchmark repro.cpp)
>
> This resulted in the following (first entry is baseline):
>
> ---------------------------------------------------------
> Benchmark               Time             CPU   Iterations
> ---------------------------------------------------------
> mprotect_bench      85967 ns        85967 ns         6935
> mprotect_bench      73374 ns        73373 ns         9602
>
>
> After the patchset we can observe a 14% speedup in mprotect. Wonderful
> for the elusive mprotect-based workloads!
>
> Testing & more ideas welcome. I suspect there is plenty of improvement possible
> but it would require more time than I have on my hands right now. The
> entire inlined function (which inlines into change_protection()) is gigantic
> - I'm not surprised this is so finicky.
>
> Note: per my profiling, the next _big_ bottleneck here is modify_prot_start_ptes,
> exactly on the xchg() done by x86. ptep_get_and_clear() is _expensive_. I don't think
> there's a properly safe way to go about it since we do depend on the D bit
> quite a lot. This might not be such an issue on other architectures.
>
>
> [0]: https://gist.github.com/heatd/1450d273005aba91fa5744f44dfcd933
> Link: https://lore.kernel.org/all/aY8-XuFZ7zCvXulB@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>
> Cc: Vlastimil Babka <vbabka@xxxxxxxxxx>
> Cc: Jann Horn <jannh@xxxxxxxxxx>
> Cc: David Hildenbrand <david@xxxxxxxxxx>
> Cc: Dev Jain <dev.jain@xxxxxxx>
> Cc: Luke Yang <luyang@xxxxxxxxxx>
> Cc: jhladky@xxxxxxxxxx
> Cc: linux-mm@xxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx
>
> v2:
> - Addressed Sashiko's concerns
> - Picked up Lorenzo's R-b's (thank you!)
> - Squashed patch 1 and 4 into a single one (David)
> - Renamed the softleaf leaf function (David)
> - Dropped controversial noinlines & patch 3 (Lorenzo & David)
>
> v1:
> https://lore.kernel.org/linux-mm/20260319183108.1105090-1-pfalcato@xxxxxxx/
>
> Pedro Falcato (2):
> mm/mprotect: move softleaf code out of the main function
> mm/mprotect: special-case small folios when applying write permissions
>
> mm/mprotect.c | 146 ++++++++++++++++++++++++++++----------------------
> 1 file changed, 81 insertions(+), 65 deletions(-)
>
> --
> 2.53.0
>