Re: [PATCH v4 0/4] Optimize mprotect() for large folios

From: Dev Jain
Date: Mon Jun 30 2025 - 07:26:00 EST



On 30/06/25 4:47 pm, Lorenzo Stoakes wrote:
> On Sat, Jun 28, 2025 at 05:04:31PM +0530, Dev Jain wrote:
> > This patchset optimizes the mprotect() system call for large folios
> > by PTE-batching. No issues were observed with mm-selftests, build
> > tested on x86_64.
> Should also be tested on x86-64 not only build tested :)

> You are still not really giving details here, so same comment as your mremap()
> series, please explain why you're doing this, what for, what benefits you expect
> to achieve, where etc.
>
> E.g. 'this is designed to optimise mTHP cases on arm64, we expect to see
> benefits on amd64 also and for intel there should be no impact'.

Okay.


> It's probably also worth actually going and checking to make sure that this is
> the case re: other arches. See below on that...

> > We use the following test cases to measure performance, mprotect()'ing
> > the mapped memory to read-only then read-write 40 times:
> >
> > Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
> > pte-mapping those THPs
> > Test case 2: Mapping 1G of memory with 64K mTHPs
> > Test case 3: Mapping 1G of memory with 4K pages
> >
> > Average execution time on arm64, Apple M3:
> > Before the patchset:
> > T1: 7.9 seconds  T2: 7.9 seconds  T3: 4.2 seconds
> >
> > After the patchset:
> > T1: 2.1 seconds  T2: 2.2 seconds  T3: 4.3 seconds
> >
> > Comparing T1/T2 with T3 before the patchset shows the regression
> > introduced by ptep_get() on a contpte block, which this patchset also
> > removes. For large folios we get an almost 74% performance improvement,
> > the trade-off being a slight degradation in the small folio case.
> This is nice, though order-0 is probably going to be your bread and butter no?
>
> Having said that, mprotect() is not a hot path, this delta is small enough to
> quite possibly just be noise, and personally I'm not all that bothered.

It is only the vm_normal_folio() + folio_test_large() overhead. Trying to avoid
this with the horrible maybe_contiguous_pte_pfns() I introduced somewhere else
is not worth it : )
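
Concretely, that check amounts to something like the sketch below. This is
illustrative only, not the hunk from this series: the helper name is invented
here, and the real folio_pte_batch() in mm/internal.h takes more arguments
than shown.

/*
 * Illustrative sketch (hypothetical helper name): small folios bail out
 * after the vm_normal_folio() + folio_test_large() lookup mentioned
 * above; only large folios go on to compute a batch length.
 */
static int prot_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
                                pte_t *ptep, pte_t pte, int max_nr)
{
        struct folio *folio;

        if (max_nr == 1)
                return 1;

        folio = vm_normal_folio(vma, addr, pte);
        if (!folio || !folio_test_large(folio))
                return 1;

        /* How many consecutive PTEs map consecutive pages of this folio
         * (arguments simplified relative to the real API). */
        return folio_pte_batch(folio, ptep, pte, max_nr);
}

In the change_pte_range() loop, the returned count would then be used to
advance the address and PTE pointer by that many pages and to change the
protection of the whole batch at once, which is presumably where the contpte
benefit on arm64 comes from.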


> But let's run this same test on x86-64 too please and get some before/after
> numbers just to confirm no major impact.
>
> Thanks for including code.

> > Here is the test program:
> >
> > #define _GNU_SOURCE
> > #include <sys/mman.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <stdio.h>
> > #include <unistd.h>
> >
> > #define SIZE (1024*1024*1024)
> >
> > unsigned long pmdsize = (1UL << 21);
> > unsigned long pagesize = (1UL << 12);
> >
> > static void pte_map_thps(char *mem, size_t size)
> > {
> >         size_t offs;
> >         int ret = 0;
> >
> >         /* PTE-map each THP by temporarily splitting the VMAs. */
> >         for (offs = 0; offs < size; offs += pmdsize) {
> >                 ret |= madvise(mem + offs, pagesize, MADV_DONTFORK);
> >                 ret |= madvise(mem + offs, pagesize, MADV_DOFORK);
> >         }
> >
> >         if (ret) {
> >                 fprintf(stderr, "ERROR: madvise() failed\n");
> >                 exit(1);
> >         }
> > }
> >
> > int main(int argc, char *argv[])
> > {
> >         char *p;
> >
> >         p = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
> >                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >         if (p != (void *)(1UL << 30)) {
> >                 perror("mmap");
> >                 return 1;
> >         }
> >
> >         memset(p, 0, SIZE);
> >         if (madvise(p, SIZE, MADV_NOHUGEPAGE))
> >                 perror("madvise");
> >         explicit_bzero(p, SIZE);
> >         pte_map_thps(p, SIZE);
> >
> >         for (int loops = 0; loops < 40; loops++) {
> >                 if (mprotect(p, SIZE, PROT_READ))
> >                         perror("mprotect"), exit(1);
> >                 if (mprotect(p, SIZE, PROT_READ|PROT_WRITE))
> >                         perror("mprotect"), exit(1);
> >                 explicit_bzero(p, SIZE);
> >         }
> > }

> > ---
> > The patchset is rebased onto Saturday's mm-new.
> >
> > v3->v4:
> >  - Refactor skipping logic into a new function, edit patch 1 subject
> >    to highlight it is only for MM_CP_PROT_NUMA case (David H)
> >  - Refactor the optimization logic, add more documentation to the generic
> >    batched functions, do not add clear_flush_ptes, squash patch 4
> >    and 5 (Ryan)
> >
> > v2->v3:
> >  - Add comments for the new APIs (Ryan, Lorenzo)
> >  - Instead of refactoring, use a "skip_batch" label
> >  - Move arm64 patches at the end (Ryan)
> >  - In can_change_pte_writable(), check AnonExclusive page-by-page (David H)
> >  - Resolve implicit declaration; tested build on x86 (Lance Yang)
> >
> > v1->v2:
> >  - Rebase onto mm-unstable (6ebffe676fcf: util_macros.h: make the header more resilient)
> >  - Abridge the anon-exclusive condition (Lance Yang)
> >
> > Dev Jain (4):
> >   mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
> >   mm: Add batched versions of ptep_modify_prot_start/commit
> >   mm: Optimize mprotect() by PTE-batching
> >   arm64: Add batched versions of ptep_modify_prot_start/commit
> >
> >  arch/arm64/include/asm/pgtable.h |  10 ++
> >  arch/arm64/mm/mmu.c              |  28 +++-
> >  include/linux/pgtable.h          |  83 +++++++++-
> >  mm/mprotect.c                    | 269 +++++++++++++++++++++++--------
> >  4 files changed, 315 insertions(+), 75 deletions(-)
> >
> > --
> > 2.30.2
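
As background for patches 2 and 4 in the list above: the existing generic
ptep_modify_prot_start()/ptep_modify_prot_commit() helpers work on a single
PTE at a time, roughly as sketched below (simplified from the generic
fallbacks in include/linux/pgtable.h; arm64 provides its own versions, which
is what the batched variants build on).

/*
 * Simplified view of the single-PTE transaction that the series extends
 * with batched variants: "start" reads and clears the PTE, the caller
 * computes the new value, and "commit" writes it back.
 */
static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
                                           unsigned long addr, pte_t *ptep)
{
        return ptep_get_and_clear(vma->vm_mm, addr, ptep);
}

static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
                                           unsigned long addr, pte_t *ptep,
                                           pte_t old_pte, pte_t new_pte)
{
        set_pte_at(vma->vm_mm, addr, ptep, new_pte);
}

The batched variants presumably take an additional PTE count so that all PTEs
mapping a large folio can be cleared and written back together, rather than
one PTE at a time, which is what lets arm64 handle contpte mappings
efficiently.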