Re: Re: [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x longer on v6.19 than v6.12

From: Chengfeng Lin

Date: Tue May 26 2026 - 04:01:17 EST

Hi Pedro,

Thanks. I prepared a smaller standalone reproducer for the shared-dirty case:

https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/aec9695/mprotect-shared-dirty-toggle/reproducer

It is distilled from the `shared_dirty_full_toggle_64m` scenario in the
generated workload I used for the earlier QEMU/lab runs. It keeps only the
core operation:

- MAP_SHARED | MAP_ANONYMOUS mapping
- write-prefault the whole range
- full-range mprotect(PROT_READ)
- restore with mprotect(PROT_READ | PROT_WRITE)
- write-touch after each protection cycle

The core loop is essentially:

    p = mmap(..., MAP_SHARED | MAP_ANONYMOUS, ...);
    write_touch(p, len);
    for (...) {
        mprotect(p, len, PROT_READ);
        mprotect(p, len, PROT_READ | PROT_WRITE);
        write_touch(p, len);
    }
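
For readers who want to try the shape without fetching the repository, the loop above can be sketched as a self-contained function. This is a minimal sketch, not the reproducer source: the name `toggle_shared_dirty`, the one-byte-per-page touch pattern, and the use of CLOCK_MONOTONIC are assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Dirty every base page in the range with one byte store per page. */
static void write_touch(volatile unsigned char *p, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    for (size_t off = 0; off < len; off += page)
        p[off]++;
}

static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Hypothetical helper (not from the reproducer): run `iterations`
 * protect/restore/touch cycles on a shared anonymous mapping of `len`
 * bytes and store the average wall-clock nanoseconds per base page per
 * iteration in *ns_per_page. Returns 0 on success, -1 on failure.
 */
int toggle_shared_dirty(size_t len, int iterations, double *ns_per_page)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p;
    uint64_t start, elapsed;

    if (len < page || iterations <= 0)
        return -1;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    write_touch(p, len);                 /* write-prefault the range */

    start = now_ns();
    for (int i = 0; i < iterations; i++) {
        if (mprotect(p, len, PROT_READ) ||
            mprotect(p, len, PROT_READ | PROT_WRITE)) {
            munmap(p, len);
            return -1;
        }
        write_touch(p, len);             /* post-touch after each cycle */
    }
    elapsed = now_ns() - start;

    munmap(p, len);
    *ns_per_page = (double)elapsed /
                   ((double)iterations * (double)(len / page));
    return 0;
}
```

Calling `toggle_shared_dirty(64u << 20, 200, &v)` would approximate the 64 MiB, 200-iteration shape, without the warmup and per-phase timing that the real reproducer has. The build and run commands that follow apply to the reproducer file in the repository, not to this sketch.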

Build/run:

    gcc -O2 -Wall -Wextra -o mprotect_shared_dirty_reproducer \
        mprotect_shared_dirty_reproducer.c

    ./mprotect_shared_dirty_reproducer \
        shared_dirty_full_toggle_64m 5 \
        --mapping-mb 64 \
        --iterations 200 \
        --warmup 5

The main metric is `iteration_ns_per_page` (lower is better): wall-clock
nanoseconds per base page for one full protect/restore/post-touch
iteration. The program also prints `protect_ns_per_page` and
`restore_ns_per_page` separately.
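
To make the unit concrete, the per-page normalization can be sketched as below. The helper name `ns_per_page` is hypothetical and is not taken from the reproducer source; it only illustrates dividing one iteration's elapsed time by the number of base pages in the mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: normalize the wall-clock time of one full
 * iteration to nanoseconds per base page.
 */
double ns_per_page(uint64_t elapsed_ns, size_t mapping_bytes, size_t page_bytes)
{
    size_t pages;

    assert(page_bytes > 0 && mapping_bytes >= page_bytes);
    pages = mapping_bytes / page_bytes;
    return (double)elapsed_ns / (double)pages;
}
```

With the 64 MiB mapping and 4 KiB base pages used here, one iteration covers 16384 pages, so a value of 548.6 ns per page corresponds to roughly 9.0 ms per full iteration.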

I rebuilt the QEMU direct-boot kernels with an SMP-capable config and reran the
standalone reproducer on the lab machine:

    kernels: v6.12.77, v6.19.9, akpm/mm mm-unstable 444fc9435e57
    kernel config additions: CONFIG_SMP=y, CONFIG_NR_CPUS=16,
        CONFIG_ACPI=y, CONFIG_ACPI_PROCESSOR=y
    QEMU_SMP: 1/2/4/8/16
    guest memory: 14336 MiB for 1/2/4 CPU, 16384 MiB for 8 CPU,
        32768 MiB for 16 CPU
    repetitions: 5
    order: interleaved
    coverage: disabled
    extra cmdline: tsc=unstable clocksource=refined-jiffies

I also checked the serial logs. The 1/2/4/8 CPU rows each had 15 serial logs
checked. The 16 CPU full-matrix row had one v6.12.77 QEMU failure, but a
targeted 16 CPU rerun completed cleanly with 15/15 serial logs checked. All
checked logs matched the requested guest CPU count, and none had `noapic` in
the guest cmdline.

`iteration_ns_per_page` results:

    CPU   v6.12.77   v6.19.9   mm-unstable   mm-unstable vs v6.19   gap closed
      1      296.4     548.6         498.6            9.1% faster        19.8%
      2      327.2     564.8         488.4           13.5% faster        32.2%
      4      319.8     578.2         505.8           12.5% faster        28.0%
      8      336.4     570.4         508.2           10.9% faster        26.6%
     16      380.0     624.0         553.8           11.3% faster        28.8%
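
The two derived columns can be recomputed from the three raw columns. The sketch below assumes the definitions that the numbers in the table are consistent with: "faster" is relative to v6.19.9, and "gap closed" is relative to the v6.12.77 -> v6.19.9 gap. The function names are hypothetical.

```c
#include <assert.h>

/* Percent by which `after` is faster than `before` (lower ns/page is better). */
double pct_faster(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}

/* Percent of the base -> regressed gap that `fixed` recovers. */
double pct_gap_closed(double base, double regressed, double fixed)
{
    assert(regressed != base);
    return 100.0 * (regressed - fixed) / (regressed - base);
}
```

For the 1 CPU row, `pct_faster(548.6, 498.6)` is about 9.1 and `pct_gap_closed(296.4, 548.6, 498.6)` is about 19.8, matching the table.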

The 1/2/4/8 CPU rows are clean screening rows. I would treat the 16 CPU row
as supporting evidence only, because it uses the larger 32 GiB guest-memory
setting. The earlier v6.12.77 QEMU failure at 16 CPUs appears to have been
transient, given the clean targeted rerun.

So the standalone reproducer keeps the same broad direction: v6.19.9 is slower
than v6.12.77, and current mm-unstable improves the result but does not return
it to the v6.12.77 level in this setup. The per-phase metrics still put most of
the gap in the protect/restore mprotect phases rather than the post-touch phase.

The lab validation summary is here:

https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/aec9695/mprotect-shared-dirty-toggle/reproducer-validation

One caveat: the standalone run does not collect the same detailed
smaps/pagemap state-shape audit as my separate state-audit run, so I would
treat this as a reproducer/timing screening check. The earlier state audit for
the same workload shape is here:

https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/aec9695/mprotect-shared-dirty-toggle/state-audit-lab

For reference, the original generated workload source and formal profile are:

https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/blob/aec9695/mprotect-shared-dirty-toggle/workload/mprotect_paths_storm.c

https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/blob/aec9695/mprotect-shared-dirty-toggle/experiments/mprotect_shared_dirty_formal_refresh.toml

I can try a narrower bisect next if this reproducer shape is useful.

Thanks,
Chengfeng

> -----Original Message-----
> From: "Pedro Falcato" <pfalcato@xxxxxxx>
> Sent: Monday, 05/25/2026 18:29:17
> To: "Chengfeng Lin" <23020251154299@xxxxxxxxxxxxxx>
> Cc: "David Hildenbrand (Arm)" <david@xxxxxxxxxx>, "Andrew Morton" <akpm@xxxxxxxxxxxxxxxxxxxx>, linux-mm@xxxxxxxxx, "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx>, "Vlastimil Babka" <vbabka@xxxxxxx>, "Jann Horn" <jannh@xxxxxxxxxx>, "Johannes Weiner" <hannes@xxxxxxxxxxx>, "Michal Hocko" <mhocko@xxxxxxxxxx>, "Qi Zheng" <zhengqi.arch@xxxxxxxxxxxxx>, "Shakeel Butt" <shakeel.butt@xxxxxxxxx>, "Chris Li" <chrisl@xxxxxxxxxx>, "Kairui Song" <kasong@xxxxxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, regressions@xxxxxxxxxxxxxxx, ljs@xxxxxxxxxx
> Subject: Re: [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x longer on v6.19 than v6.12
>
> On Fri, May 22, 2026 at 05:03:44PM +0800, Chengfeng Lin wrote:
> > Hi David,
> >
> > Thanks for the pointer. I tested the current akpm/mm mm-unstable branch at
> > 444fc9435e57, which contains Pedro's v3 two-patch mprotect series: the
> > softleaf refactor and the relevant small-folio / nr_ptes == 1 changes.
> >
> > I first ran a local sanity check, and then reran the same shared-dirty
> > full-range toggle workload on the lab machine:
> >
> >     kernels: v6.12.77, v6.19.9, akpm/mm mm-unstable 444fc9435e57
> >     QEMU: direct boot
> >     lab guest CPUs: QEMU_SMP=1/2/4/8/16
> >     lab guest memory: 14336 MiB for 1/2/4 CPU, 16384 MiB for 8 CPU,
> >         32768 MiB for 16 CPU
> >     repetitions: 9
> >     order: interleaved
> >     coverage: disabled
> >
> > The primary metric is cycle_ns_per_page, lower is better. Here "cycle" means
> > one workload iteration, not CPU cycles:
> >
> >     CPU   v6.12.77   v6.19.9   mm-unstable   mm-unstable vs v6.19   gap closed
> >       1      336.1     532.0         497.0            6.6% faster        17.9%
> >       2      369.2     581.9         503.3           13.5% faster        36.9%
> >       4      355.7     587.2         524.2           10.7% faster        27.2%
> >       8      369.7     583.6         534.2            8.5% faster        23.1%
> >      16      374.8     607.1         547.8            9.8% faster        25.5%
> >
> > The 1/2/4/8 CPU rows completed 9/9 runs for all three kernels. In the
> > 16 CPU row, v6.12.77 had one QEMU failure, so I would treat that row only
> > as a supporting trend.
> >
> > So yes, Pedro's small-folio work does reduce this synthetic shared-dirty
> > signal in my setup. It does not seem to remove most of the gap to v6.12.77:
> > looking at cycle_ns_per_page, it closes roughly 18-37% of the v6.12 ->
> > v6.19 gap in the clean 1/2/4/8 CPU lab rows.
> >
> > I also ran a separate state-shape audit, because the MADV_PAGEOUT follow-up
> > showed that a timing delta can be misleading if the compared kernels are not
> > actually operating on the same page state. For this mprotect workload, the
> > successful runs across v6.12.77, v6.19.9, and mm-unstable all used the same
> > 4 KiB shared-dirty PTE mapping shape:
> >
> >     expected_match_ratio = 100
> >     unexpected_results = 0
> >     final_vmas_avg = 1
> >     present pages before/after protect = 16384 / 16384
> >     AnonHugePages = 0
> >     KernelPageSize/MMUPageSize = 4 KiB / 4 KiB
> >     THPeligible = 0
> >
> > The state audit used the same 1/2/4/8/16 CPU and memory matrix, with 5 runs
> > per kernel. The 1/2/4/8 CPU rows completed 5/5 for all three kernels; the
> > 16 CPU row had one v6.19.9 QEMU failure, but the successful v6.19.9 runs had
> > the same state-shape values.
> >
> > I put the follow-up summaries here:
> >
> > https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/0c0e2d9/mprotect-shared-dirty-toggle/mm-unstable-lab-sanity
> >
> > https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/0c0e2d9/mprotect-shared-dirty-toggle/state-audit-lab
> >
> > Given Lorenzo's question and the synthetic nature of this workload, I will
> > avoid treating this as a strong regression claim unless I can provide a
> > standalone reproducer and/or a narrower bisect. If this remaining signal is
> > still useful to characterize, I can prepare a smaller standalone reproducer
> > or try to bisect the remaining gap.
>
> Yes, if you could give me more pointers (and a simpler repro) I would be happy
> to take a quick look. Otherwise there's not much I can do here :)
>
> --
> Pedro