Re: [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x longer on v6.19 than v6.12

From: Chengfeng Lin

Date: Mon May 18 2026 - 09:20:10 EST

Sorry, I sent the previous report with the wrong subject line.

The intended subject is:

[REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x longer on v6.19 than v6.12

The body and evidence links in that message are for the mprotect
shared-dirty PTE toggle regression. Please treat it as the mprotect
report, not as a MADV_PAGEOUT report.

#regzbot title: mm/mprotect: shared dirty PTE toggle takes ~1.6x longer on v6.19 than v6.12

Sorry for the noise.

> -----Original Message-----
> From: "Chengfeng Lin" <23020251154299@xxxxxxxxxxxxxx>
> Sent: 2026-05-18 21:01:02 (Monday)
> To: "Andrew Morton" <akpm@xxxxxxxxxxxxxxxxxxxx>, linux-mm@xxxxxxxxx
> Cc: "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx>, "Lorenzo Stoakes" <lorenzo.stoakes@xxxxxxxxxx>, "David Hildenbrand" <david@xxxxxxxxxx>, "Vlastimil Babka" <vbabka@xxxxxxx>, "Jann Horn" <jannh@xxxxxxxxxx>, "Johannes Weiner" <hannes@xxxxxxxxxxx>, "Michal Hocko" <mhocko@xxxxxxxxxx>, "Qi Zheng" <zhengqi.arch@xxxxxxxxxxxxx>, "Shakeel Butt" <shakeel.butt@xxxxxxxxx>, "Chris Li" <chrisl@xxxxxxxxxx>, "Kairui Song" <kasong@xxxxxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, regressions@xxxxxxxxxxxxxxx
> Subject: [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12
>
> Hi,
>
> I would like to report a userspace-visible mprotect() performance
> regression in a shared dirty PTE workload.
>
> The workload is intentionally narrow:
>
> - anonymous shared 64 MiB mapping
> - prefault before protection changes
> - repeatedly toggle the whole range with mprotect(PROT_READ)
> - restore with mprotect(PROT_READ | PROT_WRITE)
> - write-touch after the protection cycle
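
For readers who do not want to open the generated source, the steps listed above can be sketched roughly as follows. This is a minimal sketch and not the workload file from the evidence bundle: the helper names (now_ns, touch_pages, shared_dirty_toggle_ns_per_page) are hypothetical, and it assumes the timed region covers both mprotect() calls plus the write-touch, which may differ from how cycle_ns_per_page is actually measured.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Write one byte per page so that every page is faulted in and dirtied. */
static void touch_pages(volatile unsigned char *buf, size_t len, size_t page)
{
    for (size_t off = 0; off < len; off += page)
        buf[off] = (unsigned char)(buf[off] + 1);
}

/*
 * Hypothetical helper, not taken from the evidence bundle.
 * Returns the average nanoseconds per page per protection cycle,
 * or a negative value if mmap() or mprotect() fails.
 * Assumption: the timed region includes the write-touch.
 */
static double shared_dirty_toggle_ns_per_page(size_t len, unsigned cycles)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    assert(cycles > 0 && len >= page);

    /* Anonymous shared mapping, as in the workload description. */
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1.0;

    /* Prefault before any protection change. */
    touch_pages(buf, len, page);

    uint64_t start = now_ns();
    for (unsigned i = 0; i < cycles; i++) {
        /* Toggle the whole range read-only, then restore write access. */
        if (mprotect(buf, len, PROT_READ) != 0 ||
            mprotect(buf, len, PROT_READ | PROT_WRITE) != 0) {
            munmap(buf, len);
            return -1.0;
        }
        /* Write-touch after the protection cycle. */
        touch_pages(buf, len, page);
    }
    uint64_t elapsed = now_ns() - start;

    munmap(buf, len);
    return (double)elapsed / ((double)cycles * (double)(len / page));
}
```

Calling shared_dirty_toggle_ns_per_page(64UL << 20, 100) would exercise a 64 MiB range; the cycle count of 100 is an arbitrary illustration value, not the one used in the formal runs.
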
>
> This is not meant as a generic mprotect() regression report. In
> particular, I am not claiming that the anon/THP mprotect paths regress.
> The current signal is scoped to the shared-dirty full-range PTE toggle
> path above.
>
> The current public evidence bundle is here:
>
> https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/mprotect-shared-dirty-toggle
>
> The generated workload source used for auditing the workload semantics is
> here:
>
> https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/blob/e13469b/mprotect-shared-dirty-toggle/workload/mprotect_paths_storm.c
>
> The formal experiment profile is here:
>
> https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/mprotect-shared-dirty-toggle/experiments
>
> The formal timing runs compare v6.12.77 and v6.19.9 with similar kernel
> configuration, using QEMU direct boot. The formal performance runs were
> clean timing runs with coverage disabled. Coverage was collected
> separately and is not used for the timing numbers below.
>
> Lab environment:
>
> host label: lcf
> host kernel: Linux 6.14.0-37-generic x86_64
> QEMU: qemu-system-x86_64 8.2.2
> container/cgroup CPU set: 0,2,4,6,8,10,12,14
> container/cgroup memory limit: 16106127360 bytes
> guest memory: QEMU_MEM_MB=14336
> guest CPUs: QEMU_SMP=1/2/4
> repetitions: 9
> version order: interleaved
> performance coverage_enabled: false
>
> Primary result, cycle_ns_per_page, lower is better:
>
> CPU   v6.12.77   v6.19.9   old-lower-vs-new   v6.19/v6.12   reliability
> 1     346.8      578.1     40.0%              1.67x         reliable
> 2     394.7      641.7     38.5%              1.63x         robust-only
> 4     381.1      624.8     39.0%              1.64x         partial, same direction
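
The two derived columns can be rechecked from the raw values. The sketch below assumes that old-lower-vs-new means (new - old) / new and that v6.19/v6.12 is the plain ratio new / old; both readings reproduce the three table rows, but the struct and function names are hypothetical and not part of the framework.

```c
#include <assert.h>

/* Hypothetical container for the two derived table columns. */
struct row_summary {
    double slowdown;      /* v6.19 / v6.12 */
    double old_lower_pct; /* how much lower the old value is, in percent of the new value */
};

/* Derive both columns from a pair of cycle_ns_per_page values. */
static struct row_summary summarize(double old_ns, double new_ns)
{
    struct row_summary s;

    assert(old_ns > 0.0 && new_ns > 0.0);
    s.slowdown = new_ns / old_ns;
    s.old_lower_pct = (new_ns - old_ns) / new_ns * 100.0;
    return s;
}
```

For the 1-CPU row, summarize(346.8, 578.1) gives a slowdown of about 1.67 and an old-lower percentage of about 40.0.
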
>
> The strongest current result is the 1-CPU formal lab run. The 2-CPU case
> points in the same direction but is classified as robust-only by the
> framework. The 4-CPU case also points in the same direction but is marked
> partial because one QEMU run failed; the summary still has 8 successful
> runs for that CPU count.
>
> The current mechanism hypothesis is local to the shared-dirty PTE path.
> In v6.19, the measured hot path goes through the change_pte_range()
> batching machinery:
>
>   change_pte_range()
>     -> mprotect_folio_pte_batch()
>     -> modify_prot_start_ptes()
>     -> set_write_prot_commit_flush_ptes()
>     -> prot_commit_flush_ptes()
>
> For this shared-dirty workload, follow-up batch-probe attribution showed
> nr_ptes=1 in the measured path. The hypothesis is that the extra folio
> lookup, batch-size query, helper dispatch, and commit machinery are paid
> per 4 KiB PTE without effective batch-size amortization in this workload.
> This is an interpretation of the mechanism, not the result of a completed
> culprit-commit bisect.
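
To make the amortization argument concrete, here is a toy userspace cost model. It is not kernel code and does not measure anything: the function names and the cost parameters (setup_ns, per_pte_ns) are made up for illustration. It only shows that a fixed per-batch cost is paid once per PTE when the batch size is 1.

```c
#include <assert.h>
#include <stddef.h>

/* Number of batch iterations needed to cover nr_pages PTEs
 * when each iteration handles up to `batch` PTEs. */
static size_t batch_iterations(size_t nr_pages, size_t batch)
{
    assert(batch > 0);
    return (nr_pages + batch - 1) / batch;
}

/*
 * Toy model: each iteration pays a fixed setup cost (standing in for folio
 * lookup, batch-size query, helper dispatch, commit), and each PTE pays a
 * per-PTE cost. Both costs are hypothetical inputs, not measurements.
 */
static double modeled_ns_per_page(size_t nr_pages, size_t batch,
                                  double setup_ns, double per_pte_ns)
{
    assert(nr_pages > 0);
    double total = (double)batch_iterations(nr_pages, batch) * setup_ns +
                   (double)nr_pages * per_pte_ns;
    return total / (double)nr_pages;
}
```

A 64 MiB range of 4 KiB pages has 16384 PTEs, so batch_iterations(16384, 1) is 16384 while batch_iterations(16384, 16) is 1024. In this model the fixed cost is fully exposed at nr_ptes=1 and divided by the batch size otherwise.
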
>
> I have not bisected the exact culprit commit yet. Separate release-level
> sanity checks showed v6.18.19 already in the slow range, so the current
> best reporting range is:
>
> #regzbot introduced: v6.12..v6.18
>
> Please let me know if a standalone reproducer, a narrower bisect, or
> additional raw logs would be more useful.