Re: [linus:master] [mm] 9890ecab6a: vm-scalability.throughput 3.8% regression
From: Ankur Arora
Date: Wed Mar 11 2026 - 15:06:02 EST
kernel test robot <oliver.sang@xxxxxxxxx> writes:
> Hello,
>
> kernel test robot noticed a 3.8% regression of vm-scalability.throughput on:
>
[ ... ]
> testcase: vm-scalability
> config: x86_64-rhel-9.4
> compiler: gcc-14
> test machine: 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory
> parameters:
>
> runtime: 300s
> size: 8T
> test: anon-w-seq-mt
> cpufreq_governor: performance
This test exercises the THP sequential zeroing path.
> 15142 -4.2% 14512 vm-scalability.time.system_time
> 16939 +6.1% 17975 vm-scalability.time.user_time
stime drops because folio_zero_user() is more efficient, but utime goes
up because of a higher user-side cache-miss rate: folio_zero_user() now
clears the folio sequentially instead of in the earlier left-right
fashion that left the faulting page cache-hot:
> 61.69 +9.8 71.51 perf-stat.i.cache-miss-rate%
> 6.147e+08 +10.7% 6.805e+08 perf-stat.i.cache-misses
> 9.904e+08 -4.7% 9.436e+08 perf-stat.i.cache-references
> 2.17 +8.2% 2.35 perf-stat.i.cpi
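To make the ordering difference concrete, here is a user-space sketch
(not the kernel code; PAGE_SZ, NPAGES and both helper names are
illustrative, and the second helper is only a rough approximation of
the earlier scheme):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SZ 4096
#define NPAGES  512	/* 4K subpages in one 2M PMD folio */

/* New behaviour: one straight sequential pass over the folio. */
static void clear_sequential(unsigned char *folio)
{
	memset(folio, 0, (size_t)NPAGES * PAGE_SZ);
}

/*
 * Rough approximation of the earlier left-right scheme: clear the
 * subpages on either side of the faulting page first and touch the
 * faulting page last, so the data the user faults on is still
 * cache-hot when the fault returns.
 */
static void clear_fault_page_last(unsigned char *folio, int fault_idx)
{
	int i;

	for (i = 0; i < fault_idx; i++)
		memset(folio + (size_t)i * PAGE_SZ, 0, PAGE_SZ);
	for (i = NPAGES - 1; i > fault_idx; i--)
		memset(folio + (size_t)i * PAGE_SZ, 0, PAGE_SZ);
	memset(folio + (size_t)fault_idx * PAGE_SZ, 0, PAGE_SZ);
}
```

Both produce the same zeroed folio; only the eviction order of the
freshly written lines differs, which is what the utime numbers react
to.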
I had noted similar behaviour with anon-w-seq-hugetlb in 93552c9a3350:
vm-scalability/anon-w-seq-hugetlb: this workload runs with 384 processes
(one for each CPU) each zeroing anonymously mapped hugetlb memory which
is then accessed sequentially.
                           stime                   utime
  discontiguous-page   1739.93 ( +- 6.15% )   1016.61 ( +- 4.75% )
  contiguous-page      1853.70 ( +- 2.51% )   1187.13 ( +- 3.50% )
  batched-pages        1756.75 ( +- 2.98% )   1133.32 ( +- 4.89% )
  neighbourhood-last   1725.18 ( +- 4.59% )   1123.78 ( +- 7.38% )
Both stime and utime respond roughly as expected. There is a fair
amount of run-to-run variation, but the general trend is that stime
drops and utime increases. There are a few oddities, such as
contiguous-page performing quite differently from batched-pages.
As such this is likely an uncommon access pattern: we saturate the
memory bandwidth (since all CPUs are running the test) and at the same
time are cache constrained because we access the entire region.
Ankur
> cb431accb36e51b6 9890ecab6ad9c0d3d342469f3b6
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 0.08 ± 3% +8.7% 0.09 ± 3% vm-scalability.free_time
> 357969 -6.6% 334511 vm-scalability.median
> 1.034e+08 -3.8% 99382138 vm-scalability.throughput
> 634243 -13.6% 548120 ± 6% vm-scalability.time.involuntary_context_switches
> 12706518 -6.6% 11872543 vm-scalability.time.minor_page_faults
> 15142 -4.2% 14512 vm-scalability.time.system_time
> 16939 +6.1% 17975 vm-scalability.time.user_time
> 251227 -6.8% 234071 vm-scalability.time.voluntary_context_switches
> 1.791e+10 -6.6% 1.674e+10 vm-scalability.workload
> 0.30 -7.5% 0.28 turbostat.IPC
> 9203 -5.5% 8693 vmstat.system.cs
> 0.08 +0.0 0.08 mpstat.cpu.all.soft%
> 25.14 +1.5 26.62 mpstat.cpu.all.usr%
> 3.13 +18.3% 3.71 perf-stat.i.MPKI
> 6.22e+10 -6.6% 5.81e+10 perf-stat.i.branch-instructions
> 61.69 +9.8 71.51 perf-stat.i.cache-miss-rate%
> 6.147e+08 +10.7% 6.805e+08 perf-stat.i.cache-misses
> 9.904e+08 -4.7% 9.436e+08 perf-stat.i.cache-references
> 9303 -5.2% 8823 perf-stat.i.context-switches
> 2.17 +8.2% 2.35 perf-stat.i.cpi
> 598.97 -4.6% 571.28 perf-stat.i.cpu-migrations
> 1.95e+11 -6.6% 1.822e+11 perf-stat.i.instructions
> 0.47 -7.2% 0.43 perf-stat.i.ipc
> 43153 -6.5% 40334 perf-stat.i.minor-faults
> 43153 -6.5% 40335 perf-stat.i.page-faults
> 3.16 +18.5% 3.74 perf-stat.overall.MPKI
> 0.02 +0.0 0.03 perf-stat.overall.branch-miss-rate%
> 62.11 +10.1 72.19 perf-stat.overall.cache-miss-rate%
> 2.19 +8.3% 2.37 perf-stat.overall.cpi
> 692.89 -8.6% 633.07 perf-stat.overall.cycles-between-cache-misses
> 0.46 -7.6% 0.42 perf-stat.overall.ipc
> 6.121e+10 -6.8% 5.705e+10 perf-stat.ps.branch-instructions
> 6.054e+08 +10.5% 6.689e+08 perf-stat.ps.cache-misses
> 9.747e+08 -4.9% 9.266e+08 perf-stat.ps.cache-references
> 9124 -5.6% 8613 perf-stat.ps.context-switches
> 583.66 -4.9% 555.21 perf-stat.ps.cpu-migrations
> 1.919e+11 -6.8% 1.789e+11 perf-stat.ps.instructions
> 42389 -6.7% 39549 perf-stat.ps.minor-faults
> 42389 -6.7% 39549 perf-stat.ps.page-faults
> 5.812e+13 -6.5% 5.434e+13 perf-stat.total.instructions
> 40.26 -40.3 0.00 perf-profile.calltrace.cycles-pp.clear_subpage.folio_zero_user.vma_alloc_anon_folio_pmd.__do_huge_pmd_anonymous_page.__handle_mm_fault
> 40.76 -2.1 38.66 perf-profile.calltrace.cycles-pp.folio_zero_user.vma_alloc_anon_folio_pmd.__do_huge_pmd_anonymous_page.__handle_mm_fault.handle_mm_fault
> 42.59 -2.0 40.61 perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
> 42.54 -2.0 40.57 perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
> 42.54 -2.0 40.57 perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.do_access
> 42.40 -2.0 40.43 perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
> 42.32 -2.0 40.36 perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
> 42.23 -2.0 40.27 perf-profile.calltrace.cycles-pp.__do_huge_pmd_anonymous_page.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
> 41.70 -2.0 39.74 perf-profile.calltrace.cycles-pp.vma_alloc_anon_folio_pmd.__do_huge_pmd_anonymous_page.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
> 0.76 -0.0 0.72 perf-profile.calltrace.cycles-pp.vma_alloc_folio_noprof.vma_alloc_anon_folio_pmd.__do_huge_pmd_anonymous_page.__handle_mm_fault.handle_mm_fault
> 0.72 -0.0 0.68 perf-profile.calltrace.cycles-pp.__alloc_frozen_pages_noprof.alloc_pages_mpol.vma_alloc_folio_noprof.vma_alloc_anon_folio_pmd.__do_huge_pmd_anonymous_page
> 0.67 -0.0 0.64 perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_frozen_pages_noprof.alloc_pages_mpol.vma_alloc_folio_noprof.vma_alloc_anon_folio_pmd
> 0.72 -0.0 0.69 perf-profile.calltrace.cycles-pp.alloc_pages_mpol.vma_alloc_folio_noprof.vma_alloc_anon_folio_pmd.__do_huge_pmd_anonymous_page.__handle_mm_fault
> 0.56 -0.0 0.54 perf-profile.calltrace.cycles-pp.prep_new_page.get_page_from_freelist.__alloc_frozen_pages_noprof.alloc_pages_mpol.vma_alloc_folio_noprof
> 0.00 +0.8 0.76 ± 2% perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.folio_zero_user.vma_alloc_anon_folio_pmd.__do_huge_pmd_anonymous_page.__handle_mm_fault
> 30.25 +1.2 31.46 perf-profile.calltrace.cycles-pp.do_rw_once
> 40.49 -40.5 0.00 perf-profile.children.cycles-pp.clear_subpage
> 42.61 -2.0 40.63 perf-profile.children.cycles-pp.asm_exc_page_fault
> 42.55 -2.0 40.58 perf-profile.children.cycles-pp.exc_page_fault
> 42.54 -2.0 40.57 perf-profile.children.cycles-pp.do_user_addr_fault
> 42.40 -2.0 40.43 perf-profile.children.cycles-pp.handle_mm_fault
> 42.33 -2.0 40.36 perf-profile.children.cycles-pp.__handle_mm_fault
> 42.23 -2.0 40.27 perf-profile.children.cycles-pp.__do_huge_pmd_anonymous_page
> 41.70 -2.0 39.74 perf-profile.children.cycles-pp.vma_alloc_anon_folio_pmd
> 40.83 -1.9 38.92 perf-profile.children.cycles-pp.folio_zero_user
> 63.93 -1.2 62.77 perf-profile.children.cycles-pp.do_access
> 0.95 -0.0 0.91 perf-profile.children.cycles-pp.__alloc_frozen_pages_noprof
> 0.78 -0.0 0.74 perf-profile.children.cycles-pp.vma_alloc_folio_noprof
> 0.95 -0.0 0.92 perf-profile.children.cycles-pp.alloc_pages_mpol
> 0.79 -0.0 0.76 perf-profile.children.cycles-pp.get_page_from_freelist
> 0.63 -0.0 0.60 perf-profile.children.cycles-pp.prep_new_page
> 40.31 +2.5 42.80 perf-profile.children.cycles-pp.do_rw_once
> 39.77 -39.8 0.00 perf-profile.self.cycles-pp.clear_subpage
> 9.54 -0.3 9.23 perf-profile.self.cycles-pp.do_access
> 0.55 -0.0 0.53 perf-profile.self.cycles-pp.prep_new_page
> 38.35 +2.6 40.96 perf-profile.self.cycles-pp.do_rw_once
> 0.36 ± 2% +38.0 38.32 perf-profile.self.cycles-pp.folio_zero_user
>
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.