[RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
From: Joseph Salisbury
Date: Tue Apr 07 2026 - 16:10:09 EST
Hello,
I would like to ask for feedback on an MM performance issue triggered by stress-ng's mremap stressor:
stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief
This was first investigated as a possible regression from 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap"), but the current evidence suggests that commit is mostly exposing an older problem for this workload rather than directly causing it.
Observed behavior:
The metrics below are in stress-ng's standard format:
stressor   bogo ops  real time  usr time  sys time   bogo ops/s     bogo ops/s
                       (secs)    (secs)    (secs)   (real time)  (usr+sys time)
On a 5.15-based kernel, the workload behaves much worse when swapping is disabled:
swap enabled:
mremap 1660980 31.08 64.78 84.63 53437.09 11116.73
swap disabled:
mremap 40786258 27.94 15.41 15354.79 1459749.43 2653.59
On a 6.12-based kernel with swap enabled, the same high-system-time behavior is also observed:
mremap 77087729 21.50 29.95 30558.08 3584738.22 2520.19
A recent 7.0-rc5-based mainline build still behaves similarly:
mremap 39208813 28.12 12.34 15318.39 1394408.50 2557.53
So this does not appear to be already fixed upstream.
The current theory is that 0ca0c24e3211 merely exposes the problem for this zero-page-heavy workload. Before that change, swap-enabled runs actually swapped these pages out; after it, zero pages are tracked in the swap bitmap instead of being written to swap, so the swap-enabled case now behaves much like the swap-disabled one.
Perf data supports the idea that the expensive behavior is lruvec (LRU) lock contention caused by short-lived populate/unmap churn.
The dominant stacks on the bad cases include:
vm_mmap_pgoff
__mm_populate
populate_vma_page_range
lru_add_drain
folio_batch_move_lru
folio_lruvec_lock_irqsave
native_queued_spin_lock_slowpath
and:
__x64_sys_munmap
__vm_munmap
...
release_pages
folios_put_refs
__page_cache_release
folio_lruvec_relock_irqsave
native_queued_spin_lock_slowpath
It was also found that adding '--mremap-numa' changes the behavior substantially:
stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa --metrics-brief
mremap 2570798 29.39 8.06 106.23 87466.50 22494.74
So it appears that either actual swapping, or the mbind(..., MPOL_MF_MOVE) path taken by '--mremap-numa', avoids most of the excessive system time.
Does this look like a known MM scalability issue around short-lived MAP_POPULATE / munmap churn?
REPRODUCER:
The issue is reproducible with stress-ng's mremap stressor:
stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief
On older kernels, the bad behavior is easiest to expose by disabling swap first:
swapoff -a
stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief
On kernels with 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap") or newer, the same bad behavior can be seen even with swap enabled, because this zero-page-heavy workload no longer actually swaps pages and behaves much like the swap-disabled case.
Typical bad-case behavior:
- Very large aggregate sys time during a 30s run (for example, ~15000s or higher)
- Poor bogo ops/s measured against usr+sys time (~2500 range in our tests)
- Perf shows time dominated by:
vm_mmap_pgoff -> __mm_populate -> populate_vma_page_range -> lru_add_drain
and
munmap -> release_pages -> __page_cache_release
with heavy time in folio_lruvec_lock_irqsave/native_queued_spin_lock_slowpath
Diagnostic variant:
stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa --metrics-brief
That variant greatly reduces the system time, which is one of the clues that the overhead depends on which MM path the workload takes.
Thanks in advance!
Joe