Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
From: Joseph Salisbury
Date: Thu Apr 09 2026 - 13:27:07 EST
On 4/9/26 12:37 PM, Haakon Bugge wrote:
Adding Lorenzo Stoakes to Cc.

I reported this internally and have worked with Joseph on it. I tested
v7.0-rc7-68-g7f87a5ea75f01 ("-"), "Base", versus the same kernel plus John
Hubbard's patch ("+"), "Test".

On 8 Apr 2026, at 16:27, Joseph Salisbury <joseph.salisbury@xxxxxxxxxx> wrote:
On 4/8/26 4:09 AM, David Hildenbrand (Arm) wrote:
It was also found that adding '--mremap-numa' changes the behavior
substantially:

stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa
--metrics-brief
mremap 2570798 29.39 8.06 106.23 87466.50 22494.74

Same thought, it's unclear to me why we should care here. In particular,
'--mremap-numa' means "assign memory mapped pages to randomly selected
NUMA nodes. This is disabled for systems that do not support NUMA.", so
this is just sharding your lock contention across your NUMA nodes (you
have an lruvec per node).

Yes. Is this an actual issue on some workload?

Currently this is only showing up with that particular stress test. We
will try John's patch and provide feedback.
So it's possible that either actual swapping, or the mbind(...,
MPOL_MF_MOVE) path used by '--mremap-numa', removes most of the excessive
system time.
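To make that path concrete, here is a minimal userspace sketch of an
mbind(..., MPOL_MF_MOVE)-style migration. This is my assumption of roughly
what '--mremap-numa' ends up exercising, not stress-ng's actual code; the
target node is just an example value:

#define _GNU_SOURCE
#include <numaif.h>        /* mbind(); link with -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 4096;                   /* matches --mremap-bytes 4K */
	int target_node = 1;                 /* example node; adjust to your topology */
	unsigned long nodemask = 1UL << target_node;

	/* Populate up front so mbind() has real pages to migrate. */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}
	memset(p, 0x5a, len);

	/* MPOL_MF_MOVE migrates the already-present pages to the target node,
	 * which means isolating them from (and re-adding them to) the
	 * per-node LRU lists. */
	if (mbind(p, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask),
		  MPOL_MF_MOVE) != 0)
		perror("mbind");

	munmap(p, len);
	return EXIT_SUCCESS;
}

If pages end up randomly spread over both nodes this way, whatever
contention remains is split between the two per-node lruvec locks, which
would fit the observation above.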
Does this look like a known MM scalability issue around short-lived
MAP_POPULATE / munmap churn?
when talking about excessive use of zero-filled pages.
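For the churn itself, here is a minimal, self-contained sketch of the
pattern I believe is being described: short-lived MAP_POPULATE mappings
that are grown with mremap() and immediately torn down again. It is only
an approximation of what 'stress-ng --mremap' does, not its actual code,
and the iteration count is arbitrary:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define MAP_SZ 4096UL          /* matches --mremap-bytes 4K */
#define ITERS  100000UL        /* arbitrary, for illustration only */

int main(void)
{
	for (unsigned long i = 0; i < ITERS; i++) {
		/* MAP_POPULATE faults the pages in at mmap() time, so every
		 * iteration adds freshly allocated pages to the LRU... */
		void *p = mmap(NULL, MAP_SZ, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return EXIT_FAILURE;
		}

		/* ...mremap() grows the mapping, as the stressor does... */
		void *q = mremap(p, MAP_SZ, 2 * MAP_SZ, MREMAP_MAYMOVE);
		if (q == MAP_FAILED) {
			perror("mremap");
			munmap(p, MAP_SZ);
			return EXIT_FAILURE;
		}
		memset(q, 0xaa, 2 * MAP_SZ);

		/* ...and the short-lived mapping is torn down again, removing
		 * the pages from the LRU on the unmap path. */
		munmap(q, 2 * MAP_SZ);
	}
	return EXIT_SUCCESS;
}

Run a few thousand of these concurrently (the report above uses 8192
stressors) and the add/remove traffic on the LRU is where the lruvec lock
contention would be expected to show up.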
Thanks for all the feedback, everyone!
Stress-ng command: stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief
System is an AMD EPYC 9J45:
NUMA node(s): 2
NUMA node0 CPU(s): 0-127,256-383
NUMA node1 CPU(s): 128-255,384-511
The stress-ng command was run ten times and here are the averages and pstdev:
      bogo ops/s    pstdev    system time    pstdev
      (realtime)
   --------------------------------------------------
 -     3192638        35%        24041         32%
 +     3657904         5%        15278          0%
This is a 15% improvement in bogo ops/s (realtime) and a decent 36% reduction in system time.
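(Assuming those percentages are computed directly from the averages above:
(3657904 - 3192638) / 3192638 ≈ 14.6%, i.e. roughly 15%, and
(24041 - 15278) / 24041 ≈ 36.5%, i.e. roughly 36%.)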
I shamelessly copied and modified the fio command from [1]. I ran:
# fio -filename=/dev/nvme0n1 -direct=0 -thread -size=1024G -rwmixwrite=30 \
--norandommap --randrepeat=0 -ioengine=mmap -bs=4k -numjobs=1024 -runtime=3600 \
--time_based -group_reporting -name=mytest
(that is, one hour runtime)
- read: IOPS=14.0M, BW=53.4GiB/s (57.3GB/s)(188TiB/3608413msec)
+ read: IOPS=16.0M, BW=61.2GiB/s (65.7GB/s)(215TiB/3600051msec)
- READ: bw=53.4GiB/s (57.3GB/s), 53.4GiB/s-53.4GiB/s (57.3GB/s-57.3GB/s), io=188TiB (207TB), run=3608413-3608413msec
+ READ: bw=61.2GiB/s (65.7GB/s), 61.2GiB/s-61.2GiB/s (65.7GB/s-65.7GB/s), io=215TiB (237TB), run=3600051-3600051msec
Also, running Base, I see tons of:
Jobs: 726 (f=726): [_(2),R(1),_(1),R(3),_(4),R(6),_(1),R(2),_(2),R(2),_(3),R(1),_(5),R(2),_(1),R(2),_(1),R(1),_(2),R(2),_(1),R(1),_(1),R(2),_(1),R(3),_(1),R(3),_(1),R(1),_(1),R(1),_(1),R(1),_(1),R(3),_(1),R(3),_(1),R(1),_(3),R(1),_(1),R(5),_(1),R(5),_(1),R(1),_(2),R(1),_(4),R(2),_(1),R(3),_(1),R(3),_(1),R(1),_(2),R(1),_(1),R(8),_(1),R(4),_(1),R(3),_(1),R(1),_(1),R(2),_(1),R(7),_(2),R(2)
when the fio test terminates, which I do not see using Test. I take that to mean the threads do not terminate in a timely manner on the Base kernel.
Thxs, Håkon
[1] https://lkml.org/lkml/2024/7/3/1049