Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
From: Lorenzo Stoakes
Date: Thu Apr 09 2026 - 14:24:46 EST
On Wed, Apr 08, 2026 at 10:09:23AM +0200, David Hildenbrand (Arm) wrote:
> >>
> >> It was also found that adding '--mremap-numa' changes the behavior
> >> substantially:
> >
> > "assign memory mapped pages to randomly selected NUMA nodes. This is
> > disabled for systems that do not support NUMA."
> >
> > so this is just sharding your lock contention across your NUMA nodes (you
> > have an lruvec per node).
> >
> >>
> >> stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa
> >> --metrics-brief
> >>
> >> mremap 2570798 29.39 8.06 106.23 87466.50 22494.74
> >>
> >> So it's possible that either actual swapping, or the mbind(...,
> >> MPOL_MF_MOVE) path used by '--mremap-numa', removes most of the excessive
> >> system time.
> >>
> >> Does this look like a known MM scalability issue around short-lived
> >> MAP_POPULATE / munmap churn?
> >
> > Yes. Is this an actual issue on some workload?
>
> Same thought, it's unclear to me why we should care here. In particular,
> when talking about excessive use of zero-filled pages.
Yup, I fear that this might also be misleading - stress-ng is designed to
saturate.
When swapping is enabled, it ends up rate-limited by I/O (there is simultaneous
MADV_PAGEOUT occurring).
Then you see lower systime because... the system is sleeping more :)
The zero pages patch stops all that, so you throttle on the next thing - the
lruvec lock.
If you distribute pages across NUMA nodes rather than not at all (the default),
you naturally spread the load evenly across the lruvec locks, because there is
one lruvec per node (per memcg).
So all this is arbitrary - it essentially amounts to asking 'what do I rate-limit
on?' And 'optimising' things to produce different outcomes, esp. on metrics like
system time, doesn't really make sense.
If you absolutely hammer the hell out of the populate/unmap paths, unevenly over
NUMA nodes, you'll see system time explode, because now you're hammering the
lruvec lock, which is a spinlock (it has to be, due to possible invocation from
irq context).
You're not actually asking 'how fast is this in a real workload?' or even 'how
fast is this microbenchmark?', you're asking 'what does saturating this look
like?'.
So it's rather asking the wrong question, I fear, and a reason why
stress-ng-as-benchmark has to be treated with caution.
I would definitely recommend examining any underlying real-world workload that
is triggering the issue rather than stress-ng, and then examining closely what's
going on there.
This whole thing might be unfortunately misleading, as you observe saturation of
lruvec lock, but in reality it might simply be a manifestation of:
- syscalls on the hotpath
- not distributing work sensibly over NUMA nodes
Perhaps it is indeed an issue with the lruvec lock that needs attention, but with
a real-world use case we can perhaps be a little more sure it's that rather than
stress-ng doing its thing :)
>
> --
> Cheers,
>
> David
Thanks, Lorenzo