Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths

From: Lorenzo Stoakes

Date: Thu Apr 09 2026 - 14:24:46 EST


On Wed, Apr 08, 2026 at 10:09:23AM +0200, David Hildenbrand (Arm) wrote:
> >>
> >> It was also found that adding '--mremap-numa' changes the behavior
> >> substantially:
> >
> > "assign memory mapped pages to randomly selected NUMA nodes. This is
> > disabled for systems that do not support NUMA."
> >
> > so this is just sharding your lock contention across your NUMA nodes (you
> > have an lruvec per node).
> >
> >>
> >> stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa
> >> --metrics-brief
> >>
> >> mremap 2570798 29.39 8.06 106.23 87466.50 22494.74
> >>
> >> So it's possible that either actual swapping, or the mbind(...,
> >> MPOL_MF_MOVE) path used by '--mremap-numa', removes most of the excessive
> >> system time.
> >>
> >> Does this look like a known MM scalability issue around short-lived
> >> MAP_POPULATE / munmap churn?
> >
> > Yes. Is this an actual issue on some workload?
>
> Same thought, it's unclear to me why we should care here. In particular,
> when talking about excessive use of zero-filled pages.

Yup, I fear that this might also be misleading - stress-ng is designed to
saturate.

When swapping is enabled, it ends up rate-limited by I/O (there is simultaneous
MADV_PAGEOUT occurring).

Then you see lower systime because... the system is sleeping more :)

The zero pages patch stops all that, so you throttle on the next thing - the
lruvec lock.

If you group by NUMA node rather than just not-at-all (the default), you
naturally distribute the contention evenly across lruvec locks, because there's
one per node (+ per memcg).

So all this is arbitrary; it is essentially asking 'what do I rate-limit on?'

And 'optimising' things to give different outcomes, esp. on things like system
time, doesn't really make sense.

If you absolutely hammer the hell out of the populate/unmap paths, unevenly over
NUMA nodes, you'll see system time explode, because now you're hammering the
lruvec lock, which is a spinlock (it has to be, due to possible invocation from
IRQ context).

You're not actually asking 'how fast is this in a real workload?' or even 'how
fast is this microbenchmark?'; you're asking 'what does saturating this look
like?'.

So it's rather asking the wrong question, I fear, which is a reason why
stress-ng-as-benchmark has to be treated with caution.

I would definitely recommend examining any underlying real-world workload that
is triggering the issue rather than stress-ng, and then examining closely what's
going on there.

This whole thing might be unfortunately misleading, as you observe saturation of
lruvec lock, but in reality it might simply be a manifestation of:

- syscalls on the hotpath
- not distributing work sensibly over NUMA nodes

Perhaps it is indeed an issue with the lruvec that needs attention, but with a
real-world use case we can perhaps be a little more sure it's that, rather than
stress-ng doing its thing :)

>
> --
> Cheers,
>
> David

Thanks, Lorenzo