Re: [PATCH] mm: vma: skip anonymous vma when inserting vma to file rmap tree

From: Pedro Falcato
Date: Fri Mar 07 2025 - 08:35:19 EST


On Fri, Mar 7, 2025 at 1:12 PM Lorenzo Stoakes
<lorenzo.stoakes@xxxxxxxxxx> wrote:
>
> On Thu, Mar 06, 2025 at 01:49:48PM -0800, Yang Shi wrote:
> > LKP reported an 800% performance improvement for the small-allocs benchmark
> > from vm-scalability [1] with the patch ("/dev/zero: make private mapping
> > full anonymous mapping") [2], but the patch was nack'ed since it changes
> > the output of smaps somewhat.
>
> Yeah sorry about that, but unfortunately something we really do have to
> think about (among other things, the VMA edge cases are always the source
> of weirdness...)
>
> >
> > The profiling shows that one of the major sources of the performance
> > improvement is reduced contention on i_mmap_rwsem.
>
> Great work tracking that down! Sorry I lost track of the other thread.
>
> >
> > The small-allocs benchmark creates a lot of 40K-sized memory maps by
> > mmap'ing private /dev/zero, then triggers page faults on the mappings.
> > When a private mapping of /dev/zero is created, an anonymous VMA is
> > created, but it has a valid vm_file. The kernel basically assumes
> > anonymous VMAs have a NULL vm_file; for example, mmap inserts a VMA
> > into the file rmap tree if vm_file is not NULL. So the private
> > /dev/zero mapping gets inserted into the file rmap tree, which results
> > in the contention on i_mmap_rwsem. But it is actually an anonymous VMA,
> > so it is pointless to insert it into the file rmap tree.
>
> Ughhhh god haha.
>
> >
> > Skip anonymous VMAs in this case. An over-400% performance improvement
> > was reported [3].
>
> That's insane. Amazing work.
>

Ok, so the real question (to Yang) is: who are these /dev/zero users
that require an insane degree of scalability, and why didn't they
switch to regular MAP_ANONYMOUS? Are they in the room with us?

> >
> > It is not on par with the 800% improvement from the original patch
> > because the page fault handler needs to access some members of struct
> > file when vm_file is not NULL, for example f_mode and f_mapping, which
> > sit in the same cacheline as the file refcount. When a file is mmap'ed,
> > the file refcount is incremented and decremented, which causes a bad
> > cache false-sharing problem. Further debugging showed that checking
> > whether the VMA is anonymous can alleviate the problem, but I'm not
> > sure whether that is the best way to handle it; maybe we should
> > consider shuffling the layout of struct file.
>
> Interesting, I guess you'll take a look at this also?

... And this is probably a non-issue in 99% of !/dev/zero mmaps unless
it's something like libc.so.6 at an insane rate of execs/second.

This seems like a patch in search of a problem and I really don't see
why we should wart up the mmap code otherwise. Not that I have a huge
problem with this patch, which is somewhat simple and obvious.
It'd be great if there were a real workload driving this rather than
useless synthetic benchmarks.

--
Pedro