Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

From: Liam R. Howlett
Date: Thu Apr 11 2024 - 14:14:31 EST


* Peter Xu <peterx@xxxxxxxxxx> [240410 13:06]:
> anon_vma is a tricky object in the context of per-vma lock, because it's
> racy to modify it in that context and mmap lock is needed if it's not
> stable yet.

How is it racy in your view? There isn't a way to get in a situation
where its state isn't certain, is there?

>
> So far there are three places that sanity checks anon_vma for that:
>
> - lock_vma_under_rcu(): this is the major entrance of per-vma lock, where
> we have taken care of anon memory v.s. potential anon_vma allocations.

Well, not exactly. Single vma faults vs potential modifications beyond
one vma (including uses of adjacent vmas with anon_vmas)

>
> - lock_vma(): even if it looks so generic as an API, it's only used in
> userfaultfd context to leverage per-vma locks. It does extra check
> over MAP_PRIVATE file mappings for the same anon_vma issue.

This static function is in mm/userfaultfd so, yes, it's not generic -
the name is generic, but I didn't see a reason to hold up the patch for
a static name that is descriptive. It's static so I'm not concerned
about usage growing.

I would expect such a check will eventually need to be moved to common
code, and then potentially change the name to something more
descriptive. This seems premature with a single user though.

>
> - vmf_anon_prepare(): it works for private file mapping faults just like
> what lock_vma() wanted to cover above. One trivial difference is in
> some extremely corner case, the fault handler will still allow per-vma
> fault to happen, like a READ on a privately mapped file.

What is happening here is that we are detecting when the virtual memory
space is stable vs when the vma is stable. In some cases, when we need
to check prev/next, then we need to make sure the virtual memory space
is stable. In other cases, the vma is all that matters to the operation
and so we can continue without stopping anyone else from modifying the
virtual memory space - that is, we are allowing writes in other areas.

>
> The question is whether that's intended to make it as complicated. Per my
> question in the thread, it is not intended, and Suren also seems to agree [1].
>
> So the trivial side effect of such patch is:
>
> - We may do slightly better on the first WRITE of a private file mapping,
> because we can retry earlier (in lock_vma_under_rcu(), rather than
> vmf_anon_prepare() later).
>
> - We may always use mmap lock for the initial READs on a private file
> mappings, while before this patch it _can_ (only when no WRITE ever
> happened... but it doesn't make much sense for a MAP_PRIVATE..) do the
> read fault with per-vma lock.
>
> Then noted that right after any WRITE the anon_vma will be stablized, then
> there will be no difference.

In my view, the anon_vma is always stable. During a write it is
initialised.

>And I believe that should be the majority
> cases too; I also did try to run a program, writting to MAP_PRIVATE file
> memory (that I pre-headed in the page cache) and I can hardly measure a
> difference in performance.
>
> Let's simply ignore all those trivial corner cases and unify the anon_vma
> check from three places into one. I also didn't check the rest users of
> lock_vma_under_rcu(), where in a !fault context it could even fix something
> that used to race with private file mappings but I didn't check further.

This change will increase mmap semaphore contention. I'd like to move
to a more common structured layout, but I think we need to find a way to
do this without failing the lock_vma_under_rcu() more frequently. In
fact, we need to be looking for ways to make it fail less.

The UFFD code is more strict on what is acceptable for a per-vma lock
[1]. This user has a restriction that will decrease the benefits of the
per-vma lock, but we shouldn't make this case the common one.

As I'm sure you are aware, the page fault path is the most common path
for using per-vma locks and should minimize taking the mmap lock as much
as possible.

I don't think we should increase the lock contention to simplify the
code.

Thanks,
Liam

[1] https://lore.kernel.org/lkml/CAG48ez0AdTijvuh0xueg_spwNE9tVcPuvqT9WpvmtiNNudQFMw@xxxxxxxxxxxxxx/