Re: [PATCH v3 8/8] mm/maps: execute PROCMAP_QUERY ioctl under RCU
From: Andrii Nakryiko
Date: Thu May 01 2025 - 18:10:40 EST
On Tue, Apr 29, 2025 at 10:25 AM Jann Horn <jannh@xxxxxxxxxx> wrote:
>
> On Tue, Apr 29, 2025 at 7:15 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > On Tue, Apr 29, 2025 at 8:56 AM Jann Horn <jannh@xxxxxxxxxx> wrote:
> > > On Wed, Apr 23, 2025 at 12:54 AM Andrii Nakryiko
> > > <andrii.nakryiko@xxxxxxxxx> wrote:
> > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > > > > Utilize speculative vma lookup to find and snapshot a vma without
> > > > > taking mmap_lock during PROCMAP_QUERY ioctl execution. Concurrent
> > > > > address space modifications are detected and the lookup is retried.
> > > > > While we take the mmap_lock for reading during such contention, we
> > > > > do that momentarily only to record new mm_wr_seq counter.
> > > >
> > > > PROCMAP_QUERY is an even more obvious candidate for fully lockless
> > > > speculation, IMO (because it's more obvious that vma's use is
> > > > localized to do_procmap_query(), instead of being spread across
> > > > m_start/m_next and m_show as with seq_file approach). We do
> > > > rcu_read_lock(), mmap_lock_speculate_try_begin(), query for VMA (no
> > > > mmap_read_lock), use that VMA to produce (speculative) output, and
> > > > then validate that VMA or mm_struct didn't change with
> > > > mmap_lock_speculate_retry(). If it did - retry, if not - we are done.
> > > > No need for vma_copy and any gets/puts, no?
> > >
> > > I really strongly dislike this "fully lockless" approach because it
> > > means we get data races all over the place, and it gets hard to reason
> > > about what happens especially if we do anything other than reading
> > > plain data from the VMA. When reading the implementation of
> > > do_procmap_query(), at basically every memory read you'd have to think
> > > twice as hard to figure out which fields can be concurrently updated
> > > elsewhere and whether the subsequent sequence count recheck can
> > > recover from the resulting badness.
> > >
> > > Just as one example, I think get_vma_name() could (depending on
> > > compiler optimizations) crash with a NULL deref if the VMA's ->vm_ops
> > > pointer is concurrently changed to &vma_dummy_vm_ops by vma_close()
> > > between "if (vma->vm_ops && vma->vm_ops->name)" and
> > > "vma->vm_ops->name(vma)". And I think this illustrates how the "fully
> > > lockless" approach creates more implicit assumptions about the
> > > behavior of core MM code, which could be broken by future changes to
> > > MM code.
> >
> > Yeah, I'll need to re-evaluate such an approach after your review. I
> > like having get_stable_vma() to obtain a completely stable version of
> > the vma in a localized place and then stop worrying about possible
> > races. If implemented correctly, would that be enough to address your
> > concern, Jann?
>
> Yes, I think a stable local snapshot of the VMA (where tricky data
> races are limited to the VMA snapshotting code) is a good tradeoff.
I'm not sure I agree with VMA snapshot being better either, tbh. It is
error-prone to have a byte-by-byte local copy of VMA (which isn't
really that VMA anymore), and passing it into ops callbacks (which
expect "real" VMA)... Who guarantees that this won't backfire,
depending on vm_ops implementations? And constantly copying 176+ bytes
just to access a few fields out of it is a bit unfortunate...
Also taking mmap_read_lock() sort of defeats the point of "RCU-only
access". It's still locking/unlocking and bouncing cache lines between
writer and reader frequently. How slow is per-VMA formatting? If we
take mmap_read_lock, format VMA information into a buffer under this
lock, and drop the mmap_read_lock, would it really be that much slower
compared to what Suren is doing in this patch set? And if no, that
would be so much simpler compared to this semi-locked/semi-RCU way
that is added in this patch set, no?
But I do agree that vma->vm_ops->name access is hard to do in a
completely lockless way reliably. But also how frequently VMAs have
custom names/anon_vma_name? What if we detect that VMA has some
"fancy" functionality (like this custom name thing), and just fallback
to mmap_read_lock-protected logic, which needs to be supported as a
fallback even for lockless approach?
This way we can process most (typical) VMAs completely locklessly,
while not adding any extra assumptions for all the potentially
complicated data pieces. WDYT?