Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount

From: Lorenzo Stoakes
Date: Mon Jan 13 2025 - 07:15:24 EST


A nit on subject, I mean this is part of what this series does, and hey -
we have only so much text to put in here - but isn't this both
reimplementing per-VMA lock as a refcount _and_ importantly allocating VMAs
using the RCU typesafe mechanism?

Do we have to do both in one series? Can we split this out? I mean maybe
that's just churny and unnecessary, but to me this series is 'allocate VMAs
RCU safe and refcount VMA lock' or something like this. Maybe this is
nitty... but still :)

One general comment here - this is a really major change in how this stuff
works, and yet I don't see any tests anywhere in the series.

I know it's tricky to write tests for this, but the new VMA testing
environment should make it possible to test a _lot_ more than we previously
could.

However due to some (*ahem*) interesting distribution of where functions
are, most notably stuff in kernel/fork.c, I guess we can't test
_everything_ there effectively.

But I do feel like we should be able to do better than having absolutely no
testing added for this?

I think there's definitely quite a bit you could test now, at least in
asserting fundamentals in tools/testing/vma/vma.c.

This can cover at least detached state asserts in various scenarios.

But that won't cover off the really gnarly stuff here around RCU slab
allocation, and determining precisely how to test that in a sensible way is
maybe less clear.

But I'd like to see _something_ here please, this is more or less
fundamentally changing how all VMAs are allocated and to just have nothing
feels unfortunate.

I'm already nervous because we've hit issues coming up to v9 and we're not
100% sure if a recent syzkaller is related to these changes or not, I'm not
sure how much we can get assurances with tests but I'd like something.

Thanks!

On Fri, Jan 10, 2025 at 08:25:47PM -0800, Suren Baghdasaryan wrote:
> Back when per-vma locks were introduces, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regressions is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].
> Splitting single logical structure into multiple ones leads to more
> complicated management, extra pointer dereferences and overall less
> maintainable code. When that split-away part is a lock, it complicates
> things even further. With no performance benefits, there are no reasons
> for this split. Merging the vm_lock back into vm_area_struct also allows
> vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> This patchset:
> 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> boundary and changing the cache to be cacheline-aligned to minimize
> cacheline sharing;
> 2. changes vm_area_struct initialization to mark new vma as detached until
> it is inserted into vma tree;
> 3. replaces vm_lock and vma->detached flag with a reference counter;
> 4. regroups vm_area_struct members to fit them into 3 cachelines;
> 5. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> reuse and to minimize call_rcu() calls.
>
> Pagefault microbenchmarks show performance improvement:
> Hmean faults/cpu-1 507926.5547 ( 0.00%) 506519.3692 * -0.28%*
> Hmean faults/cpu-4 479119.7051 ( 0.00%) 481333.6802 * 0.46%*
> Hmean faults/cpu-7 452880.2961 ( 0.00%) 455845.6211 * 0.65%*
> Hmean faults/cpu-12 347639.1021 ( 0.00%) 352004.2254 * 1.26%*
> Hmean faults/cpu-21 200061.2238 ( 0.00%) 229597.0317 * 14.76%*
> Hmean faults/cpu-30 145251.2001 ( 0.00%) 164202.5067 * 13.05%*
> Hmean faults/cpu-48 106848.4434 ( 0.00%) 120641.5504 * 12.91%*
> Hmean faults/cpu-56 92472.3835 ( 0.00%) 103464.7916 * 11.89%*
> Hmean faults/sec-1 507566.1468 ( 0.00%) 506139.0811 * -0.28%*
> Hmean faults/sec-4 1880478.2402 ( 0.00%) 1886795.6329 * 0.34%*
> Hmean faults/sec-7 3106394.3438 ( 0.00%) 3140550.7485 * 1.10%*
> Hmean faults/sec-12 4061358.4795 ( 0.00%) 4112477.0206 * 1.26%*
> Hmean faults/sec-21 3988619.1169 ( 0.00%) 4577747.1436 * 14.77%*
> Hmean faults/sec-30 3909839.5449 ( 0.00%) 4311052.2787 * 10.26%*
> Hmean faults/sec-48 4761108.4691 ( 0.00%) 5283790.5026 * 10.98%*
> Hmean faults/sec-56 4885561.4590 ( 0.00%) 5415839.4045 * 10.85%*
>
> Changes since v8 [4]:
> - Change subject for the cover letter, per Vlastimil Babka
> - Added Reviewed-by and Acked-by, per Vlastimil Babka
> - Added static check for no-limit case in __refcount_add_not_zero_limited,
> per David Laight
> - Fixed vma_refcount_put() to call rwsem_release() unconditionally,
> per Hillf Danton and Vlastimil Babka
> - Use a copy of vma->vm_mm in vma_refcount_put() in case vma is freed from
> under us, per Vlastimil Babka
> - Removed extra rcu_read_lock()/rcu_read_unlock() in vma_end_read(),
> per Vlastimil Babka
> - Changed __vma_enter_locked() parameter to centralize refcount logic,
> per Vlastimil Babka
> - Amended description in vm_lock replacement patch explaining the effects
> of the patch on vm_area_struct size, per Vlastimil Babka
> - Added vm_area_struct member regrouping patch [5] into the series
> - Renamed vma_copy() into vm_area_init_from(), per Liam R. Howlett
> - Added a comment for vm_area_struct to update vm_area_init_from() when
> adding new members, per Vlastimil Babka
> - Updated a comment about unstable src->shared.rb when copying a vma in
> vm_area_init_from(), per Vlastimil Babka
>
> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@xxxxxxxxxx/
> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@xxxxxxxxxxxxxx/
> [4] https://lore.kernel.org/all/20250109023025.2242447-1-surenb@xxxxxxxxxx/
> [5] https://lore.kernel.org/all/20241111205506.3404479-5-surenb@xxxxxxxxxx/
>
> Patchset applies over mm-unstable after reverting v8
> (current SHA range: 235b5129cb7b - 9e6b24c58985)
>
> Suren Baghdasaryan (17):
> mm: introduce vma_start_read_locked{_nested} helpers
> mm: move per-vma lock into vm_area_struct
> mm: mark vma as detached until it's added into vma tree
> mm: introduce vma_iter_store_attached() to use with attached vmas
> mm: mark vmas detached upon exit
> types: move struct rcuwait into types.h
> mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
> mm: move mmap_init_lock() out of the header file
> mm: uninline the main body of vma_start_write()
> refcount: introduce __refcount_{add|inc}_not_zero_limited
> mm: replace vm_lock and detached flag with a reference count
> mm: move lesser used vma_area_struct members into the last cacheline
> mm/debug: print vm_refcnt state when dumping the vma
> mm: remove extra vma_numab_state_init() call
> mm: prepare lock_vma_under_rcu() for vma reuse possibility
> mm: make vma cache SLAB_TYPESAFE_BY_RCU
> docs/mm: document latest changes to vm_lock
>
> Documentation/mm/process_addrs.rst | 44 ++++----
> include/linux/mm.h | 156 ++++++++++++++++++++++-------
> include/linux/mm_types.h | 75 +++++++-------
> include/linux/mmap_lock.h | 6 --
> include/linux/rcuwait.h | 13 +--
> include/linux/refcount.h | 24 ++++-
> include/linux/slab.h | 6 --
> include/linux/types.h | 12 +++
> kernel/fork.c | 129 +++++++++++-------------
> mm/debug.c | 12 +++
> mm/init-mm.c | 1 +
> mm/memory.c | 97 ++++++++++++++++--
> mm/mmap.c | 3 +-
> mm/userfaultfd.c | 32 +++---
> mm/vma.c | 23 ++---
> mm/vma.h | 15 ++-
> tools/testing/vma/linux/atomic.h | 5 +
> tools/testing/vma/vma_internal.h | 93 ++++++++---------
> 18 files changed, 465 insertions(+), 281 deletions(-)
>
> --
> 2.47.1.613.gc27f4b7a9f-goog
>