Re: [PATCH v8 11/16] mm: replace vm_lock and detached flag with a reference count

From: Vlastimil Babka
Date: Fri Jan 10 2025 - 09:33:42 EST


On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> rw_semaphore is a sizable structure of 40 bytes and consumes
> considerable space for each vm_area_struct. However vma_lock has
> two important specifics which can be used to replace rw_semaphore
> with a simpler structure:
> 1. Readers never wait. They try to take the vma_lock and fall back to
> mmap_lock if that fails.
> 2. Only one writer at a time will ever try to write-lock a vma_lock
> because writers first take mmap_lock in write mode.
> Because of these requirements, full rw_semaphore functionality is not
> needed and we can replace rw_semaphore and the vma->detached flag with
> a refcount (vm_refcnt).
> When vma is in detached state, vm_refcnt is 0 and only a call to
> vma_mark_attached() can take it out of this state. Note that unlike
> before, now we enforce both vma_mark_attached() and vma_mark_detached()
> to be done only after vma has been write-locked. vma_mark_attached()
> changes vm_refcnt to 1 to indicate that it has been attached to the vma
> tree. When a reader takes read lock, it increments vm_refcnt, unless the
> top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
> a writer. When writer takes write lock, it sets the top usable bit to
> indicate its presence. If there are readers, writer will wait using newly
> introduced mm->vma_writer_wait. Since all writers take mmap_lock in write
> mode first, there can be only one writer at a time. The last reader to
> release the lock will signal the writer to wake up.
> refcount might overflow if there are many competing readers, in which case
> read-locking will fail. Readers are expected to handle such failures.
> In summary:
> 1. all readers increment the vm_refcnt;
> 2. writer sets top usable (writer) bit of vm_refcnt;
> 3. readers cannot increment the vm_refcnt if the writer bit is set;
> 4. in the presence of readers, writer must wait for the vm_refcnt to drop
> to 1 (ignoring the writer bit), indicating an attached vma with no readers;
> 5. vm_refcnt overflow is handled by the readers.
>
> Suggested-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Suggested-by: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
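
To check my reading of the summary above, here is a minimal userspace sketch of the reader side using C11 atomics. It is only a model, not the patch's code: model_read_trylock(), model_read_unlock() and MODEL_REF_LIMIT are hypothetical names, and putting the overflow limit at one below the writer bit is my assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define VMA_LOCK_OFFSET 0x40000000u            /* writer bit, from the changelog */
#define MODEL_REF_LIMIT (VMA_LOCK_OFFSET - 1)  /* assumed overflow limit */

/* Reader side: take a reference unless detached, write-locked or saturated. */
static bool model_read_trylock(atomic_uint *refcnt)
{
	unsigned int old = atomic_load_explicit(refcnt, memory_order_relaxed);

	do {
		/* 0 means detached; >= limit means writer present or overflow. */
		if (old == 0 || old >= MODEL_REF_LIMIT)
			return false;	/* caller falls back to mmap_lock */
	} while (!atomic_compare_exchange_weak_explicit(refcnt, &old, old + 1,
							memory_order_acquire,
							memory_order_relaxed));
	return true;
}

/* Reader side: drop the reference taken by model_read_trylock(). */
static void model_read_unlock(atomic_uint *refcnt)
{
	unsigned int old = atomic_fetch_sub_explicit(refcnt, 1,
						     memory_order_release);

	/* Ignoring the writer bit, a held read lock implies a count >= 2. */
	assert((old & ~VMA_LOCK_OFFSET) > 1);
}
```

The model leaves out the writer wakeup and all lockdep annotations; it only shows rules 1, 3 and 5 of the summary.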

Reviewed-by: Vlastimil Babka <vbabka@xxxxxxx>

But I think there's a problem that will manifest after patch 15.
I also don't feel qualified enough to judge the lockdep parts (although I
think I spotted another issue with those, below), so it would be best if
PeterZ could review those.
Some nits below, too.

> +
> +static inline void vma_refcount_put(struct vm_area_struct *vma)
> +{
> +	int oldcnt;
> +
> +	if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
> +		rwsem_release(&vma->vmlock_dep_map, _RET_IP_);

Shouldn't we call rwsem_release() unconditionally? And shouldn't it also
precede the refcount operation itself?

> +		if (is_vma_writer_only(oldcnt - 1))
> +			rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);

Hmm, maybe we should read the vm_mm pointer before dropping the
refcount? This could race such that is_vma_writer_only() tests true, but
the writer meanwhile finishes and frees the vma. It's safe now, but not
after making the cache SLAB_TYPESAFE_BY_RCU?
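
Roughly what I have in mind, as a userspace sketch with C11 atomics rather than actual kernel code. The mm_model/vma_model structs and the model_* helpers are hypothetical stand-ins, the wakeup is just a counter, and the writer-only test is my assumption of what is_vma_writer_only() checks:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define VMA_LOCK_OFFSET 0x40000000	/* writer bit, from the changelog */

struct mm_model { int wakeups; };	/* stands in for mm->vma_writer_wait */
struct vma_model {
	atomic_int vm_refcnt;
	struct mm_model *vm_mm;
};

/* Assumed semantics: writer bit set and at most the "attached" reference left. */
static bool model_writer_only(int refcnt)
{
	return (refcnt & VMA_LOCK_OFFSET) && refcnt <= VMA_LOCK_OFFSET + 1;
}

static void model_refcount_put(struct vma_model *vma)
{
	/* Read vm_mm while our reference still pins the vma. */
	struct mm_model *mm = vma->vm_mm;
	int oldcnt = atomic_fetch_sub_explicit(&vma->vm_refcnt, 1,
					       memory_order_release);

	assert(oldcnt > 0);
	/* From here on, vma itself is not dereferenced anymore. */
	if (oldcnt - 1 != 0 && model_writer_only(oldcnt - 1))
		mm->wakeups++;		/* stands in for rcuwait_wake_up() */
}
```

The only point is the ordering: nothing after the decrement touches vma, so it no longer matters if the object gets freed or reused in the meantime.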

> +	}
> +}
> +

> static inline void vma_end_read(struct vm_area_struct *vma)
> {
> rcu_read_lock(); /* keeps vma alive till the end of up_read */

This comment should refer to vma_refcount_put(), not up_read(). But after
fixing the issue above, I think we could stop taking rcu_read_lock() here
altogether? With SLAB_TYPESAFE_BY_RCU it will no longer keep the vma
"alive".

> -	up_read(&vma->vm_lock.lock);
> +	vma_refcount_put(vma);
> rcu_read_unlock();
> }
>

<snip>

> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6370,9 +6370,41 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
> #endif
>
> #ifdef CONFIG_PER_VMA_LOCK
> +static inline bool __vma_enter_locked(struct vm_area_struct *vma, unsigned int tgt_refcnt)
> +{
> +	/*
> +	 * If vma is detached then only vma_mark_attached() can raise the
> +	 * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
> +	 */
> +	if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
> +		return false;
> +
> +	rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> +	rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
> +			   refcount_read(&vma->vm_refcnt) == tgt_refcnt,
> +			   TASK_UNINTERRUPTIBLE);
> +	lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> +
> +	return true;
> +}
> +
> +static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
> +{
> +	*detached = refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt);
> +	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> +}
> +
> void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
> {
> -	down_write(&vma->vm_lock.lock);
> +	bool locked;
> +
> +	/*
> +	 * __vma_enter_locked() returns false immediately if the vma is not
> +	 * attached, otherwise it waits until refcnt is (VMA_LOCK_OFFSET + 1)
> +	 * indicating that vma is attached with no readers.
> +	 */
> +	locked = __vma_enter_locked(vma, VMA_LOCK_OFFSET + 1);

I wonder if it would be slightly better if tgt_refcnt were just 1 (or 0
below in vma_mark_detached()), with VMA_LOCK_OFFSET added to it in
__vma_enter_locked() itself, as that's the one adding it in the first place.
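
Something like the following userspace sketch (C11 atomics, not kernel code). model_enter_locked() is a hypothetical stand-in for __vma_enter_locked(), the sched_yield() loop replaces rcuwait_wait_event(), and lockdep is left out entirely:

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define VMA_LOCK_OFFSET 0x40000000u	/* writer bit, from the changelog */

/*
 * tgt_refcnt is the count to wait for, ignoring the writer bit:
 * 1 for write-locking an attached vma, 0 when detaching.
 */
static bool model_enter_locked(atomic_uint *refcnt, unsigned int tgt_refcnt)
{
	unsigned int old = atomic_load(refcnt);

	assert(tgt_refcnt <= 1);	/* callers pass only 0 or 1 in this model */

	/* Add the writer bit unless the count is 0 (detached). */
	do {
		if (old == 0)
			return false;
	} while (!atomic_compare_exchange_weak(refcnt, &old,
					       old + VMA_LOCK_OFFSET));

	/* The helper that set the bit is also the one that accounts for it. */
	tgt_refcnt += VMA_LOCK_OFFSET;
	while (atomic_load(refcnt) != tgt_refcnt)
		sched_yield();		/* stand-in for waiting on readers */

	return true;
}
```

The callers would then pass just 1 or 0, which I find easier to read against the comment describing what the count means.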