Re: [PATCH 3/4] mm: replace rw_semaphore with atomic_t in vma_lock

From: Suren Baghdasaryan
Date: Thu Dec 12 2024 - 09:18:04 EST


On Thu, Dec 12, 2024 at 1:17 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Wed, Dec 11, 2024 at 07:01:16PM -0800, Suren Baghdasaryan wrote:
>
> > > > > I think your proposal should work. Let me try to code it and see if
> > > > > something breaks.
> >
> > Ok, I tried it out and things are a bit more complex:
> > 1. We should allow write-locking a detached VMA, IOW vma_start_write()
> > can be called when vm_refcnt is 0.
>
> This sounds dodgy, refcnt being zero basically means the object is dead
> and you shouldn't be touching it no more. Where does this happen and
> why?
>
> Notably, it being 0 means it is no longer in the mas tree and can't be
> found anymore.

It happens when a newly created vma that was not yet attached
(vma->vm_refcnt = 0) is write-locked before being added into the vma
tree. For example:
mmap()
mmap_write_lock()
vma = vm_area_alloc() // vma->vm_refcnt = 0 (detached)
//vma attributes are initialized
vma_start_write() // write 0x8000 0001 into vma->vm_refcnt
mas_store_gfp()
vma_mark_attached()
mmap_write_lock() // vma_end_write_all()

In this scenario, we write-lock the VMA before adding it into the tree
to prevent readers (pagefaults) from using it until we drop the
mmap_write_lock(). In your proposal, the first thing vma_start_write()
does is add(0x8000'0001) and that will trigger a warning.
For now instead of add(0x8000'0001) I can play this game to avoid the warning:

if (refcount_inc_not_zero(&vma->vm_refcnt))
refcount_add(0x80000000, &vma->vm_refcnt);
else
refcount_set(&vma->vm_refcnt, 0x80000001);

this refcount_set() works because vma with vm_refcnt==0 could not be
found by readers. I'm not sure this will still work when we add
TYPESAFE_BY_RCU and introduce vma reuse possibility.

>
> > 2. Adding 0x80000000 saturates refcnt, so I have to use a lower bit
> > 0x40000000 to denote writers.
>
> I'm confused, what? We're talking about atomic_t, right?

I thought you suggested using refcount_t. According to
https://elixir.bootlin.com/linux/v6.13-rc2/source/include/linux/refcount.h#L22
valid values would be [0..0x7fff_ffff] and 0x80000000 is outside of
that range. What am I missing?

>
> > 3. Currently vma_mark_attached() can be called on an already attached
> > VMA. With vma->detached being a separate attribute that works fine but
> > when we combine it with the vm_lock things break (extra attach would
> > leak into lock count). I'll see if I can catch all the cases when we
> > do this and clean them up (not call vma_mark_attached() when not
> > necessary).
>
> Right, I hadn't looked at that thing in detail, that sounds like it
> needs a wee cleanup like you suggest.

Yes, I'll embark on that today. Will see how much of a problem that is.