Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
From: Michal Hocko
Date: Wed Jan 18 2023 - 16:28:27 EST
On Wed 18-01-23 09:36:44, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 7:11 AM 'Michal Hocko' via kernel-team
> <kernel-team@xxxxxxxxxxx> wrote:
> >
> > On Wed 18-01-23 14:23:32, Jann Horn wrote:
> > > On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > > > On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > > > > +locking maintainers
> > > > >
> > > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> > > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > > > locked.
> > > > > [...]
> > > > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > > +{
> > > > > > + up_read(&vma->lock);
> > > > > > +}
> > > > >
> > > > > One thing that might be gnarly here is that I think you might not be
> > > > > allowed to use up_read() to fully release ownership of an object -
> > > > > from what I remember, I think that up_read() (unlike something like
> > > > > spin_unlock()) can access the lock object after it's already been
> > > > > acquired by someone else.
> > > >
> > > > Yes, I think you are right. From a look into the code it seems that
> > > > the UAF is quite unlikely as there is a ton of work to be done between
> > > > vma_write_lock used to prepare vma for removal and actual removal.
> > > > That doesn't make it less of a problem though.
> > > >
> > > > > So if you want to protect against concurrent
> > > > > deletion, this might have to be something like:
> > > > >
> > > > > rcu_read_lock(); /* keeps vma alive */
> > > > > up_read(&vma->lock);
> > > > > rcu_read_unlock();
> > > > >
> > > > > But I'm not entirely sure about that, the locking folks might know better.
> > > >
> > > > I am not a locking expert but to me it looks like this should work
> > > > because the final cleanup would have to happen rcu_read_unlock.
> > > >
> > > > Thanks, I have completely missed this aspect of the locking when looking
> > > > into the code.
> > > >
> > > > Btw. looking at this again I have fully realized how hard it is actually
> > > > to see that vm_area_free is guaranteed to sync up with ongoing readers.
> > > > vma manipulation functions like __adjust_vma make my head spin. Would it
> > > > make more sense to have a rcu style synchronization point in
> > > > vm_area_free directly before call_rcu? This would add an overhead of
> > > > uncontended down_write of course.
> > >
> > > Something along those lines might be a good idea, but I think that
> > > rather than synchronizing the removal, it should maybe be something
> > > that splats (and bails out?) if it detects pending readers. If we get
> > > to vm_area_free() on a VMA that has pending readers, we might already
> > > be in a lot of trouble because the concurrent readers might have been
> > > traversing page tables while we were tearing them down or fun stuff
> > > like that.
> > >
> > > I think maybe Suren was already talking about something like that in
> > > another part of this patch series but I don't remember...
> >
> > This http://lkml.kernel.org/r/20230109205336.3665937-27-surenb@xxxxxxxxxx?
>
> Yes, I spent a lot of time ensuring that __adjust_vma locks the right
> VMAs and that VMAs are freed or isolated under VMA write lock
> protection to exclude any readers. If the VM_BUG_ON_VMA in the patch
> Michal mentioned gets hit then it's a bug in my design and I'll have
> to fix it. But please, let's not add synchronize_rcu() in the
> vm_area_free().
Just to clarify. I didn't suggest to add synchronize_rcu into
vm_area_free. What I really meant was synchronize_rcu like primitive to
effectivelly synchronize with any potential pending read locker (so
something like vma_write_lock (or whatever it is called). The point is
that vma freeing is an event all readers should be notified about.
This can be done explicitly for each and every vma before vm_area_free
is called but this is just hard to review and easy to break over time.
See my point?
--
Michal Hocko
SUSE Labs