Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
From: Jann Horn
Date: Tue May 28 2019 - 19:20:18 EST
On Wed, May 29, 2019 at 12:28 AM David Howells <dhowells@xxxxxxxxxx> wrote:
> Jann Horn <jannh@xxxxxxxxxx> wrote:
> > I don't see you setting any special properties on the VMA that would
> > prevent userspace from extending its size via mremap() - no
> > VM_DONTEXPAND or VM_PFNMAP. So I think you might get an out-of-bounds
> > access here?
>
> Should I just set VM_DONTEXPAND in watch_queue_mmap()? Like so:
>
> vma->vm_flags |= VM_DONTEXPAND;
Yeah, that should work.
> > > +static void watch_queue_map_pages(struct vm_fault *vmf,
> > > + pgoff_t start_pgoff, pgoff_t end_pgoff)
> > ...
> >
> > Same as above.
>
> Same solution as above? Or do I need ot check start/end_pgoff too?
Same solution.
> > > +static int watch_queue_mmap(struct file *file, struct vm_area_struct *vma)
> > > +{
> > > + struct watch_queue *wqueue = file->private_data;
> > > +
> > > + if (vma->vm_pgoff != 0 ||
> > > + vma->vm_end - vma->vm_start > wqueue->nr_pages * PAGE_SIZE ||
> > > + !(pgprot_val(vma->vm_page_prot) & pgprot_val(PAGE_SHARED)))
> > > + return -EINVAL;
> >
> > This thing should probably have locking against concurrent
> > watch_queue_set_size()?
>
> Yeah.
>
> static int watch_queue_mmap(struct file *file,
> struct vm_area_struct *vma)
> {
> struct watch_queue *wqueue = file->private_data;
> struct inode *inode = file_inode(file);
> u8 nr_pages;
>
> inode_lock(inode);
> nr_pages = wqueue->nr_pages;
> inode_unlock(inode);
>
> if (nr_pages == 0 ||
> ...
> return -EINVAL;
Looks reasonable.
> > > + smp_store_release(&buf->meta.head, len);
> >
> > Why is this an smp_store_release()? The entire buffer should not be visible to
> > userspace before this setup is complete, right?
>
> Yes - if I put the above locking in the mmap handler. OTOH, it's a one-off
> barrier...
>
> > > + if (wqueue->buffer)
> > > + return -EBUSY;
> >
> > The preceding check occurs without any locks held and therefore has no
> > reliable effect. It should probably be moved below the
> > inode_lock(...).
>
> Yeah, it can race. I'll move it into watch_queue_set_size().
Sounds good.
> > > +static void free_watch(struct rcu_head *rcu)
> > > +{
> > > + struct watch *watch = container_of(rcu, struct watch, rcu);
> > > +
> > > + put_watch_queue(rcu_access_pointer(watch->queue));
> >
> > This should be rcu_dereference_protected(..., 1).
>
> That shouldn't be necessary. rcu_access_pointer()'s description says:
>
> * It is also permissible to use rcu_access_pointer() when read-side
> * access to the pointer was removed at least one grace period ago, as
> * is the case in the context of the RCU callback that is freeing up
> * the data, ...
>
> It's in an rcu callback function, so accessing the __rcu pointers in the RCU'd
> struct should be fine with rcu_access_pointer().
Aah, whoops, you're right, I missed that paragraph in the
documentation of rcu_access_pointer().
> > > + /* We don't need the watch list lock for the next bit as RCU is
> > > + * protecting everything from being deallocated.
> >
> > Does "everything" mean "the wqueue" or more than that?
>
> Actually, just 'wqueue' and its buffer. 'watch' is held by us once we've
> dequeued it as we now own the ref 'wlist' had on it. 'wlist' and 'wq' must be
> pinned by the caller.
>
> > > + if (release) {
> > > + if (wlist->release_watch) {
> > > + rcu_read_unlock();
> > > + /* This might need to call dput(), so
> > > + * we have to drop all the locks.
> > > + */
> > > + wlist->release_watch(wlist, watch);
> >
> > How are you holding a reference to `wlist` here? You got the reference through
> > rcu_dereference(), you've dropped the RCU read lock, and I don't see anything
> > that stabilizes the reference.
>
> The watch record must hold a ref on the watched object if the watch_list has a
> ->release_watch() method. In the code snippet above, the watch record now
> belongs to us because we unlinked it under the wlist->lock some lines prior.
Ah, of course.
> However, you raise a good point, and I think the thing to do is to cache
> ->release_watch from it and not pass wlist into (*release_watch)(). We don't
> need to concern ourselves with cleaning up *wlist as it will be cleaned up
> when the target object is removed.
>
> Keyrings don't have a ->release_watch method and neither does the block-layer
> notification stuff.
>
> > > + if (wqueue->pages && wqueue->pages[0])
> > > + WARN_ON(page_ref_count(wqueue->pages[0]) != 1);
> >
> > Is there a reason why there couldn't still be references to the pages
> > from get_user_pages()/get_user_pages_fast()?
>
> I'm not sure. I'm not sure what to do if there are. What do you suggest?
I would use put_page() instead of manually freeing it; I think that
should be enough? I'm not entirely sure though.
> > > + n->info &= (WATCH_INFO_LENGTH | WATCH_INFO_TYPE_FLAGS | WATCH_INFO_ID);
> >
> > Should the non-atomic modification of n->info
>
> n's an unpublished copy of some userspace data that's local to the function
> instance. There shouldn't be any way to race with it at this point.
Ah, derp. Yes.