Re: [PATCH v3] userfaultfd/shmem: fix MCOPY_ATOMIC_CONTNUE behavior

From: Axel Rasmussen
Date: Tue Mar 30 2021 - 19:31:56 EST


On Tue, Mar 30, 2021 at 1:55 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Mon, Mar 29, 2021 at 04:41:31PM -0700, Axel Rasmussen wrote:
> > Previously, we shared too much of the code with COPY and ZEROPAGE, so we
> > manipulated things in various invalid ways:
> >
> > - Previously, we unconditionally called shmem_inode_acct_block. In the
> > continue case, we're looking up an existing page which would have been
> > accounted for properly when it was allocated. So doing it twice
> > results in double-counting, and eventually leaking.
> >
> > - Previously, we made the pte writable whenever the VMA was writable.
> > However, for continue, consider this case:
> >
> > 1. A tmpfs file was created
> > 2. The non-UFFD-registered side mmap()-s with MAP_SHARED
> > 3. The UFFD-registered side mmap()-s with MAP_PRIVATE
> >
> > In this case, even though the UFFD-registered VMA may be writable, we
> > still want CoW behavior. So, check for this case and don't make the
> > pte writable.
> >
> > - The initial pgoff / max_off check isn't necessary, so we can skip past
> > it. The second one seems likely to be unnecessary too, but keep it
> > just in case. Modify both checks to use pgoff, as offset is equivalent
> > and not needed.
> >
> > - Previously, we unconditionally called ClearPageDirty() in the error
> > path. In the continue case though, since this is an existing page, it
> > might have already been dirty before we started touching it. It's very
> > problematic to clear the bit incorrectly, but not a problem to leave
> > it - so, just omit the ClearPageDirty() entirely.
> >
> > - Previously, we unconditionally removed the page from the page cache in
> > the error path. But in the continue case, we didn't add it - it was
> > already there because the page is present in some second
> > (non-UFFD-registered) mapping. So, removing it is invalid.
> >
> > Because the error handling issues are easy to exercise in the selftest,
> > make a small modification there to do so.
> >
> > Finally, refactor shmem_mcopy_atomic_pte a bit. By this point, we've
> > added a lot of "if (!is_continue)"-s everywhere. It's cleaner to just
> > check for that mode first thing, and then "goto" down to where the parts
> > we actually want are. This leaves the code in between cleaner.
> >
> > Changes since v2:
> > - Drop the ClearPageDirty() entirely, instead of trying to remember the
> > old value.
> > - Modify both pgoff / max_off checks to use pgoff. It's equivalent to
> > offset, but offset wasn't initialized until the first check (which
> > we're skipping).
> > - Keep the second pgoff / max_off check in the continue case.
> >
> > Changes since v1:
> > - Refactor to skip ahead with goto, instead of adding several more
> > "if (!is_continue)".
> > - Fix unconditional ClearPageDirty().
> > - Don't pte_mkwrite() when is_continue && !VM_SHARED.
> >
> > Fixes: 00da60b9d0a0 ("userfaultfd: support minor fault handling for shmem")
> > Signed-off-by: Axel Rasmussen <axelrasmussen@xxxxxxxxxx>
> > ---
> > mm/shmem.c | 60 +++++++++++++-----------
> > tools/testing/selftests/vm/userfaultfd.c | 12 +++++
> > 2 files changed, 44 insertions(+), 28 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index d2e0e81b7d2e..fbcce850a16e 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2377,18 +2377,22 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> > struct page *page;
> > pte_t _dst_pte, *dst_pte;
> > int ret;
> > - pgoff_t offset, max_off;
> > -
> > - ret = -ENOMEM;
> > - if (!shmem_inode_acct_block(inode, 1))
> > - goto out;
> > + pgoff_t max_off;
> > + int writable;
>
> Nit: can be bool.
>
> [...]
>
> > +install_ptes:
> > _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> > - if (dst_vma->vm_flags & VM_WRITE)
> > + /* For CONTINUE on a non-shared VMA, don't pte_mkwrite for CoW. */
> > + writable = is_continue && !(dst_vma->vm_flags & VM_SHARED)
> > + ? 0
> > + : dst_vma->vm_flags & VM_WRITE;
>
> Nit: this code is slightly hard to read.. I'd slightly prefer "if
> (is_continue)...". But more below.
>
> > + if (writable)
> > _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
> > else {
> > /*
> > @@ -2455,7 +2458,7 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> >
> > ret = -EFAULT;
> > max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> > - if (unlikely(offset >= max_off))
> > + if (unlikely(pgoff >= max_off))
> > goto out_release_unlock;
> >
> > ret = -EEXIST;
> > @@ -2485,13 +2488,14 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> > return ret;
> > out_release_unlock:
> > pte_unmap_unlock(dst_pte, ptl);
> > - ClearPageDirty(page);
> > - delete_from_page_cache(page);
> > + if (!is_continue)
> > + delete_from_page_cache(page);
> > out_release:
> > unlock_page(page);
> > put_page(page);
> > out_unacct_blocks:
> > - shmem_inode_unacct_blocks(inode, 1);
> > + if (!is_continue)
> > + shmem_inode_unacct_blocks(inode, 1);
>
> If you see we still have tons of "if (!is_continue)". Those are the places
> error prone.. even if not in this patch, could be in the patch when this
> function got changed again.
>
> Sorry to say this a bit late: how about introduce a helper to install the pte?

No worries. :)

> Pesudo code:
>
> int shmem_install_uffd_pte(..., bool writable)
> {
> ...
> _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> if (dst_vma->vm_flags & VM_WRITE)
> _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
> else
> set_page_dirty(page);
>
> dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
> if (!pte_none(*dst_pte)) {
> pte_unmap_unlock(dst_pte, ptl);
> return -EEXIST;
> }
>
> inc_mm_counter(dst_mm, mm_counter_file(page));
> page_add_file_rmap(page, false);
> set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
>
> /* No need to invalidate - it was non-present before */
> update_mmu_cache(dst_vma, dst_addr, dst_pte);
> pte_unmap_unlock(dst_pte, ptl);
> return 0;
> }
>
> Then at the entry of shmem_mcopy_atomic_pte():
>
> if (is_continue) {
> page = find_lock_page(mapping, pgoff);
> if (!page)
> return -EFAULT;
> ret = shmem_install_uffd_pte(...,
> is_continue && !(dst_vma->vm_flags & VM_SHARED));
> unlock_page(page);
> if (ret)
> put_page(page);
> return ret;
> }
>
> Do you think this would be cleaner?

Yes, a refactor like that is promising. It's hard to say for certain
without actually looking at the result - I'll spend some time tomorrow
on a few options, and send along the cleanest version I come up with.

Thanks for all the feedback and advice on this feature, Peter!

>
> --
> Peter Xu
>