Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings

From: Andrea Arcangeli
Date: Thu Feb 16 2017 - 13:41:19 EST

On Wed, Feb 15, 2017 at 01:46:50PM -0800, Mike Kravetz wrote:
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d0d1d08..41f6c51 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4029,6 +4029,18 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> __SetPageUptodate(page);
> set_page_huge_active(page);
> + /*
> + * If shared, add to page cache
> + */
> + if (dst_vma->vm_flags & VM_SHARED) {

Minor nitpick, this could be a:

int vm_shared = dst_vma->vm_flags & VM_SHARED;

(int faster than bool here as VM_SHARED won't have to be converted into 0|1)

> @@ -386,7 +413,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
> goto out_unlock;
> err = -EINVAL;
> - if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED)
> + if (!vma_is_shmem(dst_vma) && !is_vm_hugetlb_page(dst_vma) &&
> + dst_vma->vm_flags & VM_SHARED)
> goto out_unlock;

Other minor nitpick, this could have been faster as:

if (vma_is_anonymous(dst_vma) && dst_vma->vm_flags & VM_SHARED)

Thinking twice, the only case we need to rule out is shmem_zero_setup
(it's not anon vmas can be really VM_SHARED or they wouldn't be anon
vmas in the first place) so even the above is superfluous because
shmem_zero_setup does:

vma->vm_ops = &shmem_vm_ops;

So I would turn it into:

* shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but
* it will overwrite vm_ops, so vma_is_anonymous must return false.
if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) && dst_vma->vm_flags & VM_SHARED))
goto out_unlock;

Reviewed-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>

Changing topic: thinking about the last part, I was wondering about
vma_is_anonymous because it's well known issue, it's only guaranteed
fully correct during page faults because the weird cases that can
return a false positive cannot generate page faults and they run
remap_pfn_range before releasing the mmap_sem for writing within
mmap(2) itself.

More than a year ago I discussed with Kirill (CC'ed) the
vma_is_anonymous() false positives: some VM_IO vma leaves vma->vm_ops
NULL, which is generally non concerning because putting an anon page
in there shouldn't have any adverse side effects for the rest of the

It was critical for khugepaged, because khugepaged will run out of the
app own control, so khugepaged must be fully accurate of it'll just
activate on VM_IO special mappings, but that's definitely not the case
for userfaultfd.

Still I was wondering if we could be more strict by adding a
vma->vm_flags & VM_NO_KHUGEPAGED check so that VM_IO regions will not
pass registration (only in fs/userfaultfd.c:vma_can_userfault;
mm/userfaultfd.c doesn't need any of that as it requires registration
to be passed first and vm_userfaultfd_ctx already armed).

A fully accurate vma_is_anonymous could be implemented like this:

vma_is_anonymous && !(vm_flags & VM_NO_KHUGEPAGED)

Which is what khugepaged internally uses.

This will also work for MAP_PRIVATE on /dev/zero (required to work by
the non cooperative case).

Side note: MAP_PRIVATE /dev/zero is very special and it's the only
case in the kernel a "true" anon vma has vm_file set. I think it'd be
cleaner to "decouple" the vma from /dev/zero the same way
MAP_SHARED|MAP_ANON "couples" the vma to a pseudo /dev/zero, so anon
vmas would always have vm_file NULL (not done because it'd risk to
break stuff by changing the /proc/pid/maps output), but that's besides
the point and the above solution deployed already by khugepaged
already works with the current /dev/zero MAP_PRIVATE code.

Kirill what's your take on making the registration checks stricter?
If we would add a vma_is_anonymous_not_in_fault implemented like above
vma_can_userfault would just need a
s/vma_is_anonymous/vma_is_anonymous_not_in_fault/ and it would be more
strict. khugepaged could be then converted to use it too instead of
hardcoding this vm_flags check. Unless I'm mistaken I would consider
such a change to the registration code, purely a cleanup to add a more
strict check.

Alternatively we could just leave things as is and depend on apps
using VM_IO with remap_pfn_range with vm_file set and vm_ops NULL, not
to screw themselves up by filling their regions with anon pages. I
don't see how it could break anything (except themselves) if they do,
but I'd appreciate a second thought on that. And at least for the
remap_pfn_range users it won't even get that far, because a graceful
-EEXIST would be returned instead. The change would effectively
convert a -EEXIST returned later into a stricter -EINVAL returned
early at registration time, for a case that is erratic by design.