Re: [PATCH RESEND] userfaultfd: snapshot VMA state across UFFDIO_COPY retry

From: David Hildenbrand (Arm)

Date: Wed May 20 2026 - 09:48:17 EST


On 5/20/26 14:53, Mike Rapoport wrote:
> On Wed, May 20, 2026 at 01:09:04PM +0200, David Hildenbrand (Arm) wrote:
>> On 5/19/26 07:25, Mike Rapoport wrote:
>>> From: "Mike Rapoport (Microsoft)" <rppt@xxxxxxxxxx>
>>>
>>> mfill_copy_folio_retry() drops the VMA lock for copy_from_user() and
>>> reacquires it afterwards. The destination VMA can be replaced during that
>>> window.
>>>
>>> The existing check compares vma_uffd_ops() before and after the retry, but
>>> if a shmem VMA with MAP_SHARED is replaced with a shmem VMA with
>>> MAP_PRIVATE (or vice versa) the replacement goes undetected.
>>>
>>> The change from MAP_PRIVATE to MAP_SHARED will treat the folio allocated
>>> with shmem_alloc_folio() as anonymous and this will cause BUG() when
>>> mfill_atomic_install_pte() will try to folio_add_new_anon_rmap().
>>>
>>> The change from MAP_SHARED to MAP_PRIVATE allows injection of folios into
>>> the page cache of the original VMA.
>>>
>>> Introduce helpers for more comprehensive comparison of VMA state:
>>> - vma_snapshot_get() to save the relevant VMA state into a struct
>>> vma_snapshot (original uffd_ops, actual uffd_ops, relevant VMA flags,
>>> vm_file and pgoff) before dropping the lock
>>> - vma_snapshot_changed() to compare the saved state with the state of the
>>> VMA acquired after retaking the locks
>>> - vma_snapshot_put() to release vm_file pinning.
>>>
>>> Use DEFINE_FREE() cleanup to wrap vma_snapshot_put() to avoid complicating
>>> error handling paths in mfill_copy_folio_retry().
>>>
>>> Add vma_uffd_copy_ops() to avoid code duplication when original ops of
>>> shmem VMA with MAP_PRIVATE are replaced with anon_uffd_ops.
>>>
>>> Fixes: 292411fda25b ("mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()")
>>> Fixes: 6ab703034f14 ("userfaultfd: mfill_atomic(): remove retry logic")
>>> Tested-by: Heechan Kang <gganji11@xxxxxxxxx>
>>> Suggested-by: Peter Xu <peterx@xxxxxxxxxx>
>>> Co-developed-by: David Carlier <devnexen@xxxxxxxxx>
>>> Signed-off-by: David Carlier <devnexen@xxxxxxxxx>
>>> Co-developed-by: Michael Bommarito <michael.bommarito@xxxxxxxxx>
>>> Signed-off-by: Michael Bommarito <michael.bommarito@xxxxxxxxx>
>>> Signed-off-by: Mike Rapoport (Microsoft) <rppt@xxxxxxxxxx>
>>> ---
>>> mm/userfaultfd.c | 99 ++++++++++++++++++++++++++++++++++++++----------
>>> 1 file changed, 79 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
>>> index 180bad42fc79..b70b84776a79 100644
>>> --- a/mm/userfaultfd.c
>>> +++ b/mm/userfaultfd.c
>>> @@ -14,6 +14,8 @@
>>> #include <linux/userfaultfd_k.h>
>>> #include <linux/mmu_notifier.h>
>>> #include <linux/hugetlb.h>
>>> +#include <linux/file.h>
>>> +#include <linux/cleanup.h>
>>> #include <asm/tlbflush.h>
>>> #include <asm/tlb.h>
>>> #include "internal.h"
>>> @@ -69,6 +71,24 @@ static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
>>> return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
>>> }
>>>
>>> +static const struct vm_uffd_ops *vma_uffd_copy_ops(struct vm_area_struct *vma)
>>> +{
>>> + const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
>>> +
>>> + if (!ops)
>>> + return NULL;
>>> +
>>> + /*
>>> + * UFFDIO_COPY fills MAP_PRIVATE file-backed mappings as anonymous
>>> + * memory. This is an effective ops override, so retry validation must
>>> + * compare the override result, not just vma->vm_ops->uffd_ops.
>>> + */
>>> + if (!(vma->vm_flags & VM_SHARED))
>>> + return &anon_uffd_ops;
>>> +
>>> + return ops;
>>> +}
>>> +
>>> static __always_inline
>>> bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
>>> {
>>> @@ -443,14 +463,70 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
>>> return ret;
>>> }
>>>
>>> +#define VMA_SNAPSHOT_FLAGS append_vma_flags(__VMA_UFFD_FLAGS, VMA_SHARED_BIT)
>>> +
>>> +struct vma_snapshot {
>>> + const struct vm_uffd_ops *copy_ops;
>>> + const struct vm_uffd_ops *ops;
>>> + struct file *file;
>>> + vma_flags_t flags;
>>> + pgoff_t pgoff;
>>> +};
>>
>> As this is all uffd specific, I wonder whether that should be "struct
>> uffd_vma_snapshot"/"vma_uffd_snapshot" etc.
>
> It's local to this file, I wouldn't worry about namespacing until there's
> an actual need if it will ever arise.

Indeed, fair enough.

>
>> From a high level, this LGTM.
>>
>> I wish we could identify relevant VMA changes more easily. Like, using a per-MM
>> sequence counter that we simply increment on any VMA changes.
>
> Do you mean per-VMA?
> Per-MM counter would capture unrelated changes, e.g. an masvise() for
> unrelated range.

I was thinking about a per-MM thing for simplicity. If there were any changes,
we'd retry (-EAGAIN).

IOW, something like &mm->mm_lock_seq, which we have for per-vma locks already.

Not sure if unrelated changes would really be a problem in practice (another
thread gabbing the mmap_lock in write mode until we serviced the fault).

>
> We kinda have an infrastructure to detect VMA changes that affect uffd
> operation for non-cooperative, we can also hook on that even if
> UFFD_EVENT_* are not requested, but that's way more involved that this VMA
> snapshot.

Yeah, that sounds too complicated and what you have here is simpler.

--
Cheers,

David