Re: [PATCH] vfio/type1: Retry follow_pfnmap_start() when PFNMAP is zapped

From: Alex Williamson

Date: Thu Mar 19 2026 - 10:35:32 EST

On Thu, 19 Mar 2026 14:18:49 +0100
"David Hildenbrand (Arm)" <david@xxxxxxxxxx> wrote:

> On 3/19/26 09:36, Boone, Max wrote:
> >
> >
> >> On Mar 18, 2026, at 10:22 PM, Alex Williamson <alex@xxxxxxxxxxx> wrote:
> >>
> >> […]
> >>
> >>> + /*
> >>> + * follow_pfnmap_start() returns -EINVAL for
> >>> + * invalid parameters and non-present entries.
> >>> + * If that happens here after a successful
> >>> + * fixup_user_fault(), it is likely that the
> >>> + * pfnmap has been zapped. Retry instead of
> >>> + * failing.
> >>> + */
> >>
> >> It's a little stronger than that, right? We're betting that the only
> >> remaining non-zero return is due to a race and we can introduce what
> >> appears to be potential for an infinite loop here because -EAGAIN will
> >> get kicked out to redo the vma_lookup() and fixup_user_fault() should
> >> return a genuine error if we're completely in the weeds. Should we
> >> make this a little stronger and more specific? Thanks,
> >
> > I’d say that the best case would be to have follow_pfnmap_start() return
> > -EINVAL or -ENOENT w.r.t. which of the two return values it is. But then
> > again, we could theoretically run into an infinite loop I guess - as the zap
> > and faulting could run in lockstep (the race window is extremely small
> > though).
>
> Well, in theory :) To hit that race repeatedly, you'd really have to be
> quite lucky I guess.
>
> But the real question is: if user space triggered the pinning, and user
> space keeps hurting itself to make progress, is that a real problem?

I'd say no, that's not a problem. It's really just that masking all
non-zero returns as -EAGAIN makes some assumptions about the other
error conditions that could change over time, so minimally those
dependencies should be clearly stated. Even better would be if we
didn't need to make those assumptions. Thanks,

Alex

> I guess the crucial part would be to
>
> a) Have some cond_resched(() in there?
> b) Checking for fatal signals somewhere?
> c) Possibly drop locks (mmap lock?) every now and then?
>
> For GUP, a) and b) are in place in __get_user_pages().
>
> c) might be done, but I think it's less deterministic.
>
> >
> > We could make the retry above bounded, and bubble up a -EBUSY such
> > that users of the ioctl can decide to retry instead of fail?
>
> Would that be a possible ABI break? You'd really have to only do that in
> a case where user space does stupid things, I guess.
>
> >
> > David, you mentioned that gup already has retry logic that we don’t have
> > with follow_fault_pfn() -> follow_pfnmap_start(). Would we potentially run
> > into an infinite loop with this change?
>
> GUP triggers page faults through faultin_page(). If handle_mm_fault()
> returns
>
> * VM_FAULT_COMPLETED we return -EAGAIN
> * VM_FAULT_ERROR we return the error
> * VM_FAULT_RETRY we return -EBUSY
> * Otherwise 0
>
> In the caller __get_user_pages(), we
> * Retry immediately with ret == 0
> * Return to the GUP caller (letting it retry) with -EBUSY/-EAGAIN
>
> Having at least a) and b) sounds reasonable. Not sure about having c),
> might be tricky if we are not allowed to drop the lock.
>