[RFC PATCH] userfaultfd: Address race after fault.

From: Brian Geffon
Date: Fri Feb 14 2020 - 17:59:14 EST


The faulting path is:
do_user_addr_fault (flags = FAULT_FLAG_ALLOW_RETRY)
handle_mm_fault
__handle_mm_fault
do_anonymous
handle_userfault (ret = VM_FAULT_RETRY)

At this point the fault is handled and when this call sequence unwinds
it is expected that the PTEs are present as handle_userfault took care
of that and returned VM_FAULT_RETRY. When we unwind back down to
do_user_addr_fault it will attempt to retry after clearing
FAULT_FLAG_ALLOW_RETRY and setting FAULT_FLAG_TRIED.

At any point between the fault being handle and when it's retried a
userspace thread was to zap the page range, let's say via
madvise(MADV_DONTNEED). Now as this fault is being retried the PTEs would
be missing again and we land right back in handle_userfault. Although this
time since FAULT_FLAG_ALLOW_RETRY has been cleared we're going to bail and
return VM_FAULT_SIGBUS.

This scenario is easy to reproduce and I observed it while writing tests for
MREMAP_DONTUNMAP in the userfaultfd selftests. I have a standalone example of
this that uses madvise(MADV_DONTNEED) to cause this race:
https://gist.github.com/bgaff/3a8b2a890ae4771be22456e014c2e5aa

Given that this is genuinely a new fault, is a SIGBUS the best way to go about
this? Since userfaultfd is designed to be used in a non-cooperative case, could
it be more resilient and instead retry for the new fault?

With that being said for VM_UFFD_MISSING userfaults can we just remove the
check in handle_userfault that FAULT_FLAG_ALLOW_RETRY is set in vmf->flags
and allow it to retry the fault to address this race scenario?

In my testing that does solve the problem, but does it possibly create others?
Is this the best way to go about it? I'll defer to the domain experts for any
recommendations here in case I'm way off.

Thanks for your time.

Signed-off-by: Brian Geffon <bgeffon@xxxxxxxxxx>
---
fs/userfaultfd.c | 28 ----------------------------
1 file changed, 28 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index ebf17d7e1093..6407fec798ff 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -416,34 +416,6 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
goto out;
}

- /*
- * Check that we can return VM_FAULT_RETRY.
- *
- * NOTE: it should become possible to return VM_FAULT_RETRY
- * even if FAULT_FLAG_TRIED is set without leading to gup()
- * -EBUSY failures, if the userfaultfd is to be extended for
- * VM_UFFD_WP tracking and we intend to arm the userfault
- * without first stopping userland access to the memory. For
- * VM_UFFD_MISSING userfaults this is enough for now.
- */
- if (unlikely(!(vmf->flags & FAULT_FLAG_ALLOW_RETRY))) {
- /*
- * Validate the invariant that nowait must allow retry
- * to be sure not to return SIGBUS erroneously on
- * nowait invocations.
- */
- BUG_ON(vmf->flags & FAULT_FLAG_RETRY_NOWAIT);
-#ifdef CONFIG_DEBUG_VM
- if (printk_ratelimit()) {
- printk(KERN_WARNING
- "FAULT_FLAG_ALLOW_RETRY missing %x\n",
- vmf->flags);
- dump_stack();
- }
-#endif
- goto out;
- }
-
/*
* Handle nowait, not much to do other than tell it to retry
* and wait.
--
2.25.0.265.gbab2e86ba0-goog