Re: Sealed memfd & no-fault mmap

From: Linus Torvalds
Date: Tue Apr 27 2021 - 12:52:25 EST


On Tue, Apr 27, 2021 at 1:25 AM Simon Ser <contact@xxxxxxxxxxx> wrote:
>
> Rather than requiring changes in all compositors *and* clients, can we
> maybe only require changes in compositors? For instance, OpenBSD has a
> __MAP_NOFAULT flag. When passed to mmap, it means that out-of-bound
> accesses will read as zeroes instead of triggering SIGBUS. Such a flag
> would be very helpful to unblock the annoying SIGBUS situation.
>
> Would something among these lines be welcome in the Linux kernel?

Hmm. It doesn't look too hard to do. The biggest problem is actually
that we've run out of flags in the vma (on 32-bit architectures), but
you could try this UNTESTED patch that just does the MAP_NOFAULT thing
unconditionally.

NOTE! Not only is it untested, not only is this a "for your testing
only" (because it does it unconditionally rather than only for
__MAP_NOFAULT), but it might be bogus for other reasons. In
particular, this patch depends on "vmf->address" not being changed by
the ->fault() infrastructure, so that we can just re-use the vmf for
the anonymous case if we get a SIGBUS.

I think that's all ok these days, because Kirill and Peter Xu cleaned
up those paths, but I didn't actually check. So I'm cc'ing Kirill,
Peter and Will, who have been working in this area for other reasons
fairly recently.

Side note: this will only ever work for non-shared mappings. That's
fundamental. We won't add an anonymous page to a shared mapping, and
do_anonymous_page() does verify that. So a MAP_SHARED mappign will
still return SIGBUS even with this patch (although it's not obvious
from the patch - the VM_FAULT_SIGBUS will just be re-created by
do_anonymous_page()).

So if you want a _shared_ mapping to honor __MAP_NOFAULT and insert
random anonymous pages into it, I think the answer is "no, that's not
going to be viable".

So _if_ this works for you, and if it's ok that only MAP_PRIVATE can
have __MAP_NOFAULT, and if Kirill/Peter/Will don't say "Oh, Linus,
you're completely off your rocker and clearly need to be taking your
meds", something like this - if we figure out the conditional bit -
might be doable.

That's a fair number of "ifs".

Ok, back to the merge window for me, I'll be throwing away this crazy
untested patch immediately after hitting "send". This is very much a
"throw the idea over to other people" patch, in other words.

Linus
mm/memory.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 550405fc3b5e..bbede6b52f7a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4312,10 +4312,21 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
}

if (!vmf->pte) {
- if (vma_is_anonymous(vmf->vma))
- return do_anonymous_page(vmf);
- else
- return do_fault(vmf);
+ if (!vma_is_anonymous(vmf->vma)) {
+ vm_fault_t ret = do_fault(vmf);
+ if (ret & VM_FAULT_RETRY)
+ return ret;
+ if (!(ret & VM_FAULT_SIGBUS))
+ return ret;
+/* FIXME! We don't have a VM_NOFAULT bit */
+#if 0
+ /* See if we should turn a SIGBUS into an anonymous page */
+ if (!(vma->vm_flags & VM_NOFAULT))
+ return ret;
+#endif
+/* Fall back on do_anonymous_page() instead of SIGBUS */
+ }
+ return do_anonymous_page(vmf);
}

if (!pte_present(vmf->orig_pte))