Re: Sealed memfd & no-fault mmap

From: Hugh Dickins
Date: Sat May 29 2021 - 16:31:43 EST


On Sat, 29 May 2021, Linus Torvalds wrote:
> On Fri, May 28, 2021 at 9:31 PM Lin, Ming <minggr@xxxxxxxxx> wrote:
> >
> > I should check the vma is not writable.
> >
> > -	if (!IS_NOFAULT(inode))
> > +	if (!IS_NOFAULT(inode) || (vma->vm_flags & VM_WRITE))
> > 		return -EINVAL;
>
> That might be enough, yes.
>
> But if this is sufficient for the compositor needs, and the rule is
> that this only works for read-only mappings, then I think the flag in
> the inode becomes the wrong thing to do.
>
> Because if it's a read-only mapping, and we only ever care about
> inserting zero pages into the page tables - and they never become part
> of the shared memory region itself, then it really is purely about
> that mmap, not about the shm inode.
>
> So then it really does become purely about one particular mmap, and it
> really should be a "madvise()" issue, not a "mark inode as no-fault".

Yes, madvise or mmap flag: the recipient of this fd ought not to be
(even capable of) interfering with other maps of the shared object.

And IIUC it would have to be the recipient (Wayland compositor) doing
the NOFAULT business, because (going back to the original mail) we are
only considering this so that Wayland might satisfy clients who predate
or refuse Linux-only APIs. So, an ioctl (or fcntl, as sealing chose)
at the client end cannot be expected; and could not be relied on anyway.

>
> I'd almost be inclined to just add a new "flags" field to the vma.
> We've been running out of vma flags for a long time, to the point that
> some of them are only available on 64-bit architectures.
>
> I get the feeling that we should just bite the bullet and make
> "vm_flags" be an u64. Or possibly make it two explicitly 32-bit
> entities (vm_flags and vm_extra). Get rid of the silly 64-bit-only
> "high" flags, and get rid of our artificial "we don't have enough
> bits".

u64 saves messing around in the vma_merge() area, which has to
consider whether adjacent vm_flags are identical.
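
To illustrate the difference, here is a minimal userspace sketch. It is
not kernel code: the struct and function names are hypothetical models,
and the real merge checks consider much more than the flags. It only
shows that a single u64 keeps the "are the flags identical" test to one
comparison, whereas a vm_flags/vm_extra split needs both words compared
at every such site.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical models for illustration; not struct vm_area_struct. */
struct vma_u64 {
	uint64_t vm_flags;	/* one 64-bit flags word */
};

struct vma_split {
	uint32_t vm_flags;	/* the existing 32 bits */
	uint32_t vm_extra;	/* assumed second 32-bit word */
};

/* One word: the "identical flags" test stays a single comparison. */
bool flags_match_u64(const struct vma_u64 *a, const struct vma_u64 *b)
{
	return a->vm_flags == b->vm_flags;
}

/* Two words: every comparison site has to remember both fields. */
bool flags_match_split(const struct vma_split *a, const struct vma_split *b)
{
	return a->vm_flags == b->vm_flags && a->vm_extra == b->vm_extra;
}
```

With the split layout, forgetting vm_extra in any one comparison would
let two areas with different behaviour be treated as mergeable.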

>
> Because we already in practice *do* have enough bits, we've just
> artificially limited ourselves to "on 32-bit architectures we only
> have 32 bits in that field".

Yes, that artificial limitation to 32-bit has been silly all along.

>
> But all of this is very much dependent on that "this NOFAULT case
> really only works for reads, not for writes".
>
> (Alternatively, we could allow the *mapping* itself to be writable,
> but always fault on writes, and only insert a read-only zero page)

NOFAULT? Does BSD use "fault" differently, and in Linux terms we
would say NOSIGBUS to mean the same?
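
For reference, the Linux behaviour in question can be reproduced from
userspace with the sketch below. It is only an illustration: the
function name and the "nofault-demo" label are made up, and the raw
syscall is used so that it also builds against C libraries lacking a
memfd_create() wrapper. It maps two pages of a one-page memfd and reads
the second page, which is beyond EOF.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static sigjmp_buf sigbus_env;

static void on_sigbus(int sig)
{
	(void)sig;
	siglongjmp(sigbus_env, 1);	/* unwind out of the faulting read */
}

/*
 * Returns 1 if reading beyond EOF through a shared mapping raised
 * SIGBUS, 0 if the read completed, -1 if the setup failed.
 */
int read_past_eof_raises_sigbus(void)
{
	long page = sysconf(_SC_PAGESIZE);
	struct sigaction sa, old;
	volatile int result = -1;
	char *map;
	int fd;

	fd = (int)syscall(SYS_memfd_create, "nofault-demo", 0);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, page) < 0) {		/* object is one page long */
		close(fd);
		return -1;
	}
	map = mmap(NULL, 2 * page, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigbus;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, &old);

	if (sigsetjmp(sigbus_env, 1) == 0) {
		(void)*(volatile char *)(map + page);	/* beyond EOF */
		result = 0;
	} else {
		result = 1;
	}

	sigaction(SIGBUS, &old, NULL);
	munmap(map, 2 * page);
	close(fd);
	return result;
}
```

On current kernels this should return 1; a "NOSIGBUS" mapping, as
discussed here, would instead let the read complete and see zeroes.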

Can someone point to a specification of BSD's __MAP_NOFAULT?
Searching just found me references to bugs.

What mainly worries me about the suggestion is: what happens to the
zero page inserted into NOFAULT mappings, when later a page for that
offset is created and added to page cache?

Treating it as an opaque blob of zeroes, that stays there ever after,
hiding the subsequent data: easy to implement, but a hack that we would
probably regret. (And I notice that even the quote from David Herrmann
in the original post allows for the possibility that the client may want
to expand the object.)
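
As a baseline for comparison, the sketch below shows what an ordinary
shared mapping does today when the object grows after mmap. The
function name and the "extend-demo" label are made up for illustration.
The range beyond the old EOF is never touched before the extension, and
afterwards the mapping shows the newly written data. A zero page that
stayed in place for ever would break exactly this.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Returns 1 if the mapping shows data written after the object was
 * extended, 0 if it does not, -1 if the setup failed.
 */
int mapping_sees_data_after_extend(void)
{
	long page = sysconf(_SC_PAGESIZE);
	int fd, seen = -1;
	char *map;

	fd = (int)syscall(SYS_memfd_create, "extend-demo", 0);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, page) < 0)		/* start one page long */
		goto out_close;

	/* Map two pages; the second is beyond EOF and not yet touched. */
	map = mmap(NULL, 2 * page, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	/* The sender grows the object and writes into the new page. */
	if (ftruncate(fd, 2 * page) == 0 && pwrite(fd, "x", 1, page) == 1)
		seen = (*(volatile char *)(map + page) == 'x');

	munmap(map, 2 * page);
out_close:
	close(fd);
	return seen;
}
```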

I believe the correct behaviour would be to unmap the nofault page at
that point, allowing the proper page to be faulted in afterwards. That is
certainly doable (the old mm/filemap_xip.c used to do so), but might
get into some awkward race territory, with filesystem dependence
(reminiscent of hole punch, in reverse). shmem could operate that
way, and be the better for it: but I wouldn't want to add that,
without also cleaning away all the shmem_recalc_inode() stuff.

Hugh