Re: [PATCH v3 4/5] selftests/mseal: add more tests for mmap

From: Jeff Xu
Date: Fri Sep 13 2024 - 19:00:27 EST


Hi Pedro

On Sun, Sep 8, 2024 at 2:35 PM Pedro Falcato <pedro.falcato@xxxxxxxxx> wrote:
>
> I agree with most of the points. Sitting down here to write unofficial
> guidelines for mseal behavior.
>
> mseal should seal regions and mark them immutable, which means their protection
> and contents (ish?) (not _only_ backing mapping, but also contents in general
> (see madvise below)) should not be changed throughout the lifetime of the address space.
>
> For the general syscall interface, this means:
> 1) mprotect and munmap need to be blocked on mseal regions.
> 1a) munmap _cannot_ tolerate partial failure, per POSIX.
> 2b) mmap MAP_FIXED counts as an unmap operation and also needs to be blocked and return -EPERM.
>
> 2) Most madvise calls are allowed, except for destructive operations on
> read-only anonymous _pages_ (MADV_DONTNEED is destructive for anon, but we also don't care
> about blocking these ops if we can do it manually with e.g memset)
> 2a) The current implementation only blocks discard on anonymous _regions_, which is slightly
> different. We probably do want to block these on MAP_PRIVATE file mappings, as to block
> stuff like madvise MADV_DONTNEED on program rodata.
> 2b) We take into account pkeys when doing the permission checks.
>
> 3) mremap is not allowed as we'd change the "contents" of the old region.
> 3a) Should mremap expansion be allowed? aka only block moving and shrinking, but allow expansion.
> We already informally allow expansion if e.g mmapping after it + mseal.
>
> 4) mlock and msync are allowed.
>
> 5) mseal is blocked.
mseal is not blocked, i.e. seal on an already sealed memory is
no-op. This is described in mseal.rst [1]

[1] https://github.com/torvalds/linux/blob/master/Documentation/userspace-api/mseal.rst

>
> 6) Other miscellaneous syscalls (mbind, etc) that do not change contents in any way, are allowed.
> 6a) This obviously means PTEs can change as long as the contents don't. Swapping is also ok.
>
> 7) FOLL_FORCE (kernel-internal speak, more commonly seen as ptrace and /proc/self/mem from userspace)
> should be disallowed (?)
> 7a) This currently does not happen, and seems like a large hole? But disallowing this
> would probably severely break ptrace if the ELF sealing plans come to fruition.
>
Jann Horn pointed out FOLL_FORCE during RFC [2], and this is in mseal.rst too.

In short, FOLL_FORCE is not covered by mseal. On ChromeOS, FOLL_FORCE
is disabled. Recently, Adrian Ratiu upstreamed that [3]

[2] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@xxxxxxxxxxxxxx/
[3] https://lore.kernel.org/lkml/20240802080225.89408-1-adrian.ratiu@xxxxxxxxxxxxx/

-Jeff

> When we say "disallowed", we usually (apart from munmap) allow for partial failure. This
> means getting an -EPERM while part of the call succeeded, if we e.g mprotect a region consisting
> of [NORMAL VMA][SEALED VMA]. We do not want to test for this, because we do not want to paint ourselves
> into a corner - this is strictly "undefined behavior". The msealed regions themselves
> will never be touched in such cases. (we do however want to test munmap operation atomicity, but this is
> also kind of a munmap-related test, and might not really be something we really want in the mseal tests)
>
> Kernel-internal wise: The VMA and PTE modifications resulting from the above operations are blocked.
> Sealed VMAs allow splitting and merging; there was contention about the splitting issue, but it truly
> does not make sense to block operations unless they affect a VMA entirely, and that would also force
> VMA merging to be ABI ("vma_merge isn't merging these two regions and now my madvise works/doesn't work :(").
>
>
> Do I have everything right? Am I missing anything?
>
> --
> Pedro