Re: [PATCH v3 4/5] selftests/mseal: add more tests for mmap

From: Pedro Falcato
Date: Sun Sep 08 2024 - 17:57:09 EST


On Sun, Sep 8, 2024 at 10:35 PM Pedro Falcato <pedro.falcato@xxxxxxxxx> wrote:
> Hi,
>
> I agree with most of the points. Sitting down here to write unofficial
> guidelines for mseal behavior.
>
> mseal should seal regions and mark them immutable, which means their protection
> and contents (ish?) (not _only_ backing mapping, but also contents in general
> (see madvise below)) should not be changed throughout the lifetime of the address space.
>
> For the general syscall interface, this means:
> 1) mprotect and munmap need to be blocked on mseal regions.
> 1a) munmap _cannot_ tolerate partial failure, per POSIX.
> 2b) mmap MAP_FIXED counts as an unmap operation and also needs to be blocked and return -EPERM.
>
> 2) Most madvise calls are allowed, except for destructive operations on
> read-only anonymous _pages_ (MADV_DONTNEED is destructive for anon, but we also don't care
> about blocking these ops if we can do it manually with e.g memset)
> 2a) The current implementation only blocks discard on anonymous _regions_, which is slightly
> different. We probably do want to block these on MAP_PRIVATE file mappings, as to block
> stuff like madvise MADV_DONTNEED on program rodata.
> 2b) We take into account pkeys when doing the permission checks.
>
> 3) mremap is not allowed as we'd change the "contents" of the old region.
> 3a) Should mremap expansion be allowed? aka only block moving and shrinking, but allow expansion.
> We already informally allow expansion if e.g mmapping after it + mseal.
>
> 4) mlock and msync are allowed.
>
> 5) mseal is blocked.
>
> 6) Other miscellaneous syscalls (mbind, etc) that do not change contents in any way, are allowed.
> 6a) This obviously means PTEs can change as long as the contents don't. Swapping is also ok.
>
> 7) FOLL_FORCE (kernel-internal speak, more commonly seen as ptrace and /proc/self/mem from userspace)
> should be disallowed (?)
> 7a) This currently does not happen, and seems like a large hole? But disallowing this
> would probably severely break ptrace if the ELF sealing plans come to fruition.
>
> When we say "disallowed", we usually (apart from munmap) allow for partial failure. This
> means getting an -EPERM while part of the call succeeded, if we e.g mprotect a region consisting
> of [NORMAL VMA][SEALED VMA]. We do not want to test for this, because we do not want to paint ourselves
> into a corner - this is strictly "undefined behavior". The msealed regions themselves
> will never be touched in such cases. (we do however want to test munmap operation atomicity, but this is
> also kind of a munmap-related test, and might not really be something we really want in the mseal tests)
>
> Kernel-internal wise: The VMA and PTE modifications resulting from the above operations are blocked.
> Sealed VMAs allow splitting and merging; there was contention about the splitting issue, but it truly
> does not make sense to block operations unless they affect a VMA entirely, and that would also force
> VMA merging to be ABI ("vma_merge isn't merging these two regions and now my madvise works/doesn't work :(").
>
>
> Do I have everything right? Am I missing anything?

Small addendum: file write, truncate and hole punching might also
matter for sealed file-backed regions, as these change the region's
contents, but we probably
want to rely on file write permissions to protect against this (as we
already do). Any other solution is probably terrible and probably
endlessly NAK'd by fs folks, but it does
mean sealed regions aren't really immutable if you or the attacker can
write to the file.

--
Pedro