Re: [PATCH v4 1/1] exec: seal system mappings
From: Jeff Xu
Date: Wed Jan 15 2025 - 14:02:49 EST
On Mon, Jan 13, 2025 at 1:26 PM Jeff Xu <jeffxu@xxxxxxxxxxxx> wrote:
>
> On Mon, Jan 6, 2025 at 5:12 PM Kees Cook <kees@xxxxxxxxxx> wrote:
> >
> > On Fri, Jan 03, 2025 at 09:38:10PM +0000, Lorenzo Stoakes wrote:
> > > On Tue, Dec 17, 2024 at 02:18:53PM -0800, Kees Cook wrote:
> > > > On Mon, Nov 25, 2024 at 08:20:21PM +0000, jeffxu@xxxxxxxxxxxx wrote:
> > > > > Seal vdso, vvar, sigpage, uprobes and vsyscall.
> > > > >
> > > > > Those mappings are readonly or executable only, sealing can protect
> > > > > them from ever changing or unmapped during the life time of the process.
> > > > > For complete descriptions of memory sealing, please see mseal.rst [1].
> > > > >
> > > > > System mappings such as vdso, vvar, and sigpage (for arm) are
> > > > > generated by the kernel during program initialization, and are
> > > > > sealed after creation.
> > > > > [...]
> > > > >
> > > > > + exec.seal_system_mappings = [KNL]
> > > > > + Format: { no | yes }
> > > > > + Seal system mappings: vdso, vvar, sigpage, vsyscall,
> > > > > + uprobe.
> > > > > + - 'no': do not seal system mappings.
> > > > > + - 'yes': seal system mappings.
> > > > > + This overrides CONFIG_SEAL_SYSTEM_MAPPINGS=(y/n)
> > > > > + If not specified or invalid, default is the value set by
> > > > > + CONFIG_SEAL_SYSTEM_MAPPINGS.
> > > > > + This option has no effect if CONFIG_64BIT=n
> > > >
> > > > I know there is a v5 coming, but I wanted to give my thoughts to help
> > > > shape it based on the current discussion threads.
> > > >
> > > > The callers of _install_special_mapping() cover what is mentioned here.
> > > > The vdso is very common (arm, arm64, csky, hexagon, loongarch, mips,
> > > > parisc, powerpc, riscv, s390, sh, sparc, x86, um). For those with vdso,
> > > > some also have vvar (arm, arm64, loongarch, mips, powerpc, riscv, s390,
> > > > sparc, x86). After that, I see a few extra things, in addition to
> > > > sigpage and uprobes as mentioned already in the patch:
> > > >
> > > > arm sigpage
> > > > arm64 compat vectors (what is this for arm?)
> > > > arm64 compat sigreturn (what is this for arm?)
> > > > nios2 kuser helpers
> > > > uprobes
> > >
> > > OK let's not get ahead of ourselves :)
> > >
> > > VDSOs/gate VMAs are treated quite differently by different arches. So we
> > > have to tread _very_ carefully here.
> > >
> > > I believe PPC doe some 'tricky' things and may actually want to unmap, for
> > > instance.
> > >
> > > The problem with this kind of change is we're doing something fundamental
> > > that impacts _every possible combinatorial combination of configs, arches,
> > > and use cases_ for each of these which we seeming - just assume - will have
> > > no issue with this.
> > >
> > > This is insufficient, deeply. We need:
> > >
> > > 1. Strong justification (hand waving won't suffice).
> > > 2. Very extensive testing and checking, and _proof_ of this testing being
> > > performed.
> > > 3. Buy-in from arch maintainers.
> > >
> > > So far this series has provided none of those. This is why I am cautious
> > > and pushing back here.
> >
> > Sure, I agree. This is why I was suggested the ...ARCH_HAS... Kconfig.
> > That will provide the way for 3) to happen. 1) just needs a little more
> > details in the commit log, I guess? The goal is attack surface reduction
> > in userspace, and remapping shenanigans have become a recent avenue of
> > attack.
> >
> > For 2) there are limits. As you say we may have "every possible
> > combinatorial combination", which may not be feasible to test. But
> > making it available for the common cases (and of course testing those)
> > makes sense.
> >
> > > And I absolutely will not accept a user being able to turn on a switch in a
> > > known-broken configuration. This is absolutely unacceptable.
> >
> > Sure, of course.
> >
> > > It's equally unacceptable for a user to enable a feature that is
> > > untested/confirmed on an architecture.
> >
> > Agreed.
> >
> > > So let's be careful about Linus's edict here - the operative part being 'if
> > > it doesn't break things'.
> >
> > Right -- I should clarify: I don't mean to say "it should be enabled by
> > default", I meant to say that we have a common pattern for making these
> > kinds of features available without hiding them behind a build-time
> > Kconfig that would have put the features out of reach for system owners
> > that only use distro kernels, etc. I was pushing back on an earlier
> > comment that I interpreted as rejecting boot params. A boot param (when
> > other aspects of the system are sane) is needed for this kind of thing,
> > and is the pattern we use for providing optional features that distros
> > can make available without enabling them by default.
> >
> > > > So, if we want to have a CONFIG_MSEAL_SYSTEM_MAPPINGS at all, it should
> > > > be "default y" since we have the ...ARCH_HAS... config already, and then
> > > > add a CONFIG_MSEAL_SYSTEM_MAPPINGS_DEFAULT that is off by default (since
> > > > we expect there may be userspace impact) and tie _that_ to the kernel
> > > > command-line so that end users can use it, or system builders can enable
> > > > CONFIG_MSEAL_SYSTEM_MAPPINGS_DEFAULT.
> > >
> > > Again, I hate to push on this, but I am simply not going to allow users to
> > > enable features we know break things.
> > >
> > > Users might not be aware this feature is broken for CRIU, and X, and Y and
> > > whatever else we've not thought about and enable it thinking it helps
> > > security, and end up with a broken system.
> >
> > This will never be a bright line, and I think choice is more important.
> > For example, Ubuntu builds with CRIU, but only a tiny set of tools
> > actually use it. (I've actually been considering adding a boot param to
> > disable CRIU features since they undermine some aspects of userspace
> > security.)
> >
> > Regardless, yes, if we can make this work with CRIU (which I thought
> > there seem to be consensus on), let's do it.
> >
> > > This seems like putting the onus on CRIU users to deal with a known-broken
> > > thing? That seems really unreasonable? And people would just have to have
> > > the right userland code to work in the kernel with mseal?
> > >
> > > Yeah I oppose entirely this unless I'm missing something?
> >
> > Hm, well, the primary goal is for Chrome OS and Android to use this. If
> > there is honestly no path forward with CRIU, then hard Kconfig conflict
> > it is. I'd much rather have it available for anyone who wants it, just
> > like we do with lots of other features. Why force people who want this
> > and not CRIU to build their own kernels? We have all kinds of boot params
> > that if you set you get a broken system.
> >
> This patch is intended for ChromeOS and Android and is
> feature-complete from their perspective.
>
> To simplify v5, I propose removing kernel-cmd-line and avoiding the
> complexities of CRIU/UML and gVisor. The KCONFIG is disabled by
> default and will only apply to ARM and Intel architectures.
>
> Later when a generic distribution wants to enable this feature, we can
> work out a solution to handle those complexities.
>
> Is this a reasonable path to move forward ?
>
If a complete solution is desired, we can continue to discuss the
open/unresolved items.
This summarizes code logic requirements/comments.
1> Enable the feature per architecture (Lorenze)
Current state: Already in current patch (v4)
2> CONFIG_SEAL_SYSTEM_MAPPINGS
Current State: In v4, this depends on !CHECKPOINT_RESTORE. This
dependency is unhelpful for gVisor and UML (which don't depend on
CRIU) and should be removed.
(next step): remove !CHECKPOINT_RESTORE dependency.
3> Per process prctl (Andrei Vagin, Benjamin Berg)
Current state: Not in v4. Andrei suggested using prctl for opt-out.
This shouldn't depend on CRIU/CAP_CHECKPOINT_RESTORE (since gVisor is
a different feature)
(next step): Add this opt-out prctl under the new KCONFIG:
CONFIG_SEAL_SYSTEM_MAPPING_WITH_PRCTL_OPTOUT
4> kernel cmd line:
Current state: This is in v4. Lorenze commented that the new kernel
cmd line doesn't have effect on the 32 bit build, so users might think
it is enabled but not.
This is due to the fact that the 32 bit kernel doesn't support
mseal(), i.e. mseal.c is not even built under 32 bit.
Standard practice when processing kernel cmd lines with incorrect
input is to do nothing (no logs or crashes). Addressing this is
difficult unless mseal is supported for all architectures.
These are all the code logic related comments. Did I miss anything ?
Other non-code related comments (threat-model, tests, etc) will be added later.
To simplify the next version (if no agreement is reached), I can focus
on items 1 and 2. This would satisfy ChromeOS and Android, as well as
distributions and users who don't use CRIU/UML/gVisor (likely the
majority). Items 3 and 4 could be addressed later.
Comments, suggestions, and alternative approaches are welcome.
Thanks
-Jeff
> Thanks
> -Jeff
>
> > -Kees
> >
> > --
> > Kees Cook