Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only

From: Michael S. Tsirkin
Date: Thu Aug 06 2020 - 01:44:28 EST


On Wed, Aug 05, 2020 at 05:43:02PM -0700, Nick Kralevich wrote:
> On Fri, Jul 24, 2020 at 6:40 AM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> >
> > On Thu, Jul 23, 2020 at 05:13:28PM -0700, Nick Kralevich wrote:
> > > On Thu, Jul 23, 2020 at 10:30 AM Lokesh Gidra <lokeshgidra@xxxxxxxxxx> wrote:
> > > > From the discussion so far it seems that there is a consensus that
> > > > patch 1/2 in this series should be upstreamed in any case. Is there
> > > > anything that is pending on that patch?
> > >
> > > That's my reading of this thread too.
> > >
> > > > > > Unless I'm mistaken that you can already enforce bit 1 of the second
> > > > > > parameter of the userfaultfd syscall to be set with seccomp-bpf, this
> > > > > > would be more a question to the Android userland team.
> > > > > >
> > > > > > The question would be: does it ever happen that a seccomp filter isn't
> > > > > > already applied to unprivileged software running without
> > > > > > SYS_CAP_PTRACE capability?
> > > > >
> > > > > Yes.
> > > > >
> > > > > Android uses selinux as our primary sandboxing mechanism. We do use
> > > > > seccomp on a few processes, but we have found that it has a
> > > > > surprisingly high performance cost [1] on arm64 devices so turning it
> > > > > on system wide is not a good option.
> > > > >
> > > > > [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad
> > >
> > > As Jeff mentioned, seccomp is used strategically on Android, but is
> > > not applied to all processes. It's too expensive and impractical when
> > > simpler implementations (such as this sysctl) can exist. It's also
> > > significantly simpler to test a sysctl value for correctness as
> > > opposed to a seccomp filter.
> >
> > Given that selinux is already used system-wide on Android, what is wrong
> > with using selinux to control userfaultfd as opposed to seccomp?
>
> Userfaultfd file descriptors will be generally controlled by SELinux.
> You can see the patchset at
> https://lore.kernel.org/lkml/20200401213903.182112-3-dancol@xxxxxxxxxx/
> (which is also referenced in the original commit message for this
> patchset). However, the SELinux patchset doesn't include the ability
> to control FAULT_FLAG_USER / UFFD_USER_MODE_ONLY directly.
>
> SELinux already has the ability to control who gets CAP_SYS_PTRACE,
> which combined with this patch, is largely equivalent to direct
> UFFD_USER_MODE_ONLY checks. Additionally, with the SELinux patch
> above, movement of userfaultfd file descriptors can be mediated by
> SELinux, preventing one process from acquiring userfaultfd descriptors
> of other processes unless allowed by security policy.
>
> It's an interesting question whether finer-grain SELinux support for
> controlling UFFD_USER_MODE_ONLY should be added. I can see some
> advantages to implementing this. However, we don't need to decide that
> now.
>
> Kernel security checks generally break down into DAC (discretionary
> access control) and MAC (mandatory access control) controls. Most
> kernel security features check via both of these mechanisms. Security
> attributes of the system should be settable without necessarily
> relying on an LSM such as SELinux. This patch follows the same basic
> model -- system wide control of a hardening feature is provided by the
> unprivileged_userfaultfd_user_mode_only sysctl (DAC), and if needed,
> SELinux support for this can also be implemented on top of the DAC
> controls.
>
> This DAC/MAC split has been successful in several other security
> features. For example, the ability to map at page zero is controlled
> in DAC via the mmap_min_addr sysctl [1], and via SELinux via the
> mmap_zero access vector [2]. Similarly, access to the kernel ring
> buffer is controlled both via DAC as the dmesg_restrict sysctl [3], as
> well as the SELinux syslog_read [2] check. Indeed, the dmesg_restrict
> sysctl is very similar to this patch -- it introduces a capability
> (CAP_SYSLOG, CAP_SYS_PTRACE) check on access to a sensitive resource.
>
> If we want to ensure that a security feature will be well tested and
> vetted, it's important to not limit its use to LSMs only. This ensures
> that kernel and application developers will always be able to test the
> effects of a security feature, without relying on LSMs like SELinux.
> It also ensures that all distributions can enable this security
> mitigation should it be necessary for their unique environments,
> without introducing an SELinux dependency. And this patch does not
> preclude an SELinux implementation should it be necessary.
>
> Even if we decide to implement fine-grain SELinux controls on
> UFFD_USER_MODE_ONLY, we still need this patch. We shouldn't make this
> an either/or choice between SELinux and this patch. Both are
> necessary.
>
> -- Nick
>
> [1] https://wiki.debian.org/mmap_min_addr
> [2] https://selinuxproject.org/page/NB_ObjectClassesPermissions
> [3] https://www.kernel.org/doc/Documentation/sysctl/kernel.txt

I am not sure I agree this is similar to dmesg access.

The reason I say it is this: it is pretty easy for admins to know
whether they run something that needs to access the kernel ring buffer.
Or if it's a tool developer poking at dmesg, they can tell admins "we
need these permissions". But it seems impossible for either an admin to
know that a userfaultfd page e.g. used with shared memory is accessed
from the kernel.

So I guess the question is: how does anyone not running Android
know to set this flag?

I got the feeling it's not really possible, and so for a single-user
feature like this a single API seems enough. Given a choice between a
knob an admin is supposed to set and selinux policy written by
presumably knowledgeable OS vendors, I'd opt for a second option.

Hope this helps.

> >
> >
> > > > > >
> > > > > >
> > > > > > If answer is "no" the behavior of the new sysctl in patch 2/2 (in
> > > > > > subject) should be enforceable with minor changes to the BPF
> > > > > > assembly. Otherwise it'd require more changes.
> > >
> > > It would be good to understand what these changes are.
> > >
> > > > > > Why exactly is it preferable to enlarge the surface of attack of the
> > > > > > kernel and take the risk there is a real bug in userfaultfd code (not
> > > > > > just a facilitation of exploiting some other kernel bug) that leads to
> > > > > > a privilege escalation, when you still break 99% of userfaultfd users,
> > > > > > if you set with option "2"?
> > >
> > > I can see your point if you think about the feature as a whole.
> > > However, distributions (such as Android) have specialized knowledge of
> > > their security environments, and may not want to support the typical
> > > usages of userfaultfd. For such distributions, providing a mechanism
> > > to prevent userfaultfd from being useful as an exploit primitive,
> > > while still allowing the very limited use of userfaultfd for userspace
> > > faults only, is desirable. Distributions shouldn't be forced into
> > > supporting 100% of the use cases envisioned by userfaultfd when their
> > > needs may be more specialized, and this sysctl knob empowers
> > > distributions to make this choice for themselves.
> > >
> > > > > > Is the system owner really going to purely run on his systems CRIU
> > > > > > postcopy live migration (which already runs with CAP_SYS_PTRACE) and
> > > > > > nothing else that could break?
> > >
> > > This is a great example of a capability which a distribution may not
> > > want to support, due to distribution specific security policies.
> > >
> > > > > >
> > > > > > Option "2" to me looks with a single possible user, and incidentally
> > > > > > this single user can already enforce model "2" by only tweaking its
> > > > > > seccomp-bpf filters without applying 2/2. It'd be a bug if android
> > > > > > apps runs unprotected by seccomp regardless of 2/2.
> > >
> > > Can you elaborate on what bug is present by processes being
> > > unprotected by seccomp?
> > >
> > > Seccomp cannot be universally applied on Android due to previously
> > > mentioned performance concerns. Seccomp is used in Android primarily
> > > as a tool to enforce the list of allowed syscalls, so that such
> > > syscalls can be audited before being included as part of the Android
> > > API.
> > >
> > > -- Nick
> > >
> > > --
> > > Nick Kralevich | nnk@xxxxxxxxxx
> >
>
>
> --
> Nick Kralevich | nnk@xxxxxxxxxx