Re: [PATCH v4 0/2] Control over userfaultfd kernel-fault handling

From: Nick Kralevich
Date: Thu Oct 08 2020 - 19:22:56 EST


On Wed, Oct 7, 2020 at 9:01 PM Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
>
> Hello Lokesh,
>
> On Wed, Oct 07, 2020 at 01:26:55PM -0700, Lokesh Gidra wrote:
> > On Wed, Sep 23, 2020 at 11:56 PM Lokesh Gidra <lokeshgidra@xxxxxxxxxx> wrote:
> > >
> > > This patch series is split from [1]. The other series enables SELinux
> > > support for userfaultfd file descriptors so that its creation and
> > > movement can be controlled.
> > >
> > > It has been demonstrated on various occasions that suspending kernel
> > > code execution for an arbitrary amount of time at any access to
> > > userspace memory (copy_from_user()/copy_to_user()/...) can be exploited
> > > to change the intended behavior of the kernel. For instance, handling
> > > page faults in kernel-mode using userfaultfd has been exploited in [2, 3].
> > > Likewise, FUSE, which is similar to userfaultfd in this respect, has been
> > > exploited in [4, 5] for similar outcome.
> > >
> > > This small patch series adds a new flag to userfaultfd(2) that allows
> > > callers to give up the ability to handle kernel-mode faults with the
> > > resulting UFFD file object. It then adds a 'user-mode only' option to
> > > the unprivileged_userfaultfd sysctl knob to require unprivileged
> > > callers to use this new flag.
> > >
> > > The purpose of this new interface is to decrease the chance of an
> > > unprivileged userfaultfd user taking advantage of userfaultfd to
> > > enhance security vulnerabilities by lengthening the race window in
> > > kernel code.
> > >
> > > [1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@xxxxxxxxxx/
> > > [2] https://duasynt.com/blog/linux-kernel-heap-spray
> > > [3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
>
> I've looking at those links and I've been trying to verify the link
> [3] is relevant.
>
> Specifically I've been trying to verify if 1) current state of the art
> modern SLUB randomization techniques already enabled in production and
> rightfully wasting some CPU in all enterprise kernels to prevent
> things like above to become an issue in practice 2) combined with the
> fact different memcg need to share the same kmemcaches (which was
> incidentally fixed a few months ago upstream) and 3) further
> robustness enhancements against exploits in the slub metadata, may
> already render the exploit [3] from 2016 irrelevant in practice.

It's quite possible that some other mitigation was helpful against the
technique used by this particular exploit. It's the nature of exploits
that they are fragile and will change as new soft mitigations are
introduced. The effectiveness of a particular exploit mitigation
change is orthogonal to the change presented here.

The purpose of this change is to prevent an attacker from suspending
kernel code execution and having kernel data structures in a
predictable state. This makes it harder for an attacker to "win" race
conditions against various kernel data structures. This change
compliments other kernel hardening changes such as the changes you've
referenced above. Focusing on one particular exploit somewhat misses
the point of this change.

>
> So I started by trying to reproduce [3] by building 4.5.1 with a
> .config with no robustness features and I booted it on fedora-32 or
> gentoo userland and I cannot even invoke call_usermodehelper. Calling
> socket(22, AF_INET, 0) won't invoke such function. Can you reproduce
> on 4.5.1? Which kernel .config should I use to build 4.5.1 in order
> for call_usermodehelper to be invoked by the exploit? Could you help
> to verify it?

I haven't tried to verify this myself. I wonder if the usermode
hardening changes also impacted this exploit? See
https://lkml.org/lkml/2017/1/16/468

But again, focusing on an exploit, which is inherently fragile in
nature and dependent on the state of the kernel tree at a particular
time, is unlikely to be useful to analyze this patch.

>
> It even has uninitialized variable spawning random perrors so it
> doesn't give a warm fuzzy feeling:
>
> ====
> int main(int argc, char **argv) {
> void *region, *map;
> ^^^^^
> pthread_t uffd_thread;
> int uffd, msqid, i;
>
> region = (void *)mmap((void *)0x40000000, 0x2000, PROT_READ|PROT_WRITE,
> MAP_FIXED|MAP_PRIVATE|MAP_ANON, -1, 0);
>
> if (!region) {
> perror("mmap");
> exit(2);
> }
>
> setup_pagefault(region + 0x1000, 0x1000, 1);
>
> printf("my pid = %d\n", getpid());
>
> if (!map) {
> ^^^^^^^^
> perror("mmap");
> ====
>
> The whole point of being able to reproduce on 4.5.1 is then to
> simulate if the same exploit would also reproduce on current kernels
> with all enterprise default robustness features enabled. Or did
> anybody already verify it?
>
> Anyway the links I was playing with are all in the cover letter, the
> cover letter is not as important as the actual patches. The actual
> patches looks fine to me.

That's great to hear.

>
> The only improvement I can think of is, what about to add a
> printk_once to suggest to toggle the sysctl if userfaultfd bails out
> because the process lacks the CAP_SYS_PTRACE capability? That would
> facilitate the /etc/sysctl.conf or tuned tweaking in case the apps
> aren't verbose enough.
>
> It's not relevant anymore with this latest patchset, but about the
> previous argument that seccomp couldn't be used in all Android
> processes because of performance concern, I'm slightly confused.

Seccomp causes more problems than just performance. Seccomp is not
designed for whole-of-system protections. Please see my other writeup
at https://lore.kernel.org/lkml/CAFJ0LnEo-7YUvgOhb4pHteuiUW+wPfzqbwXUCGAA35ZMx11A-w@xxxxxxxxxxxxxx/

>
> https://android-developers.googleblog.com/2017/07/seccomp-filter-in-android-o.html
>
> "Android O includes a single seccomp filter installed into zygote, the
> process from which all the Android applications are derived. Because
> the filter is installed into zygote—and therefore all apps—the Android
> security team took extra caution to not break existing apps"
>
> Example:
>
> $ uname -mo
> aarch64 Android
> $ cat swapoff.c
> #include <sys/swap.h>
>
> int main()
> {
> swapoff("");
> }
> $ gcc swapoff.c -o swapoff -O2
> $ ./swapoff
> Bad system call
> $
>
> It's hard to imagine what is more performance critical than the zygote
> process and the actual apps as above?
>
> It's also hard to imagine what kind of performance concern can arise
> by adding seccomp filters also to background system apps that
> generally should consume ~0% of CPU.
>
> If performance is really a concern, the BPF JIT representation with
> the bitmap to be able to run the filter in O(1) sounds a better
> solution than not adding ad-hoc filters and it's being worked on for
> x86-64 and can be ported to aarch64 too. Many of the standalone
> background processes likely wouldn't even use uffd at all so you could
> block the user initiated faults too that way.
>
> Ultimately because of issues as [3] (be them still relevant or not, to
> be double checked), no matter if through selinux, seccomp or a
> different sysctl value, without this patchset applied the default
> behavior of the userfaultfd syscall for all Linux binaries running on
> Android kernels, would deviate from the upstream kernel. So even if we
> would make the pipe mutex logic more complex the deviation would
> remain. Your patchset adds much less risk of breakage than adding a
> timeout to kernel initiated userfaults and it resolves all concerns as
> well as a timeout. We'll also make better use of the "0" value this
> way. So while I'm not certain this is the best for the long term, this
> looks the sweet spot for the short term to resolve many issues at
> once.
>
> Thanks!
> Andrea
>


--
Nick Kralevich | nnk@xxxxxxxxxx