Re: [PATCH v4 0/2] Control over userfaultfd kernel-fault handling

From: Andrea Arcangeli
Date: Thu Oct 08 2020 - 00:01:58 EST


Hello Lokesh,

On Wed, Oct 07, 2020 at 01:26:55PM -0700, Lokesh Gidra wrote:
> On Wed, Sep 23, 2020 at 11:56 PM Lokesh Gidra <lokeshgidra@xxxxxxxxxx> wrote:
> >
> > This patch series is split from [1]. The other series enables SELinux
> > support for userfaultfd file descriptors so that its creation and
> > movement can be controlled.
> >
> > It has been demonstrated on various occasions that suspending kernel
> > code execution for an arbitrary amount of time at any access to
> > userspace memory (copy_from_user()/copy_to_user()/...) can be exploited
> > to change the intended behavior of the kernel. For instance, handling
> > page faults in kernel-mode using userfaultfd has been exploited in [2, 3].
> > Likewise, FUSE, which is similar to userfaultfd in this respect, has been
> > exploited in [4, 5] for similar outcome.
> >
> > This small patch series adds a new flag to userfaultfd(2) that allows
> > callers to give up the ability to handle kernel-mode faults with the
> > resulting UFFD file object. It then adds a 'user-mode only' option to
> > the unprivileged_userfaultfd sysctl knob to require unprivileged
> > callers to use this new flag.
> >
> > The purpose of this new interface is to decrease the chance of an
> > unprivileged userfaultfd user taking advantage of userfaultfd to
> > enhance security vulnerabilities by lengthening the race window in
> > kernel code.
> >
> > [1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@xxxxxxxxxx/
> > [2] https://duasynt.com/blog/linux-kernel-heap-spray
> > [3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit

I've looking at those links and I've been trying to verify the link
[3] is relevant.

Specifically I've been trying to verify if 1) current state of the art
modern SLUB randomization techniques already enabled in production and
rightfully wasting some CPU in all enterprise kernels to prevent
things like above to become an issue in practice 2) combined with the
fact different memcg need to share the same kmemcaches (which was
incidentally fixed a few months ago upstream) and 3) further
robustness enhancements against exploits in the slub metadata, may
already render the exploit [3] from 2016 irrelevant in practice.

So I started by trying to reproduce [3] by building 4.5.1 with a
.config with no robustness features and I booted it on fedora-32 or
gentoo userland and I cannot even invoke call_usermodehelper. Calling
socket(22, AF_INET, 0) won't invoke such function. Can you reproduce
on 4.5.1? Which kernel .config should I use to build 4.5.1 in order
for call_usermodehelper to be invoked by the exploit? Could you help
to verify it?

It even has uninitialized variable spawning random perrors so it
doesn't give a warm fuzzy feeling:

====
int main(int argc, char **argv) {
void *region, *map;
^^^^^
pthread_t uffd_thread;
int uffd, msqid, i;

region = (void *)mmap((void *)0x40000000, 0x2000, PROT_READ|PROT_WRITE,
MAP_FIXED|MAP_PRIVATE|MAP_ANON, -1, 0);

if (!region) {
perror("mmap");
exit(2);
}

setup_pagefault(region + 0x1000, 0x1000, 1);

printf("my pid = %d\n", getpid());

if (!map) {
^^^^^^^^
perror("mmap");
====

The whole point of being able to reproduce on 4.5.1 is then to
simulate if the same exploit would also reproduce on current kernels
with all enterprise default robustness features enabled. Or did
anybody already verify it?

Anyway the links I was playing with are all in the cover letter, the
cover letter is not as important as the actual patches. The actual
patches looks fine to me.

The only improvement I can think of is, what about to add a
printk_once to suggest to toggle the sysctl if userfaultfd bails out
because the process lacks the CAP_SYS_PTRACE capability? That would
facilitate the /etc/sysctl.conf or tuned tweaking in case the apps
aren't verbose enough.

It's not relevant anymore with this latest patchset, but about the
previous argument that seccomp couldn't be used in all Android
processes because of performance concern, I'm slightly confused.

https://android-developers.googleblog.com/2017/07/seccomp-filter-in-android-o.html

"Android O includes a single seccomp filter installed into zygote, the
process from which all the Android applications are derived. Because
the filter is installed into zygote—and therefore all apps—the Android
security team took extra caution to not break existing apps"

Example:

$ uname -mo
aarch64 Android
$ cat swapoff.c
#include <sys/swap.h>

int main()
{
swapoff("");
}
$ gcc swapoff.c -o swapoff -O2
$ ./swapoff
Bad system call
$

It's hard to imagine what is more performance critical than the zygote
process and the actual apps as above?

It's also hard to imagine what kind of performance concern can arise
by adding seccomp filters also to background system apps that
generally should consume ~0% of CPU.

If performance is really a concern, the BPF JIT representation with
the bitmap to be able to run the filter in O(1) sounds a better
solution than not adding ad-hoc filters and it's being worked on for
x86-64 and can be ported to aarch64 too. Many of the standalone
background processes likely wouldn't even use uffd at all so you could
block the user initiated faults too that way.

Ultimately because of issues as [3] (be them still relevant or not, to
be double checked), no matter if through selinux, seccomp or a
different sysctl value, without this patchset applied the default
behavior of the userfaultfd syscall for all Linux binaries running on
Android kernels, would deviate from the upstream kernel. So even if we
would make the pipe mutex logic more complex the deviation would
remain. Your patchset adds much less risk of breakage than adding a
timeout to kernel initiated userfaults and it resolves all concerns as
well as a timeout. We'll also make better use of the "0" value this
way. So while I'm not certain this is the best for the long term, this
looks the sweet spot for the short term to resolve many issues at
once.

Thanks!
Andrea