Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only

From: Nick Kralevich
Date: Sat Sep 19 2020 - 14:14:23 EST

On Fri, Sep 4, 2020 at 5:36 PM Lokesh Gidra <lokeshgidra@xxxxxxxxxx> wrote:
> On Thu, Sep 3, 2020 at 8:34 PM Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
> >
> > 1) why don't you enforce the block of kernel initiated faults with
> > seccomp-bpf instead of adding a sysctl value 2? Is the sysctl just
> > an optimization to remove a few instructions per syscall in the bpf
> > execution of Android unprivileged apps? You should block a lot of
> > other syscalls by default to all unprivileged processes, including
> > vmsplice.
> >
> > In other words if it's just for Android, why can't Android solve it
> > with only patch 1/2 by tweaking the seccomp filter?
> I would let Nick (nnk@) and Jeff (jeffv@) respond to this.
> The previous responses from both of them on this email thread
> (
> and
> suggest that the performance overhead of seccomp-bpf is too much. Kees
> also objected to it
> (
> I'm not familiar with how seccomp-bpf works. All that I can add here
> is that userfaultfd syscall is usually not invoked in a performance
> critical code path. So, if the performance overhead of seccomp-bpf (if
> enabled) is observed on all syscalls originating from a process, then
> I'd say patch 2/2 is essential. Otherwise, it should be ok to let
> seccomp perform the same functionality instead.

There are two primary reasons why seccomp isn't viable here.

1) Seccomp was never designed for whole-of-system protections, and is
impractical to deploy for anything other than "leaf" processes.
2) Attempts to enable seccomp on Android have run into performance
problems, even for trivial seccomp filters.

Let's go into each one.

Issue #1: Seccomp was never designed for whole-of-system protections,
and is impractical to deploy for anything other than "leaf" processes.

Andrea suggests deploying a seccomp filter purely focused on Android
unprivileged[1] (third party installed) apps. However, the intention
is for this security control to be used system-wide[2]. Only processes
which have a need for kernel initiated faults should be allowed to use
them; all other processes should be denied by default. And when I say
"all' processes, I mean "all" processes, even those which run with
UID=0. Andrea's proposal is akin to a denylist, where many modern
distributions (such as Android) use allowlists.

The seemingly obvious solution is to apply a global seccomp filter in
init (PID=1), but it falls down in practice. Seccomp is an incredibly
useful tool, but it wasn't designed to be applied system-wide. Seccomp
is fundamentally hierarchical in nature. A seccomp filter, once
applied, cannot be subsequently relaxed or removed in child processes.
While this restriction is great for leaf processes, it causes problems
for OS designers - a parent process must maintain an unused capability
if any process in the parent's process tree uses that capability. This
makes applying a userfaultfd seccomp filter in init impossible, since
we expect a few of init's children (but not init itself or most of
init's children) to use userfaultfd kernel faults. We end up back to a
wack-a-mole (denylist) problem of trying to modify each individual
process to block userfaultfd kernel faults, defeating the goals of
system-wide protection, and introducing significant complexity into
the system design.

Seccomp should be used in the context where it provides the most value
-- process leaf nodes. But trying to apply seccomp as a system-wide
control just isn't viable.

Lokesh's sysctl proposal doesn't have these problems. When the sysctl
is set to 2 by the OS distributor, all processes which don't have
CAP_SYS_PTRACE are denied kernel generated faults, making the system
safe-by-default. Only processes which are on the OS distributor's
CAP_SYS_PTRACE allowlist (see Android's allowlist at [3]) can generate
these faults, and capabilities can be managed without regards to
process hierarchy. This keeps the system minimally privileged and

Seccomp isn't a viable solution here.

Issue #2: Attempts to enable seccomp on Android globally have run into
performance problems, even for trivial seccomp filters.

Android has tried a few times to enable seccomp globally, but even
excluding the above-mentioned hierarchical process problems, we've
seen performance regressions across the board. Imposing a seccomp
filter merely for userfaultfd imposes a tax on every syscall, even if
the process never makes use of userfaultfd. Lokesh's sysctl proposal
avoids this tax and places the check where it's most effective, with
the rest of the userfaultfd functionality.

See also the threads that Lokesh mentioned above:


-- Nick

[1] The use of the term "unprivileged" is unfortunate. In Android,
there's no coarse-grain privileged vs unprivileged process. Each
process, including root processes, have only the privileges they need,
and not a bit more. As a concrete example, Android's init process
(PID=1) is not allowed to open TCP/UDP sockets, but is allowed to
spawn children which can do so. Having each process be differently
privileged, and ensuring that functionality is only given out on a
need-to-have basis, is an important part of modern OS design.

[2] The trend in modern exploits isn't to perform attacks directly
from untrusted code to the kernel. A lot of the attack surface needed
by an attacker isn't reachable directly from untrusted code, but only
indirectly through other processes. The attacker moves laterally
through the system, exploiting a process which has the necessary
capabilities, then escalating to the kernel. Enforcing security
controls system-wide is an important part of denying an attacker the
tools for an effective exploit and preventing this kind of lateral
movement from being useful. Denying an attacker access to kernel
initiated faults in userfaultfd system-wide (except for authorized
processes) is doubly important, as these kinds of faults are extremely
valuable to an exploit writer (see explanation at

line 107

Nick Kralevich | nnk@xxxxxxxxxx