Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only

From: Andrea Arcangeli
Date: Wed May 20 2020 - 15:48:23 EST


Hello Kees,

On Wed, May 20, 2020 at 11:03:39AM -0700, Kees Cook wrote:
> Err, did I miss a separate 6-patch series? I can't find anything on lore.

Daniel included the link of the previous series I referred to is the
cover letter 0/2:

https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@xxxxxxxxxx/

> > If you keep only 1/2, can't seccomp-bpf enforce userfaultfd to be
> > always called with flags==0x1 without requiring extra modifications in
> > the kernel?
>
> Please no. This is way too much overhead for something that a system
> owner wants to enforce globally. A sysctl is the correct option here,
> IMO. If it needs to be a per-userns sysctl, that would be fine too.

The question is who could be this system owner who prefers "2" to "0"?

per-ns I don't see the point either when all containers already run
with default policies enforcing the same behavior as if the sysctl is
set to "0".

Why exactly is it preferable to enlarge the surface of attack of the
kernel and take the risk there is a real bug in userfaultfd code (not
just a facilitation of exploiting some other kernel bug) that leads to
a privilege escalation, when you still break 99% of userfaultfd users,
if you set with option "2"?

Is the system owner really going to purely run on his systems CRIU
postcopy live migration (which already runs with CAP_SYS_PTRACE) and
nothing else that could break?

Option "2" to me looks with a single possible user, and incidentally
this single user can already enforce model "2" by only tweaking its
seccomp-bpf filters without applying 2/2. It'd be a bug if android
apps runs unprotected by seccomp regardless of 2/2.

System owners I think would be better of to stick to "0" or "1" a far
as I can tell, "2" looks a bad tradeoff as system value, with nearly
all cons of "0", but less secure than "0".

> I'd agree that patch 1 should land, as it appears to be required for any

Agreed about merging 1/2.

> further policy considerations. I'm still a big fan of a sysctl since
> this is the kind of thing I would absolutely turn on globally for all my
> systems.

The sysctl /proc/sys/kernel/unprivileged_bpf_disabled is already there
upstream and you should have already set it to "0" in all your systems
if you can cope with some app features not working.

2/2 as modified with Peter's suggestion, would add a new value "2" (in
addition of "1" and "0"), that still breaks the majority of all
possible users, just like value "0", but that gives less security than
value "0".

It all boils down of how peculiar it is to be able to leverage only
the acceleration (reduction in context switches and enter/exit kernel)
and the vma fragmentation avoidance (to avoid running of vmas in
/proc/sys/vm/max_map_count) provided by userfaultfd (vs sigsegv
trapping) but not the transparency in handling faults in kernel (which
sigsegv can't possibly achieve).

Right now there's a single user that can cope with that limitation,
and it's not your running on your servers but on your phone. Even CRIU
cannot cope with such limitation, the only reason it would cope with
value "2" is that it already runs with CAP_SYS_PTRACE for other
reasons.

If there will be more users that can cope with handling only user
initiated page faults, then yes, I think it'd be fine to add a value
"2" later.

The other benefit of enforcing the policy with seccomp-bpf if that if
you'd run Android userland on a container on a ARM server on top of an
enterprise kernel, it'd already run as safe as if the sysctl value was
tweaked to "2", but without having to also add extra kernel code for
per-ns sysctl. It just looks simpler. The rest of the containers on
the same host are already running today as if the sysctl is set to 0
regardless of this patchset.

If you want to enforce maximum security and override any possible
opt-out of the default podman seccomp profile that blocks userfaultfd,
you already can upstream with the global sysctl by setting it to "0".

Thanks,
Andrea