Re: [RFC 1/3] seccomp: add a return code to trap to userspace

From: Christian Brauner
Date: Thu Feb 15 2018 - 09:49:07 EST


On Wed, Feb 14, 2018 at 05:19:52PM +0000, Andy Lutomirski wrote:
> On Wed, Feb 14, 2018 at 3:29 PM, Tycho Andersen <tycho@xxxxxxxx> wrote:
> > Hey Kees,
> >
> > Thanks for taking a look!
> >
> > On Tue, Feb 13, 2018 at 01:09:20PM -0800, Kees Cook wrote:
> >> On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen <tycho@xxxxxxxx> wrote:
> >> > This patch introduces a means for syscalls matched in seccomp to notify
> >> > some other task that a particular filter has been triggered.
> >> >
> >> > The motivation for this is primarily for use with containers. For example,
> >> > if a container does an init_module(), we obviously don't want to load this
> >> > untrusted code, which may be compiled for the wrong version of the kernel
> >> > anyway. Instead, we could parse the module image, figure out which module
> >> > the container is trying to load and load it on the host.
> >> >
> >> > As another example, containers cannot mknod(), since this checks
> >> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> >> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> >> > coding some whitelist in the kernel. Another example is mount(), which has
> >> > many security restrictions for good reason, but configuration or runtime
> >> > knowledge could potentially be used to relax these restrictions.
> >>
> >> Related to the eBPF seccomp thread, can the logic for these things be
> >> handled entirely by eBPF? My assumption is that you still need to stop
> >> the process to do something (i.e. do a mknod, or a mount) before
> >> letting it continue. Is there some "wait for notification" system in
> >> eBPF?
> >
> > I replied in the other thread
> > (https://patchwork.ozlabs.org/cover/872938/#1856642 for those
> > following along at home), but no, at least not that I know of.
>
> eBPF can call functions. One of those functions could put the caller
> to sleep. In fact, I think I once proposed doing this for the seccomp
> logging action as well.
>
> >> I wonder if this communication should be netlink, which gives a more
> >> well-structured way to describe what's on the wire? The reason I ask
> >> is because if we ever change the seccomp_data structure, we'll now
> >> have two places where we need to deal with it (the first being within
> >> the BPF itself). My initial idea was to prefix the communication with
> >> a size field, then send the structure, and then I had nightmares, and
> >> realized this was basically netlink reinvented.
> >
> > I suggested netlink in LA, and everyone (especially Andy) groaned very
> > loudly :). I'm happy to switch it to netlink if you like, although I
> > think memcpy() of structs should be safe here, since the return value
> > from read or write can indicate the size of things.
>
> I could easily get on board with "netlink" (i.e. NLA) messages sent
> over an fd. I will object strongly to the use of netlink *sockets*.

I think sending netlink messages makes perfect sense here, although it
burdens userspace with all those nice macros needed to parse them.
Are there already other cases where userspace gets netlink messages on
an fd without having opened a netlink socket?
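
To make the parsing burden concrete, here is a rough sketch of what
userspace would have to do with NLA-style attributes read from a plain
fd. It is only an illustration under assumptions: struct attr_hdr,
attr_put(), attr_find() and the attribute type numbers are made up for
this mail and are not part of the patch. The layout mirrors the
kernel's struct nlattr (16-bit length, 16-bit type, 4-byte alignment),
but the helpers are hand-rolled so the example stands alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header, laid out like the kernel's struct nlattr. */
struct attr_hdr {
	uint16_t len;	/* header + payload, excluding trailing padding */
	uint16_t type;
};

static_assert(sizeof(struct attr_hdr) == 4, "header must be 4 bytes");

#define ATTR_ALIGNTO	((size_t)4)
#define ATTR_ALIGN(n)	(((size_t)(n) + ATTR_ALIGNTO - 1) & ~(ATTR_ALIGNTO - 1))
#define ATTR_HDRLEN	ATTR_ALIGN(sizeof(struct attr_hdr))

/*
 * Append one attribute at offset @off in @buf (capacity @cap).
 * Returns the offset of the next attribute, or 0 if it does not fit.
 */
static size_t attr_put(unsigned char *buf, size_t off, size_t cap,
		       uint16_t type, const void *data, size_t len)
{
	struct attr_hdr hdr;
	size_t total, padded;

	if (len > UINT16_MAX - ATTR_HDRLEN)
		return 0;
	total = ATTR_HDRLEN + len;
	padded = ATTR_ALIGN(total);
	if (off > cap || padded > cap - off)
		return 0;

	hdr.len = (uint16_t)total;
	hdr.type = type;
	memset(buf + off, 0, padded);		/* zero the padding too */
	memcpy(buf + off, &hdr, sizeof(hdr));
	if (len)
		memcpy(buf + off + ATTR_HDRLEN, data, len);
	return off + padded;
}

/*
 * Find the first attribute of @type in @buf, where @buflen is e.g. the
 * return value of read(). Returns a pointer to the payload and stores
 * its length in @paylen, or returns NULL if the attribute is missing
 * or the buffer is malformed/truncated.
 */
static const void *attr_find(const void *buf, size_t buflen,
			     uint16_t type, size_t *paylen)
{
	const unsigned char *p = buf;
	size_t off = 0;

	while (buflen - off >= ATTR_HDRLEN) {
		struct attr_hdr hdr;
		size_t step;

		/* memcpy avoids unaligned access on a raw byte buffer. */
		memcpy(&hdr, p + off, sizeof(hdr));
		if (hdr.len < ATTR_HDRLEN || hdr.len > buflen - off)
			return NULL;
		if (hdr.type == type) {
			*paylen = hdr.len - ATTR_HDRLEN;
			return p + off + ATTR_HDRLEN;
		}
		step = ATTR_ALIGN(hdr.len);
		if (step > buflen - off)
			break;			/* last attribute, unpadded */
		off += step;
	}
	return NULL;
}
```

A reader would call attr_find() once per attribute it understands and
skip everything else, which is what lets the wire format grow without
breaking older userspace. In practice one would probably reach for an
existing library instead of open-coding the walk.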

>
> >
> >> An ERRNO filter would block a USER_NOTIF because it's unconditional.
> >> TRACE could be either, USER_NOTIF could be either.
> >>
> >> This means TRACE rules would be bumped by a USER_NOTIF... hmm.
> >
> > Yes, I didn't exactly know what to do here. ERRNO, TRAP, and KILL all
> > seemed more important than USER_NOTIF, but TRACE didn't. I don't have
> > a strong opinion about what to do here, because users can adjust their
> > filters accordingly. Let me know what you prefer.
>
> If we switched to eBPF functions, this whole issue goes away.