Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

From: Jann Horn
Date: Wed Oct 28 2020 - 19:52:00 EST


On Wed, Oct 28, 2020 at 7:35 PM Rich Felker <dalias@xxxxxxxx> wrote:
> On Wed, Oct 28, 2020 at 07:25:45PM +0100, Jann Horn wrote:
> > On Wed, Oct 28, 2020 at 6:52 PM Rich Felker <dalias@xxxxxxxx> wrote:
> > > On Wed, Oct 28, 2020 at 06:34:56PM +0100, Jann Horn wrote:
> > > > On Wed, Oct 28, 2020 at 5:49 PM Rich Felker <dalias@xxxxxxxx> wrote:
> > > > > On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> > > > > > On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey <commial@xxxxxxxxx> wrote:
> > > > > > You're just focusing on execve() - I think it's important to keep in
> > > > > > mind what happens after execve() for normal, dynamically-linked
> > > > > > binaries: The next step is that the dynamic linker runs, and it will
> > > > > > poke around in the file system with access() and openat() and fstat(),
> > > > > > it will mmap() executable libraries into memory, it will mprotect()
> > > > > > some memory regions, it will set up thread-local storage (e.g. using
> > > > > > arch_prctl(); even if the process is single-threaded), and so on.
> > > > > >
> > > > > > The earlier you install the seccomp filter, the more of these steps
> > > > > > you have to permit in the filter. And if you want the filter to take
> > > > > > effect directly after execve(), the syscalls you'll be forced to
> > > > > > permit are sufficient to cobble something together in userspace that
> > > > > > effectively does almost the same thing as execve().
> > > > >
> > > > > I would assume you use SECCOMP_RET_USER_NOTIF to implement policy for
> > > > > controlling these operations and allowing only the ones that are valid
> > > > > during dynamic linking. This also allows you to defer application of
> > > > > the filter until after execve. So unless I'm missing some reason why
> > > > > this doesn't work, I think the requested functionality is already
> > > > > available.
> > > >
> > > > Ah, yeah, good point.
> > > >
> > > > > If you really just want the "activate at exec" behavior, it might be
> > > > > possible (depending on how SECCOMP_RET_USER_NOTIF behaves when there's
> > > > > no notify fd open; I forget)
> > > >
> > > > syscall returns -ENOSYS. Yeah, that'd probably do the job. (Even
> > > > though it might be a bit nicer if userspace had control over the errno
> > > > there, such that it could be EPERM instead... oh well.)
> > >
> > > EPERM is a major bug in current sandbox implementations, so ENOSYS is
> > > at least mildly better, but indeed it should be controllable, probably
> > > by allowing a code path for the BPF to continue with a jump to a
> > > different logic path if the notify listener is missing.
> >
> > I guess we might be able to expose the listener status through a bit /
> > a field in the struct seccomp_data, and then filters could branch on
> > that. (And the kernel would run the filter twice if we raced with
> > filter detachment.) I don't know whether it would look pretty, but I
> > think it should be doable...
>
> I was thinking the race wouldn't be salvageable, but indeed since the
> filter is side-effect-free you can just re-run it if the status
> changes between start of filter processing and the attempt at
> notification. This sounds like it should work.
>
> I guess it's not possible to chain two BPF filters to do this, because
> that only works when the first one allows? Or am I misunderstanding
> the multiple-filters case entirely? (I've never gotten that far with
> programming it.)

I'm not sure if I'm understanding the question correctly...
At the moment you basically can't have multiple filters with notifiers.
The rule with multiple filters is always that all the filters get run,
and the actual action taken is the most restrictive result of all of
them.
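
To make that rule concrete, here is a rough userspace sketch of the
precedence logic (not the kernel code itself). The action values are
the ones from <linux/seccomp.h>, copied under local RET_* names so
that it builds without kernel headers; action_rank() and
combine_results() are hypothetical helper names made up for this
illustration. It assumes the action bits are compared as a signed
32-bit value, with the smaller value being the more restrictive one.

```c
#include <stdint.h>

/* Action values as defined in <linux/seccomp.h> (SECCOMP_RET_*),
 * duplicated under local names for this illustration only. */
#define RET_KILL_PROCESS 0x80000000U
#define RET_KILL_THREAD  0x00000000U
#define RET_TRAP         0x00030000U
#define RET_ERRNO        0x00050000U
#define RET_USER_NOTIF   0x7fc00000U
#define RET_TRACE        0x7ff00000U
#define RET_LOG          0x7ffc0000U
#define RET_ALLOW        0x7fff0000U
#define RET_ACTION_FULL  0xffff0000U

/* Hypothetical helper: strip the data bits and view the action as a
 * signed value, so that KILL_PROCESS (top bit set) sorts lowest.
 * Relies on the usual two's-complement conversion done by gcc/clang. */
static int32_t action_rank(uint32_t ret)
{
    return (int32_t)(ret & RET_ACTION_FULL);
}

/* Hypothetical helper: given the return value of every attached
 * filter, pick the one that takes effect. Starts from ALLOW and only
 * replaces it with a strictly more restrictive result, so on a tie
 * the earlier array entry wins. */
static uint32_t combine_results(const uint32_t *results, int n)
{
    uint32_t winner = RET_ALLOW;

    for (int i = 0; i < n; i++) {
        if (action_rank(results[i]) < action_rank(winner))
            winner = results[i];
    }
    return winner;
}
```

So if one filter returns USER_NOTIF and a second one returns
ERRNO|EPERM for the same syscall, the ERRNO result is what takes
effect; a second filter can only make the outcome stricter, it can't
act as a fallback for the first one.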