Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas

From: Andy Lutomirski
Date: Sun May 31 2020 - 14:57:19 EST


On Sun, May 31, 2020 at 11:36 AM Paul Gofman <gofmanp@xxxxxxxxx> wrote:
>
> On 5/31/20 21:10, Andy Lutomirski wrote:
> >
> > That's not what I meant. I meant that you would set the kernel up to
> > redirect *all* syscalls from the thread with the sole exception of one
> > syscall instruction in the thunk. This would catch Windows syscalls
> > and Linux syscalls. The thunk would determine whether the original
> > syscall was Linux or Windows and handle it accordingly.
> >
> > This may interact poorly with the DRM scheme. The redzone might need
> > to be respected, or stack switching might be needed.
>
> Oh yeah, I see now, thanks. Sure, we could trap every syscall and have a
> Seccomp-allowed trampoline for executing native ones with the existing
> Seccomp implementation. But this is going to have prohibitive
> performance impact. The specifics of our present use case are that the
> vast majority of syscalls do not need to be emulated; they are native.
> Just a few come from the Windows application, and those we need to trap
> and route to our handler to let the program continue, while we do not
> care too much about the overhead for those few. So the hope was that the
> kernel could route that majority of native Linux syscalls internally
> with only minor overhead.
> I've read the suggestion to use SECCOMP_RET_USER_NOTIF instead of
> SECCOMP_RET_TRAP. Is handling the trap this way supposed to be much
> quicker than handling the SIGSYS from SECCOMP_RET_TRAP? More
> specifically, would not SECCOMP_RET_USER_NOTIF effectively serialize all
> the syscalls waiting in a single queue for processing, while
> SECCOMP_RET_TRAP can be processed without exclusive locking?
>
>

Using SECCOMP_RET_USER_NOTIF is likely to be considerably more
expensive than my scheme. On a non-PTI system, my approach will add a
few tens of ns to each syscall. On a PTI system, it will be worse.
But using any kind of notifier for all syscalls will cause a context
switch to a different user program for each syscall, and that will be
much slower.
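
For comparison, here is roughly what "allow one syscall instruction, trap
everything else" looks like with seccomp as it exists today, using the
instruction_pointer field of struct seccomp_data. This is only a sketch of
the existing-seccomp approximation, not of the cheaper redirect scheme
described above. The names build_thunk_filter, install_thunk_filter and
THUNK_FILTER_LEN are made up for illustration, the code assumes a
little-endian 64-bit target, and thunk_ip is assumed to be whatever address
the kernel reports for the thunk's syscall (on x86-64 that should be the
address just after the syscall instruction, but check before relying on it).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/prctl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Hypothetical name: number of instructions in the filter built below. */
#define THUNK_FILTER_LEN 6

/*
 * Fill prog[] with a classic-BPF seccomp filter that allows a syscall only
 * when seccomp_data.instruction_pointer equals thunk_ip, and returns
 * SECCOMP_RET_TRAP (SIGSYS) otherwise. Assumes little-endian, so the low
 * 32 bits of the 64-bit field sit at the lower offset.
 */
size_t build_thunk_filter(struct sock_filter *prog, uint64_t thunk_ip)
{
    const uint32_t ip_lo = (uint32_t)thunk_ip;
    const uint32_t ip_hi = (uint32_t)(thunk_ip >> 32);
    const uint32_t off = offsetof(struct seccomp_data, instruction_pointer);
    const struct sock_filter insns[THUNK_FILTER_LEN] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, off),          /* A = low word   */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ip_lo, 0, 3), /* mismatch: trap */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, off + 4),      /* A = high word  */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ip_hi, 0, 1), /* mismatch: trap */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
    };

    assert(prog != NULL);
    memcpy(prog, insns, sizeof(insns));
    return THUNK_FILTER_LEN;
}

/* Install the filter on the calling thread. Returns 0 on success, -1 on error. */
int install_thunk_filter(uint64_t thunk_ip)
{
    struct sock_filter prog[THUNK_FILTER_LEN];
    struct sock_fprog fprog = {
        .len = (unsigned short)build_thunk_filter(prog, thunk_ip),
        .filter = prog,
    };

    /* Unprivileged tasks must set no_new_privs before installing a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
}
```

A real setup would presumably also have to let rt_sigreturn through by
syscall number, since the return from the SIGSYS handler is not issued from
the thunk. Every trapped syscall here pays for signal delivery, which is
the overhead under discussion.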

I think that the implementation may well want to live in seccomp, but
doing this as a seccomp filter isn't quite right. It's not a security
thing -- it's an emulation thing. Seccomp is all about making
inescapable sandboxes, but that's not what you're doing at all, and
the fact that seccomp filters are preserved across execve() sounds
like it'll be annoying for you.

What if there were a special filter type that ran a BPF program on each
syscall, and the program was allowed to access user memory to make its
decisions, e.g. to look at some list of memory addresses? This
would explicitly *not* be a security feature -- execve() would remove
the filter, and the filter's outcome would be one of redirecting
execution or allowing the syscall. If the "allow" outcome occurs,
then regular seccomp filters run. Obviously the exact semantics here
would need some care.
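
To make the intended ordering concrete, here is a small userspace model of
the decision described above. It is purely illustrative: no such filter
type exists, and every name in it (emu_range, emu_filter, emu_dispatch, the
enum values) is hypothetical. It assumes the "list of memory addresses" is
a list of half-open ranges from which native syscalls are expected, and it
stands in for the regular seccomp filters with a plain callback.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: outcome of the emulation filter itself. */
enum emu_verdict { EMU_ALLOW, EMU_REDIRECT };

/* Hypothetical: what finally happens to the syscall. */
enum emu_action { EMU_ACT_REDIRECTED, EMU_ACT_EXECUTED, EMU_ACT_DENIED };

/* Half-open address range [start, end) that issues native syscalls. */
struct emu_range {
    uint64_t start;
    uint64_t end;
};

/* Stand-in for the regular seccomp filters: true means "allow". */
typedef bool (*seccomp_check_fn)(long nr);

/*
 * Emulation filter: allow when the calling instruction pointer lies in one
 * of the native ranges, otherwise redirect to the emulation handler.
 */
enum emu_verdict emu_filter(const struct emu_range *native, size_t count,
                            uint64_t ip)
{
    for (size_t i = 0; i < count; i++) {
        if (ip >= native[i].start && ip < native[i].end)
            return EMU_ALLOW;
    }
    return EMU_REDIRECT;
}

/*
 * Full path for one syscall: the emulation filter runs first, and only an
 * "allow" outcome goes on to the regular seccomp filters.
 */
enum emu_action emu_dispatch(const struct emu_range *native, size_t count,
                             uint64_t ip, long nr, seccomp_check_fn seccomp)
{
    if (emu_filter(native, count, ip) == EMU_REDIRECT)
        return EMU_ACT_REDIRECTED;
    return seccomp(nr) ? EMU_ACT_EXECUTED : EMU_ACT_DENIED;
}

/* Example stand-in policy: deny syscall number 59, allow everything else. */
bool example_seccomp_policy(long nr)
{
    return nr != 59;
}
```

In this model a redirected syscall never reaches the seccomp stand-in,
which is one of the places where the exact semantics would need care.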