Re: [PATCH] seccomp: plug syscall-dodging ptrace hole

From: Andy Lutomirski
Date: Fri May 27 2016 - 00:45:33 EST

On Thu, May 26, 2016 at 7:41 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> On Thu, May 26, 2016 at 7:10 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> On Thu, May 26, 2016 at 2:04 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>> One problem with seccomp was that ptrace could be used to change a
>>> syscall after seccomp filtering had completed. This was a well documented
>>> limitation, and it was recommended to block ptrace when defining a filter
>>> to avoid this problem. This can be quite a limitation for containers or
>>> other places where ptrace is desired even under seccomp filters.
>>> Since seccomp filtering has been split into pre-trace and trace phases
>>> (phase1 and phase2 respectively), it's possible to re-run phase1 seccomp
>>> after ptrace. This makes that change, and updates the test suite for
>>> both SECCOMP_RET_TRACE and PTRACE_SYSCALL manipulation.
>> I like fixing the hole, but I don't like this fix.
>> The two-phase seccomp mechanism is messy. I wrote it because it was a
>> huge speedup. Since then, I've made a ton of changes to the way that
>> x86 syscalls work, and there are two relevant effects: the slow path
>> is quite fast, and the phase-1-only path isn't really a win any more.
>> I suggest that we fix the by simplifying the code instead of making it
>> even more complicated. Let's back out the two-phase mechanism (but
>> keep the ability for arch code to supply seccomp_data) and then just
>> reorder it so that seccomp happens after ptrace. The result should be
>> considerably simpler. (We'll still have to answer the question of
>> what happens when a SECCOMP_RET_TRACE event changes the syscall, but
>> maybe the answer is to just let it through -- after all,
>> SECCOMP_RET_TRACE might be a request by a tracer to do its own
>> internal filtering.)
> I'm really against this. I think seccomp needs to stay first,

Why? What use case is improved with it going first?

> and I
> like the two-phase split because it gives us a lot of flexibility on
> other architectures.

I thought so too when I wrote it, and I even tried a bit to evangelize
it to other arch maintainers. So far, it's used *only* in x86, and it
would IMO be a cleanup to stop using it in x86 now. Given my
experience cleanup up the x86 syscall path, my current advice to other
arch maintainers would be to try hard to avoid having a context in
which syscall args are known but ptrace can't be invoked (as x86 had
before Linux 4.5).

> And we can't just let through RET_TRACE because
> we'll have exactly the same problem: a process can add a RET_TRACE
> filter for some syscall and then change it arbitrarily to escape the
> filtering. The non-trace returns of seccomp need to be check first and
> after ptrace manipulations. The patch seems like the best approach and
> it covers all the corners.

But RET_TRACE really is special.

Suppose you have a tracer and you use SECCOMP_RET_TRACE. If the
tracer sees a syscall, approves, and calls PTRACE_CONT, then the
syscall will be allowed, whereas the effect of SECCOMP_RET_TRACE run
anew would be to either force -ENOSYS or to trap back to the tracer,
depending on whether there is a tracer. Your patch has a
SECCOMP_RET_TRACE special case, whereas my approach wouldn't need a
special case.

I think your patch also has a minor hole: if you have
SECCOMP_RET_TRACE *and* a tracer that's catching syscalls directly
(PTRACE_SYSCALL), then the PTRACE_SYSCALL action can modify a syscall
after TRACE does its thing but before recheck, and can then redirect
to another RET_TRACE action that would otherwise be denied. This is
minor because it could only happen if the tracer actively fights with

Finally, I think that the your approach would break an existing valid
use case. Suppose I have a tracer that wants to intercept some
syscall sys_foo (using SECCOMP_RET_TRACE) and, when it sees a sys_foo
attempt, it will implement it by redirecting it so some other syscall
that wouldn't be allowed if called directly (i.e. it would return
SECCOMP_RET_KILL or similar). Currently, it'll work. With your
patch, it will kill the tracee. I think the former behavior is
better. On the flip side, if you write a program that uses
SECCOMP_RET_TRACE, you more or less have to trust the tracer to begin

One more reason to prefer my approach: currently, if you strace a
process that gets killed by SECCOMP_RET_KILL, you can't tell what
killed it. For example:

prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, {len = 1, filter =
0x7ffe7b2b7d30}) = 0
+++ killed by SIGSYS +++
Bad system call (core dumped)

With my approach, strace will have the IMO much more sensible behavior
of showing the fatal syscall entry before showing the "killed by