Re: [PATCH] seccomp: plug syscall-dodging ptrace hole

From: Kees Cook
Date: Fri May 27 2016 - 18:39:01 EST


On Fri, May 27, 2016 at 12:52 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> On May 27, 2016 11:42 AM, "Kees Cook" <keescook@xxxxxxxxxxxx> wrote:
>>
>> On Thu, May 26, 2016 at 9:45 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> > On Thu, May 26, 2016 at 7:41 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>> >> On Thu, May 26, 2016 at 7:10 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> >>> On Thu, May 26, 2016 at 2:04 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>> >>>> One problem with seccomp was that ptrace could be used to change a
>> >>>> syscall after seccomp filtering had completed. This was a well documented
>> >>>> limitation, and it was recommended to block ptrace when defining a filter
>> >>>> to avoid this problem. This can be quite a limitation for containers or
>> >>>> other places where ptrace is desired even under seccomp filters.
>> >>>>
>> >>>> Since seccomp filtering has been split into pre-trace and trace phases
>> >>>> (phase1 and phase2 respectively), it's possible to re-run phase1 seccomp
>> >>>> after ptrace. This makes that change, and updates the test suite for
>> >>>> both SECCOMP_RET_TRACE and PTRACE_SYSCALL manipulation.
>> >>>
>> >>> I like fixing the hole, but I don't like this fix.
>> >>>
>> >>> The two-phase seccomp mechanism is messy. I wrote it because it was a
>> >>> huge speedup. Since then, I've made a ton of changes to the way that
>> >>> x86 syscalls work, and there are two relevant effects: the slow path
>> >>> is quite fast, and the phase-1-only path isn't really a win any more.
>> >>>
>> >>> I suggest that we fix the by simplifying the code instead of making it
>> >>> even more complicated. Let's back out the two-phase mechanism (but
>> >>> keep the ability for arch code to supply seccomp_data) and then just
>> >>> reorder it so that seccomp happens after ptrace. The result should be
>> >>> considerably simpler. (We'll still have to answer the question of
>> >>> what happens when a SECCOMP_RET_TRACE event changes the syscall, but
>> >>> maybe the answer is to just let it through -- after all,
>> >>> SECCOMP_RET_TRACE might be a request by a tracer to do its own
>> >>> internal filtering.)
>> >>
>> >> I'm really against this. I think seccomp needs to stay first,
>> >
>> > Why? What use case is improved with it going first?
>>
>> I feel that the critical purpose of seccomp is to minimize attack
>> surface. To that end, I am strongly against anything coming before it
>> in the syscall path. I really do not want ptrace going first, I think
>> it's just asking for bugs.
>
> I disagree in this case. There's no actual code surface opened up.
> If seccomp allows even a single syscall through and there's a ptracer
> attached, then the ptrace code is exposed. As far as ptrace is
> concerned, the syscall number is just a number, and ptrace has
> basically no awareness of the arguments.

No, I completely disagree: there is a significant amount of surface
exposed. With a tracer attached there is significantly more happening
before the filter would be checked. Even less obvious things like
signal delivery, etc get exposed. Seccomp must be first -- this is
it's basic design principle. Bugs creep in, unexpected combinations
creep in, etc. Seccomp must mitigate this and be first on the syscall
path. The paranoia of this design principle must remain in place, even
at the expense of some inelegant results.

>> >> and I
>> >> like the two-phase split because it gives us a lot of flexibility on
>> >> other architectures.
>> >
>> > I thought so too when I wrote it, and I even tried a bit to evangelize
>> > it to other arch maintainers. So far, it's used *only* in x86, and it
>> > would IMO be a cleanup to stop using it in x86 now. Given my
>> > experience cleanup up the x86 syscall path, my current advice to other
>> > arch maintainers would be to try hard to avoid having a context in
>> > which syscall args are known but ptrace can't be invoked (as x86 had
>> > before Linux 4.5).
>>
>> Well, I've got most of the ARM 2-phase port done, but haven't gotten
>> it all the way finished. But I could be talked into removing the
>> 2-phase just from the perspective of reducing complexity.
>
> Does it help anything? It's certainly more complex and harder to audit.
>
> On x86, it used to save hundreds of cycles. Now it's probably five
> cycles or so at most. It could even be a loss because of increased
> code size.

Yeah, on ARM I wasn't seeing much benefit, so I'm okay with dropping 2-phase.

>> >> And we can't just let through RET_TRACE because
>> >> we'll have exactly the same problem: a process can add a RET_TRACE
>> >> filter for some syscall and then change it arbitrarily to escape the
>> >> filtering. The non-trace returns of seccomp need to be check first and
>> >> after ptrace manipulations. The patch seems like the best approach and
>> >> it covers all the corners.
>> >
>> > But RET_TRACE really is special.
>> >
>> > Suppose you have a tracer and you use SECCOMP_RET_TRACE. If the
>> > tracer sees a syscall, approves, and calls PTRACE_CONT, then the
>> > syscall will be allowed, whereas the effect of SECCOMP_RET_TRACE run
>> > anew would be to either force -ENOSYS or to trap back to the tracer,
>> > depending on whether there is a tracer. Your patch has a
>> > SECCOMP_RET_TRACE special case, whereas my approach wouldn't need a
>> > special case.
>>
>> But after the seccomp re-trap, we'd still have to re-check the
>> filters.
>
> Why? We don't do the now and, as far as I know, it's not a problem.

I feel like we're not talking about the same thing here. :) So, right
now, there are two places where we can turn over control to the
tracer: during tracerhook entry, and during phase2 RET_TRACE handling.
In both places we need to re-run phase1 to make sure any changes to
the syscall get seen by the seccomp filter.

> If we change that, I think it should be its own patch.

This was the patch. :)

>> So it's not cleaner, and we gain attack surface. Don't get me
>> wrong, I totally see why you're suggesting doing ptrace first, but I
>> still think that attack surface reduction must remain the primary
>> principle of seccomp.
>>
>> > I think your patch also has a minor hole: if you have
>> > SECCOMP_RET_TRACE *and* a tracer that's catching syscalls directly
>> > (PTRACE_SYSCALL), then the PTRACE_SYSCALL action can modify a syscall
>> > after TRACE does its thing but before recheck, and can then redirect
>> > to another RET_TRACE action that would otherwise be denied. This is
>> > minor because it could only happen if the tracer actively fights with
>> > itself.
>>
>> So, a few thoughts went into the patch design, and here's why I don't
>> think this hole is a hole (which you already talk about a bit):
>> - no filtered syscall could be suddenly made to execute (this is core
>> goal, obviously)
>> - worst case, the tracer doesn't notice a syscall marked for
>> RET_TRACE, but that would be it's own fault because it chose to put
>> itself in that state.
>> - RET_TRACE is just under RET_ALLOW in priority, so even if there were
>> some really weird unintended side-effects, we still wouldn't be
>> allowing a syscall.
>>
>> > Finally, I think that the your approach would break an existing valid
>> > use case. Suppose I have a tracer that wants to intercept some
>> > syscall sys_foo (using SECCOMP_RET_TRACE) and, when it sees a sys_foo
>> > attempt, it will implement it by redirecting it so some other syscall
>> > that wouldn't be allowed if called directly (i.e. it would return
>> > SECCOMP_RET_KILL or similar). Currently, it'll work. With your
>> > patch, it will kill the tracee. I think the former behavior is
>> > better. On the flip side, if you write a program that uses
>> > SECCOMP_RET_TRACE, you more or less have to trust the tracer to begin
>> > with.
>>
>> Well, you say "valid" here, but I actually think this is a
>> mis-feature. Luckily, nothing actually uses this behavior, and
>> multiple seccomp users want to see this fixed. As for tracer trust,
>> yes. RET_TRACE was designed for a tracer to do things on behalf of the
>> process, not to redirect it through an otherwise blocked syscall. The
>> use cases for that behavior is few and far between (and is currently
>> unused), where-as the use case for actually blocking the syscalls the
>> filters claim to block is much more desired.
>
> Then let's make that a separate patch.
>
> The logic could be:
>
> if (ptraced)
> invoke ptrace;
>
> if (seccomp) {
> evaluate the filter;
> if (result == SECCOMP_RET_TRACE) {
> evaluate filter again;
> if (result != SECCOMP_RET_TRACE)
> perform the action;
> }

We must run seccomp first. So:

if (seccomp) {
evaluate the filter;
if (result == SECCOMP_RET_TRACE) {
invoke ptrace;
evaluate filter again;
}
}

if (ptraced) {
invoke ptrace;
evaluate filter again;
}

Which is what I've already sent. If we want to avoid multiple
re-evaluations in the worst-case tracer state, we could delay
re-evaluation until after the ptrace invocation, but I don't think
that's worth the design complication (since it requires per-arch
changes). Doing it after RET_TRACE and after ptrace gets done in
common code, and would be hit so rarely (a tracer with both RET_TRACE
and PTRACE_SYSCALL?!) as to be irrelevant.

> and we could do it in three steps:
>
> 1. Remove two-phase seccomp.
>
> 2. Move ptrace first.
>
> 3. Add the re-check on SECCOMP_RET_TRACE.
>
>>
>> > One more reason to prefer my approach: currently, if you strace a
>> > process that gets killed by SECCOMP_RET_KILL, you can't tell what
>> > killed it. For example:
>> >
>> > prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, {len = 1, filter =
>> > 0x7ffe7b2b7d30}) = 0
>> > +++ killed by SIGSYS +++
>> > Bad system call (core dumped)
>> >
>> > With my approach, strace will have the IMO much more sensible behavior
>> > of showing the fatal syscall entry before showing the "killed by
>> > SIGSYS".
>>
>> Right, I know, it's aesthetically much nicer that way, but I really
>> want to stay totally paranoid and keep seccomp absolutely first on the
>> path.
>>
>> How about this: we'll use this patch as-is for now, since I'd like to
>> be able to start getting feedback from the container-using folks ASAP,
>> and then we can redesign the 2-phase system going forward from there.
>>
>
> I think I'd rather change the ABI as few times as possible. On the
> other hand, it's still early, and I see nothing wrong with adding it
> to -next.

I much prefer to use what I sent originally unless it isn't actually
doing what it intends to be doing (i.e. there is still a hole I don't
see, etc).

-Kees

--
Kees Cook
Chrome OS & Brillo Security