Re: [PATCH] seccomp: plug syscall-dodging ptrace hole

From: Kees Cook
Date: Fri May 27 2016 - 22:35:50 EST

On Fri, May 27, 2016 at 4:20 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> On May 27, 2016 3:38 PM, "Kees Cook" <keescook@xxxxxxxxxxxx> wrote:
>> On Fri, May 27, 2016 at 12:52 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> > On May 27, 2016 11:42 AM, "Kees Cook" <keescook@xxxxxxxxxxxx> wrote:
>> >>
>> >> On Thu, May 26, 2016 at 9:45 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> >> > On Thu, May 26, 2016 at 7:41 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>> >> >> On Thu, May 26, 2016 at 7:10 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> >> >>> On Thu, May 26, 2016 at 2:04 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>> >> >>>> One problem with seccomp was that ptrace could be used to change a
>> >> >>>> syscall after seccomp filtering had completed. This was a well documented
>> >> >>>> limitation, and it was recommended to block ptrace when defining a filter
>> >> >>>> to avoid this problem. This can be quite a limitation for containers or
>> >> >>>> other places where ptrace is desired even under seccomp filters.
>> >> >>>>
>> >> >>>> Since seccomp filtering has been split into pre-trace and trace phases
>> >> >>>> (phase1 and phase2 respectively), it's possible to re-run phase1 seccomp
>> >> >>>> after ptrace. This makes that change, and updates the test suite for
>> >> >>>> both SECCOMP_RET_TRACE and PTRACE_SYSCALL manipulation.
>> >> >>>
>> >> >>> I like fixing the hole, but I don't like this fix.
>> >> >>>
>> >> >>> The two-phase seccomp mechanism is messy. I wrote it because it was a
>> >> >>> huge speedup. Since then, I've made a ton of changes to the way that
>> >> >>> x86 syscalls work, and there are two relevant effects: the slow path
>> >> >>> is quite fast, and the phase-1-only path isn't really a win any more.
>> >> >>>
>> >> >>> I suggest that we fix the by simplifying the code instead of making it
>> >> >>> even more complicated. Let's back out the two-phase mechanism (but
>> >> >>> keep the ability for arch code to supply seccomp_data) and then just
>> >> >>> reorder it so that seccomp happens after ptrace. The result should be
>> >> >>> considerably simpler. (We'll still have to answer the question of
>> >> >>> what happens when a SECCOMP_RET_TRACE event changes the syscall, but
>> >> >>> maybe the answer is to just let it through -- after all,
>> >> >>> SECCOMP_RET_TRACE might be a request by a tracer to do its own
>> >> >>> internal filtering.)
>> >> >>
>> >> >> I'm really against this. I think seccomp needs to stay first,
>> >> >
>> >> > Why? What use case is improved with it going first?
>> >>
>> >> I feel that the critical purpose of seccomp is to minimize attack
>> >> surface. To that end, I am strongly against anything coming before it
>> >> in the syscall path. I really do not want ptrace going first, I think
>> >> it's just asking for bugs.
>> >
>> > I disagree in this case. There's no actual code surface opened up.
>> > If seccomp allows even a single syscall through and there's a ptracer
>> > attached, then the ptrace code is exposed. As far as ptrace is
>> > concerned, the syscall number is just a number, and ptrace has
>> > basically no awareness of the arguments.
>> No, I completely disagree: there is a significant amount of surface
>> exposed. With a tracer attached there is significantly more happening
>> before the filter would be checked. Even less obvious things like
>> signal delivery, etc get exposed. Seccomp must be first -- this is
>> it's basic design principle. Bugs creep in, unexpected combinations
>> creep in, etc. Seccomp must mitigate this and be first on the syscall
>> path. The paranoia of this design principle must remain in place, even
>> at the expense of some inelegant results.
> But this only works if the filter is literally "deny everything". If
> there is even a single syscall allowed and a ptracer is attached, then
> the whole ptrace machinery is exposed anyway.

That's an excellent point. Thanks for persisting on this, I'm starting
to come around. :)

> Users who are this paranoid about attack surface need to disable
> ptrace, full stop. If you can do the ptrace(2) syscall, then you can
> invoke all the nasty code paths by yourself, and there is nothing
> seccomp can do about it. All seccomp can do is prevent ptrace from
> generating a syscall that would otherwise be filtered out.
> Let's look at the actual supposed attack surface:
> if (unlikely(work & _TIF_SYSCALL_EMU))
> ret = -1L;
> if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
> tracehook_report_syscall_entry(regs))
> ret = -1L;
> That's all.

Yeah, looking through this, I see that audit isn't part of this, so
I'm much more relaxed.

> The only way that TIF_SYSCALL_EMU or TIF_SYSCALL_TRACE gets set is if
> a ptracer is attached and uses PTRACE_SYSEMU, PTRACE_SYSCALL or
> similar. If that has happened, then there's very, very little that
> seccomp can possibly do to reduce attack surface. First, literally
> any syscall that results in SECCOMP_RET_OK will cause all of the same
> kernel code paths to run. Second, most of those code paths can be
> triggered without any syscall at all by using PTRACE_SINGLESTEP
> instead.

Agreed: the attack surface on ptrace is unavoidable under normal
conditions. All the more reason to set Yama's sysctl to 2! ;)

> So I challenge you to find a realistic scenario (i.e. something that a
> real program might actually program into seccomp) in which running
> seccomp before ptrace avoids even a single line of code worth of
> attack surface.
> On the flip side, changing the order has lots of benefits:
> 1. strace can actually be used to figure out why your program is
> getting killed by seccomp. (IIRC there was a long article about
> trying to debug something in the chromium graphics sandbox that was
> resulting in a seccomp kill. strace would have make it easier if it
> worked.)
> 2. The infamous ptrace hold gets plugged with much less code (and, if
> we want to plug it for SECCOMP_RET_TRACE, that's still less code than
> with your patch). I would argue that reducing the amount of code in
> seccomp is more important from an attack surface perspective than
> restricting the values that a bunch of integers that the kernel's
> ptrace code doesn't even care about.
> 3. IMO it makes more sense. A debugger is trying to debug. I think
> it makes more sense for it to see syscalls as requested by the issuing
> process rather than syscalls as modified by seccomp.

Yeah, it's much much cleaner and has more expected results when
reordered. The down side is that this needs to be changed on a
per-architecture basis. (But that might be a good time to pre-populate
struct seccomp_data.)

Hmmm. Reordering may actually let this not be arch-specific any more,
but I dunno... I'll look through it...


Kees Cook
Chrome OS & Brillo Security