Re: "run seccomp after ptrace" changes expose "missing PTRACE_EVENT_EXIT" bug
From: Kees Cook
Date: Thu Aug 04 2016 - 01:24:35 EST
On Wed, Aug 3, 2016 at 4:51 PM, Robert O'Callahan <robert@xxxxxxxxxxxxx> wrote:
> I work on rr (http://rr-project.org/), a record-and-replay reverse-execution
> debugger which is a heavy user of ptrace and seccomp. The recent change to
> perform syscall-entry PTRACE_SYSCALL stops before PTRACE_EVENT_SECCOMP stops
> broke rr, which is fine because I'm fixing rr and this change actually makes
> rr faster (thanks!). However, it exposed an existing kernel bug which
> creates a problem for us, and which I'm not sure how to fix.
>
> The problem is that if a tracee task is in a PTRACE_EVENT_SECCOMP trap, or
> has been resumed after such a trap but not yet been scheduled, and another
> task in the thread-group calls exit_group(), then the tracee task exits
> without the ptracer receiving a PTRACE_EVENT_EXIT notification. Small-ish
> testcase here:
> https://gist.github.com/rocallahan/1344f7d01183c233d08a2c6b93413068.
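
For context, the notification in question reaches the tracer as a waitpid() status. Below is a minimal sketch of how a tracer might recognize it, following the encoding documented in ptrace(2). The helper names are hypothetical and not taken from rr or the testcase.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/*
 * Return the PTRACE_EVENT_* code carried by a waitpid() status, or 0 if
 * the status is not a ptrace event stop.
 * Hypothetical helper for illustration; not taken from rr.
 */
static int ptrace_event_from_status(int status)
{
    /* Event stops are reported as a stop with SIGTRAP as the stop signal. */
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
        return 0;
    /* The event number sits in the bits above the stop signal. */
    return (status >> 16) & 0xff;
}

/* True when the status says the tracee is about to exit. */
static int is_exit_event(int status)
{
    return ptrace_event_from_status(status) == PTRACE_EVENT_EXIT;
}
```

A tracer that set PTRACE_O_TRACEEXIT would normally see is_exit_event() return true for each exiting tracee. In the scenario reported here, that status never arrives for the affected task.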
>
> The bug happens because when __seccomp_filter() detects
> fatal_signal_pending(), it calls do_exit() without dequeuing the fatal
> signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and that
> task is descheduled, __schedule() notices that there is a fatal signal
> pending and changes its state from TASK_TRACED to TASK_RUNNING. That
> prevents the ptracer's waitpid() from returning the ptrace event. A more
> detailed analysis is here:
> https://github.com/mozilla/rr/issues/1762#issuecomment-237396255.
>
> This bug has been in the kernel for a while. rr never hit it before because
> we trace all threads and mostly run only one tracee thread at a time.
> Immediately after each PTRACE_EVENT_SECCOMP notification we'd issue a
> PTRACE_SYSCALL to get that task to the syscall-entry PTRACE_SYSCALL stop, so
> there was never an opportunity for one tracee thread to call exit_group()
> while another tracee was in the problematic part of __seccomp_filter().
> Unfortunately now there is no way for us to avoid that possibility.
>
> My guess is that __seccomp_filter() should dequeue the fatal signal it
> detects before calling do_exit(), to behave more like get_signal(). Is that
> correct, and if so, what would be the right way to do that?
Thanks for the detailed analysis! I'll take a look at what can be done
here. Off the top of my head, I don't see a problem with what you're
suggesting. Let me see what I can come up with.
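
For reference, the sequence described in the report can be written down as a toy userspace model. This is not kernel code: every toy_* name is made up for illustration, and the model assumes only what the report states, namely that __schedule() moves a TASK_TRACED task with a fatal signal pending back to TASK_RUNNING.

```c
#include <stdbool.h>

/* Toy stand-ins for the kernel task states named in the report. */
enum toy_state { TOY_RUNNING, TOY_TRACED };

struct toy_task {
    enum toy_state state;
    bool fatal_signal_pending;
};

/*
 * Models the check attributed to __schedule(): a task that is about to
 * sleep with a fatal signal pending is put back to running instead.
 */
static void toy_schedule(struct toy_task *t)
{
    if (t->state != TOY_RUNNING && t->fatal_signal_pending)
        t->state = TOY_RUNNING;
}

/*
 * Models do_exit() reporting PTRACE_EVENT_EXIT: the task enters a traced
 * stop and is descheduled. Returns true if it is still in the traced
 * stop afterwards, i.e. the tracer's waitpid() could observe the event.
 */
static bool toy_report_exit_event(struct toy_task *t)
{
    t->state = TOY_TRACED;
    toy_schedule(t);
    return t->state == TOY_TRACED;
}

/* Behaviour as reported: exit with the fatal signal still pending. */
static bool toy_exit_without_dequeue(struct toy_task *t)
{
    return toy_report_exit_event(t);
}

/* Suggested direction: dequeue the fatal signal before exiting. */
static bool toy_exit_with_dequeue(struct toy_task *t)
{
    t->fatal_signal_pending = false;
    return toy_report_exit_event(t);
}
```

In this model the event is lost only when the signal is left pending, which matches the suggestion to dequeue it first. Whether the real fix belongs in __seccomp_filter() or elsewhere is still open.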
-Kees
>
> Thanks,
> Robert O'Callahan
--
Kees Cook
Brillo & Chrome OS Security