Re: "run seccomp after ptrace" changes expose "missing PTRACE_EVENT_EXIT" bug
From: Robert O'Callahan
Date: Wed Aug 03 2016 - 19:56:59 EST
I work on rr (http://rr-project.org/), a record-and-replay
reverse-execution debugger which is a heavy user of ptrace and
seccomp. The recent change to perform syscall-entry PTRACE_SYSCALL
stops before PTRACE_EVENT_SECCOMP stops broke rr, which is fine
because I'm fixing rr and this change actually makes rr faster
(thanks!). However, it exposed an existing kernel bug which creates a
problem for us, and which I'm not sure how to fix.
The problem is that if a tracee task is in a PTRACE_EVENT_SECCOMP
trap, or has been resumed after such a trap but not yet been
scheduled, and another task in the thread-group calls exit_group(),
then the tracee task exits without the ptracer receiving a
PTRACE_EVENT_EXIT notification. Small-ish testcase here:
https://gist.github.com/rocallahan/1344f7d01183c233d08a2c6b93413068.
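For readers less familiar with how a ptracer observes these events, here is a minimal sketch of the status check, following the encoding documented in ptrace(2). The helper name is_ptrace_event is hypothetical and is not taken from rr or from the testcase above.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Hypothetical helper: true if a waitpid() status reports a stop for
 * the given PTRACE_EVENT_* code. Per ptrace(2), such a stop satisfies
 * status >> 8 == (SIGTRAP | (event << 8)). */
bool is_ptrace_event(int status, int event)
{
    return WIFSTOPPED(status) &&
           (status >> 8) == (SIGTRAP | (event << 8));
}
```

A tracer that has set PTRACE_O_TRACEEXIT would normally expect is_ptrace_event(status, PTRACE_EVENT_EXIT) to hold for one waitpid() result before the tracee disappears. In the scenario described here, that status is never delivered.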
The bug happens because when __seccomp_filter() detects
fatal_signal_pending(), it calls do_exit() without dequeuing the fatal
signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and
that task is descheduled, __schedule() notices that a fatal signal is
still pending and changes the task's state from TASK_TRACED to
TASK_RUNNING. That prevents the ptracer's waitpid() from returning the
ptrace event.
A more detailed analysis is here:
https://github.com/mozilla/rr/issues/1762#issuecomment-237396255.
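The sequence described above can be illustrated with a small user-space toy model. This is not kernel code: every name prefixed with model_ or MODEL_ is hypothetical, and the logic is a deliberate simplification of the described behaviour, written only to show why a still-pending fatal signal keeps the stop from persisting.

```c
#include <stdbool.h>

/* Toy task states; the names mirror the kernel's, the values are arbitrary. */
enum model_state { MODEL_TASK_RUNNING, MODEL_TASK_TRACED };

struct model_task {
    enum model_state state;
    bool fatal_signal_pending;
};

/* Models the descheduling step: a task that is about to sleep in the
 * traced state while a fatal signal is pending is made runnable again
 * instead of staying stopped. */
void model_deschedule(struct model_task *t)
{
    if (t->state == MODEL_TASK_TRACED && t->fatal_signal_pending)
        t->state = MODEL_TASK_RUNNING;
}

/* Models the exit path: enter the traced state for the exit event,
 * then get descheduled. Returns true if the task is still stopped,
 * i.e. the tracer would have something to observe. */
bool model_exit_event_stop(struct model_task *t)
{
    t->state = MODEL_TASK_TRACED;
    model_deschedule(t);
    return t->state == MODEL_TASK_TRACED;
}
```

In this model, a task that reaches model_exit_event_stop() with fatal_signal_pending set ends up back in MODEL_TASK_RUNNING, so the stop is lost. With the flag cleared, the task stays in MODEL_TASK_TRACED.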
This bug has been in the kernel for a while. rr never hit it before
because we trace all threads and mostly run only one tracee thread at
a time. Immediately after each PTRACE_EVENT_SECCOMP notification we'd
issue a PTRACE_SYSCALL to get that task to the syscall-entry
PTRACE_SYSCALL stop, so there was never an opportunity for one tracee
thread to call exit_group() while another tracee was in the problematic
part of __seccomp_filter(). Unfortunately, there is now no way for us
to avoid that possibility.
My guess is that __seccomp_filter() should dequeue the fatal signal it
detects before calling do_exit(), to behave more like get_signal(). Is
that correct, and if so, what would be the right way to do that?
Thanks,
Robert O'Callahan