Re: Question regarding ptrace work for Linux v3.1

From: Patrick Donnelly
Date: Mon Mar 21 2016 - 15:24:19 EST


On Mon, Mar 21, 2016 at 3:07 PM, Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
> On 03/21, Patrick Donnelly wrote:
>>
>> That seems to be the case but it will only report certain events (not
>> syscalls). I have observed PTRACE_EVENT_EXIT and PTRACE_EVENT_CLONE
>> events... Hmm, now that I think about this, it would be necessary to
>> see the initial SIGSTOP (or PTRACE_EVENT_STOP) in order to initiate
>> syscall tracing via PTRACE_SYSCALL. So that does seem to indicate the
>> problem.
>
> Yes, exactly, you need to see the initial SIGSTOP or another event which
> can be reported before it.

Assuming a SIGSTOP is being silenced, is there anything we can do to
forcibly start tracing syscalls? (For kernels without PTRACE_SEIZE)
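
For concreteness, the normal bootstrap sequence under discussion can be
sketched as below: the tracer has to see the tracee's first stop before
it can issue PTRACE_SYSCALL. This is only a minimal illustration, not
Parrot's code. The helper name count_syscall_stops and the toy child
that calls getppid() are made up for the example, and it uses
PTRACE_TRACEME on a forked child rather than PTRACE_ATTACH.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a tracee, wait for its bootstrap SIGSTOP,
 * then enable syscall tracing. Returns the number of syscall stops
 * seen, or -1 if the bootstrap stop was not observed or ptrace failed. */
static int count_syscall_stops(void)
{
    int status, stops = 0;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(127);
        raise(SIGSTOP);      /* bootstrap stop the tracer must observe */
        getppid();           /* a syscall for the tracer to see */
        _exit(0);
    }

    /* Without this first stop there is no point at which the tracer
     * can issue PTRACE_SYSCALL. */
    if (waitpid(child, &status, 0) < 0)
        return -1;
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGSTOP) {
        if (!WIFEXITED(status) && !WIFSIGNALED(status)) {
            kill(child, SIGKILL);
            waitpid(child, &status, 0);
        }
        return -1;
    }

    /* Mark syscall stops with bit 0x80 so they differ from real SIGTRAPs. */
    ptrace(PTRACE_SETOPTIONS, child, NULL, (void *)(long)PTRACE_O_TRACESYSGOOD);

    for (;;) {
        /* Signal argument 0: the stop signal is suppressed, not delivered. */
        if (ptrace(PTRACE_SYSCALL, child, NULL, NULL) < 0) {
            kill(child, SIGKILL);
            waitpid(child, &status, 0);
            return -1;
        }
        if (waitpid(child, &status, 0) < 0)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;
        if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80))
            stops++;
    }
    return stops;
}
```

If the bootstrap SIGSTOP is cancelled before the tracer sees it, the
first waitpid() in a sketch like this never reports the stop, which
matches the symptom described above.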

>> > To clarify, the usage of SIGSTOP in ptrace was always buggy by design.
>> > For example, SIGCONT from somewhere can remove the pending (and not yet
>> > reported) SIGSTOP, and this _can_ explain the problem you hit.
>>
>> The processes in the traced tree do not send any signals, but an
>> external process may have.
>
> I am looking into
>
> https://github.com/cooperative-computing-lab/cctools/blob/5ccb04599ba2ee125730981f53add80d98cf8161/parrot/src/pfs_main.cc
>
> and this code
>
> case SIGSTOP:
> /* Black magic to get threads working on old Linux kernels... */
>
> if(p->nsyscalls == 0) { /* stop before we begin running the process */
> debug(D_DEBUG, "suppressing bootstrap SIGSTOP for %d",pid);
> signum = 0; /* suppress delivery */
> kill(p->pid,SIGCONT);
> }
> break;
>
> doesn't look right. Note that kill(pid,SIGCONT) affects the whole thread-
> group. So if this kill() races with another thread doing clone() you can
> hit the problem you described.

You're right, that should be tkill! I will give that a try and report
back on whether it solves the issue for our collaborators...
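
A minimal sketch of the per-thread variant, assuming the raw tkill
system call is invoked through syscall(2) (glibc has traditionally not
provided a wrapper for it). The helper name signal_one_thread is
hypothetical and this is not the actual Parrot change:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send `sig` to the single thread `tid` instead of
 * the whole thread group, which is what kill(pid, sig) targets.
 * Returns 0 on success, -1 with errno set on failure. */
static int signal_one_thread(pid_t tid, int sig)
{
    return (int) syscall(SYS_tkill, tid, sig);
}
```

In the SIGSTOP case quoted above, the call would then look like
signal_one_thread(p->pid, SIGCONT) in place of kill(p->pid, SIGCONT).
The tkill(2) man page describes tkill as an obsolete predecessor of
tgkill, which additionally takes the thread group ID and so guards
against thread ID reuse; that may be the better choice where the thread
group ID is known.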

>> > But unless you use PTRACE_SEIZE the same can happen on v3.1 so it seems
>> > there is something else.
>>
>> Okay, it might be that PTRACE_SEIZE fixes it.
>
> Yes, but iiuc you do not see this problem on v3.1 even with PTRACE_ATTACH?

I have not tested on >=v3.1 with PTRACE_ATTACH. As you know, v3.1 was
when the PTRACE_SEIZE code was merged along with many other changes.
[I actually thought the merge occurred in 3.4 because of the ptrace
man page. I have submitted a bug report to get that fixed.] I have not
had any reports of the problem on Linux v3.1 or later.

Again, I will see if the kill system call was the cause and report
back if so. Thanks for taking the time to look at the code!

--
Patrick Donnelly