Re: A peculiarity in ptrace/waitpid behavior

From: Pavel Labath
Date: Mon Mar 23 2015 - 05:42:43 EST


On 21 March 2015 at 18:57, Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
> On 03/20, Pavel Labath wrote:
>>
>> One difference I see though is that in
>> our test, we are not sending any additional signals to the thread in
>> question (at least we shouldn't be sending them, but we are sending some
>> signals to other threads in the same process). Do you think it could still
>> be the same issue?
>
> Not sure...
>
> And. I found another race, which looks more promising wrt your description.
> ptrace_resume() sets ->exit_code before it wakes the tracee up. If the
> tracer's sub-thread calls wait() right after that, it can wrongly see
> task_stopped_code(tracee, true) != 0, as if the tracee reports its
> ->exit_code.
Yes, that sounds very much like what is happening. :)


>
>> I would be happy to test your patch. I don't think I can patch the kernel
>> on my work machine directly, but I think I might be able to set up some
>> sort of a test environment to try it out.
>
> Thanks! could you try the patch below? It won't help my test-case, but
> _perhaps_ it can fix the problem you hit?
Thanks for working on this. I will get on testing this. It will take a
while though, because the failure is quite rare and usually only shows
up in a full test suite run, which takes over 5 minutes. :)

>
> And a couple of questions just in case...
>
> Which kernel version? Although probably this doesn't matter, this race
> is very-very old.
Linux labath 3.13.0-46-generic #79-Ubuntu SMP Tue Mar 10 20:06:50 UTC
2015 x86_64 x86_64 x86_64 GNU/Linux
It should be the kernel shipped with Ubuntu 14.04 LTS.

>
> Let me return to your description,
>
> 1) we get a waitpid() notification that the tracee got SIGUSR1
> 2) we do a ptrace(GETSIGINFO) to get more info
> 3) eventually we decide to restart the tracee with PTRACE_CONT, passing it
> SIGUSR1
> 4) immediately after that we get another waitpid notification, again with
> SIGUSR1,
>
> Does this "waitpid notification" mean that _another_ thread returns
> from waitpid() ?
Yes, we have one thread doing waitpid() processing in a loop, and
another thread is issuing PTRACE requests.

> And status == (SIGUSR1 << 8) | 0x7f , yes? IOW, is WIFSTOPPED() true?

Exactly. The following is an excerpt from our log, which I was able to capture.
548 1426700116.644772000 [7295/729a]: NativeProcessLinux::Resume()
resuming thread = 29346 with signal SIGUSR1
554 1426700116.644889000 [7295/729c]:
MonitorChildProcessThreadFunction ::waitpid (pid = -29339, &status,
options = 1073741824)...
555 1426700116.644896000 [7295/729c]:
MonitorChildProcessThreadFunction ::waitpid (pid = -29339, &status,
options = 1073741824) => pid = 29346, status = 0x00000a7f (STOPPED),
signal = 10, exit_state = 0
556 1426700116.644906000 [7295/729a]: NativeProcessLinux::Resume()
resuming thread = 29346 result = true

The PTRACE_CONT call happens somewhere between lines 548 and 556,
while the waitpid calls happens between 554 and 555. So it is quite
possible that they are called simultaneously...


Thanks for working on this.

cheers,
pl
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/