Re: Threads stuck in zap_pid_ns_processes()

From: Eric W. Biederman
Date: Sat May 13 2017 - 14:28:29 EST


Guenter Roeck <linux@xxxxxxxxxxxx> writes:

> On 05/12/2017 01:03 PM, Eric W. Biederman wrote:
>> Guenter Roeck <linux@xxxxxxxxxxxx> writes:
>>
>>> Hi Eric,
>>>
>>> On Fri, May 12, 2017 at 12:33:01PM -0500, Eric W. Biederman wrote:
>>>> Guenter Roeck <linux@xxxxxxxxxxxx> writes:
>>>>
>>>>> Hi Eric,
>>>>>
>>>>> On Fri, May 12, 2017 at 08:26:27AM -0500, Eric W. Biederman wrote:
>>>>>> Vovo Yang <vovoy@xxxxxxxxxx> writes:
>>>>>>
>>>>>>> On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
>>>>>>> <ebiederm@xxxxxxxxxxxx> wrote:
>>>>>>>> Guenter Roeck <linux@xxxxxxxxxxxx> writes:
>>>>>>>>
>>>>>>>>> What I know so far is
>>>>>>>>> - We see this condition on a regular basis in the field. Regular is
>>>>>>>>> relative, of course - let's say maybe 1 in a Milion Chromebooks
>>>>>>>>> per day reports a crash because of it. That is not that many,
>>>>>>>>> but it adds up.
>>>>>>>>> - We are able to reproduce the problem with a performance benchmark
>>>>>>>>> which opens 100 chrome tabs. While that is a lot, it should not
>>>>>>>>> result in a kernel hang/crash.
>>>>>>>>> - Vovo proviced the test code last night. I don't know if this is
>>>>>>>>> exactly what is observed in the benchmark, or how it relates to the
>>>>>>>>> benchmark in the first place, but it is the first time we are actually
>>>>>>>>> able to reliably create a condition where the problem is seen.
>>>>>>>>
>>>>>>>> Thank you. I will be interesting to hear what is happening in the
>>>>>>>> chrome perfomance benchmark that triggers this.
>>>>>>>>
>>>>>>> What's happening in the benchmark:
>>>>>>> 1. A chrome renderer process was created with CLONE_NEWPID
>>>>>>> 2. The process crashed
>>>>>>> 3. Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to every
>>>>>>> threads of the crashed process to dump info
>>>>>>> 4. When breakpad detach the crashed process, the crashed process stuck in
>>>>>>> zap_pid_ns_processes()
>>>>>>
>>>>>> Very interesting thank you.
>>>>>>
>>>>>> So the question is specifically which interaction is causing this.
>>>>>>
>>>>>> In the test case provided it was a sibling task in the pid namespace
>>>>>> dying and not being reaped. Which may be what is happening with
>>>>>> breakpad. So far I have yet to see kernel bug but I won't rule one out.
>>>>>>
>>>>>
>>>>> I am trying to understand what you are looking for. I would have thought
>>>>> that both the test application as well as the Chrome functionality
>>>>> described above show that there are situations where zap_pid_ns_processes()
>>>>> can get stuck and cause hung task timeouts in conjunction with the use of
>>>>> ptrace().
>>>>>
>>>>> Your last sentence seems to suggest that you believe that the kernel might
>>>>> do what it is expected to do. Assuming this is the case, what else would
>>>>> you like to see ? A test application which matches exactly the Chrome use
>>>>> case ? We can try to provide that, but I don't entirely understand how
>>>>> that would change the situation. After all, we already know that it is
>>>>> possible to get a thread into this condition, and we already have one
>>>>> means to reproduce it.
>>>>>
>>>>> Replacing TASK_UNINTERRUPTIBLE with TASK_INTERRUPTABLE works for both the
>>>>> test application and the Chrome benchmark. The thread is still stuck in
>>>>> zap_pid_ns_processes(), but it is now in S (sleep) state instead of D,
>>>>> and no longer results in a hung task timeout. It remains in that state
>>>>> until the parent process terminates. I am not entirely happy with it
>>>>> since the processes are still stuck and may pile up over time, but at
>>>>> least it solves the immediate problem for us.
>>>>>
>>>>> Question now is what to do with that solution. We can of course apply
>>>>> it locally to Chrome OS, but I would rather have it upstream - especially
>>>>> since we have to assume that any users of Chrome on Linux, or more
>>>>> generically anyone using ptrace in conjunction with CLONE_NEWPID, may
>>>>> experience the same problem. Right now I have no idea how to get there,
>>>>> though. Can you provide some guidance ?
>>>>
>>>> Apologies for not being clear. I intend to send a pull request with the
>>>
>>> No worries.
>>>
>>>> the TASK_UINTERRUPTIBLE to TASK_INTERRUPTIBLE change to Linus in the
>>>> next week or so with a Cc stable and an appropriate Fixes tag. So the
>>>> fix can be backported.
>>>>
>>> Great, that is good to know.
>>>
>>>> I have a more comprehensive change queued I will probably merge for 4.13
>>>> already but it just changes what kind of zombies you see. It won't
>>>> remove the ``stuck'' zombies.
>>>>
>>>> So what I am looking for now is:
>>>> Why are things getting stuck in your benchmark?
>>>>
>>>> - Is it a userspace bug?
>>>>
>>>> In which case we can figure out what userspace (aka breakpad) needs
>>>> to do to avoid the problem.
>>>>
>>> I spent some time trying to understand what breakpad is doing. I don't
>>> really see how it can be blamed. It attaches itself to a crashing thread,
>>> gets the information it is looking for, and detaches itself. At the
>>> same time, whatever it does, it seems to me that userspace should not
>>> be able to create such a situation in the first place.
>>
>> I suspect what is happening is that the thread breakpad is attached to
>> is killed by zap_pid_ns_processes. At which point PTRACE_DETACH won't
>> work because the thread has been killed and waitpid(thread_pid,...)
>> needs to be used instead.
>>
>
> I think you nailed it. If I drop CLONE_NEWPID from the reproducer I get
> a zombie process.
>
> I guess the only question left is if zap_pid_ns_processes() should (or could)
> somehow detect that situation and return instead of waiting forever.
> What do you think ?

In this case zap_pid_ns_processes is waiting for everything to be done
with the processes in the pid namespace. As something outside the
namespace is tracing processes inside the namespace things they aren't
done with the processes in the pid namespace.

So in general I think the current behavior is correct. If the tracer
(breakpad?) had set SA_NOCLDWAIT or set SIGCHLD to SIG_IGN I would
figure the failure as a kernel bug.

At this point it just appears that the tracer has a bug (ignoring
SIGCHLD and not reaping dead children) that should be fixed.

Eric