Re: Threads stuck in zap_pid_ns_processes()

From: Eric W. Biederman
Date: Thu Jun 01 2017 - 15:44:03 EST


Guenter Roeck <linux@xxxxxxxxxxxx> writes:

> On Thu, Jun 01, 2017 at 12:08:58PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck <linux@xxxxxxxxxxxx> writes:
>> >
>> > I think you nailed it. If I drop CLONE_NEWPID from the reproducer I get
>> > a zombie process.
>> >
>> > I guess the only question left is if zap_pid_ns_processes() should (or could)
>> > somehow detect that situation and return instead of waiting forever.
>> > What do you think ?
>>
>> Any chance you can point me at the chromium code that is performing the
>> ptrace?
>>
>> I want to conduct a review of the kernel semantics to see if the current
>> semantics make it unnecessarily easy to get into hang situations. If
>> the semantics make it really easy to get into a hang situation I want
>> to see if there is anything we can do to delicately change the semantics
>> to avoid the hangs without breaking existing userspace.
>>
> The internal bug should be accessible to you.
>
> https://bugs.chromium.org/p/chromium/issues/detail?id=721298&desc=2
>
> It has some additional information, and points to the following code in Chrome.
>
> https://cs.chromium.org/chromium/src/breakpad/src/client/linux/minidump_writer/linux_ptrace_dumper.cc?rcl=47e51739fd00badbceba5bc26b8abc8bbd530989&l=85
>
> With the information we have, I don't really have a good idea what we could or
> should change in Chrome to make the problem disappear, so I just concluded that
> we'll have to live with the forever-sleeping task.

I believe I see what is happening. The code makes the assumption that a
thread will stay stopped and will not go away once ptrace attach
completes.

Unfortunately if someone sends SIGKILL to the process or exec sends
SIGKILL to the individual thread then PTRACE_DETACH will fail.

At which point you can use waitpid to reap the zombie and detach
from the thread.

So I think the forever-sleeping can be fixed with something as simple
as changing ResumeThread to say:

// Resumes a thread by detaching from it.
static bool ResumeThread(pid_t pid) {
if (sys_ptrace(PTRACE_DETACH, pid, NULL, NULL) >= 0)
return true;
/* Someone killed the thread? */
return waitpid(pid, NULL, 0) == pid;
}

It almost certainly makes sense to fix PTRACE_DETACH in the kernel to
allow this case to work. And odds are good that we could make that
change without breaking anyone. So it is worth a try.

Eric