Re: Threads stuck in zap_pid_ns_processes()
From: Eric W. Biederman
Date: Thu Jun 01 2017 - 21:14:03 EST
Guenter Roeck <linux@xxxxxxxxxxxx> writes:
> On Thu, Jun 01, 2017 at 02:36:38PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck <linux@xxxxxxxxxxxx> writes:
>>
>> > On Thu, Jun 01, 2017 at 12:08:58PM -0500, Eric W. Biederman wrote:
>> >> Guenter Roeck <linux@xxxxxxxxxxxx> writes:
>> >> >
>> >> > I think you nailed it. If I drop CLONE_NEWPID from the reproducer I get
>> >> > a zombie process.
>> >> >
>> >> > I guess the only question left is if zap_pid_ns_processes() should (or could)
>> >> > somehow detect that situation and return instead of waiting forever.
>> >> > What do you think ?
>> >>
>> >> Any chance you can point me at the chromium code that is performing the
>> >> ptrace?
>> >>
>> >> I want to conduct a review of the kernel semantics to see if the current
>> >> semantics make it unnecessarily easy to get into hang situations. If
>> >> the semantics make it really easy to get into a hang situation I want
>> >> to see if there is anything we can do to delicately change the semantics
>> >> to avoid the hangs without breaking existing userspace.
>> >>
>> > The internal bug should be accessible to you.
>> >
>> > https://bugs.chromium.org/p/chromium/issues/detail?id=721298&desc=2
>> >
>> > It has some additional information, and points to the following code in Chrome.
>> >
>> > https://cs.chromium.org/chromium/src/breakpad/src/client/linux/minidump_writer/linux_ptrace_dumper.cc?rcl=47e51739fd00badbceba5bc26b8abc8bbd530989&l=85
>> >
>> > With the information we have, I don't really have a good idea what we could or
>> > should change in Chrome to make the problem disappear, so I just concluded that
>> > we'll have to live with the forever-sleeping task.
>>
>> I believe I see what is happening. The code makes the assumption that a
>> thread will stay stopped and will not go away once ptrace attach
>> completes.
>>
>> Unfortunately if someone sends SIGKILL to the process or exec sends
>> SIGKILL to the individual thread then PTRACE_DETACH will fail.
>>
>> At which point you can use waitpid to reap the zombie and detach
>> from the thread.
>>
>> So I think the forever-sleeping can be fixed with something as simple
>> as changing ResumeThread to say:
>>
>> // Resumes a thread by detaching from it.
>> static bool ResumeThread(pid_t pid) {
>> if (sys_ptrace(PTRACE_DETACH, pid, NULL, NULL) >= 0)
>> return true;
>> /* Someone killed the thread? */
>> return waitpid(pid, NULL, 0) == pid;
>> }
>>
>> It almost certainly makes sense to fix PTRACE_DETACH in the kernel to
>> allow this case to work. And odds are good that we could make that
>> change without breaking anyone. So it is worth a try.
>>
>
> Do I interpret this correctly as "the above code should work, but currently
> doesn't" ?
I added the early exit and the fallback waitpid clause. So I am saying
with a trivial modification the code can be made to work.
Eric