Re: [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped.

From: Eric W. Biederman
Date: Mon May 21 2012 - 20:16:48 EST


Oleg Nesterov <oleg@xxxxxxxxxx> writes:

> On 05/18, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@xxxxxxxxxx> writes:
>>
>> >> I think there is something very compelling about your solution,
>> >> we do need my bit about making the init process ignore SIGCHLD
>> >> so all of init's children self reap.
>> >
>> > Not sure I understand. This can work with or without 3/3 which
>> > changes zap_pid_ns_processes() to ignore SIGCHLD. And just in
>> > case, I think 3/3 is fine.
>>
>> The only issue I see is that without 3/3 we might have processes that
>> on one wait(2)s for and so will never have release_task called on.
>>
>> We do have the wait loop
>
> Yes, and we need this loop anyway, even if SIGCHLD is ignored.
> It is possible that we already have a EXIT_ZOMBIE child(s) when
> zap_pid_ns_processes().
>
>> but I think there is a race possible there.
>
> Hmm. I do not see any race, but perhaps I missed something.
> I think we can trust -ECHILD, or do_wait() is buggy.

Think about it some more you are right. For some reason
I had forgotten that without WNOHANG we don't block forever
until a child exits.

> Hmm. But there is another (off-topic) problem, security_task_wait()
> can return an error if there are some security policy problems...
> OK, this shouldn't happen I hope.

Agreed. We might be able to address that problem but that is indeed
another issue.

>> > And once again, this wait_event() + __wake_up_parent() is very
>> > simple and straightforward, we can cleanup this code later if
>> > needed.
>>
>> Yes, and it doesn't when you do an UNINTERRUPTIBLE sleep with
>> an INTERRUPTIBLE wake up unless I misread the code.
>
> Yes. so we need wait_event_interruptible() or __unhash_process()
> should use __wake_up_sync_key(wait_chldexit).
>
>> > Yes. This is the known oddity. We always notify the tracer if the
>> > leader exits, even if !thread_group_empty(). But after that the
>> > tracer can't detach, and it can't do do_wait(WEXITED).
>> >
>> > The problem is not that we can't "fix" this. Just any discussed
>> > fix adds the subtle/incompatible user-visible change.
>>
>> Yes and that is nasty.
>
> Agreed. ptrace API is nasty ;)
>
>> and moving detach_pid so we don't have to be super careful about
>> where we call task_active_pid_ns.
>
> Yes, I was thinking about this change too,
>
>> --- a/kernel/pid_namespace.c
>> +++ b/kernel/pid_namespace.c
>> @@ -189,6 +189,17 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>> rc = sys_wait4(-1, NULL, __WALL, NULL);
>> } while (rc != -ECHILD);
>>
>> + read_lock(&tasklist_lock);
>> + for (;;) {
>> + __set_current_state(TASK_INTERRUPTIBLE);
>> + if (list_empty(&current->children))
>> + break;
>> + read_unlock(&tasklist_lock);
>> + schedule();
>
> OK, but then it makes sense to add clear_thread_flag(TIF_SIGPENDING)
> before schedule, to avoid the busy-wait loop (like the sys_wait4 loop
> does). Or simply use TASK_UNINTERRUPTIBLE, I do not think it is that
> important to "fool" /proc/loadavg. But I am fine either way.

It can get darn strange when you hold a thread in stopped with ptrace
and your load mysteriously jumps. But we already have this problem
with de_thread and people aren't yelling so shrug.

So at a practical level Idon't think it is fooling /proc/loadavg but at
this point if we want more accuraccy from /proc/loadavg we need to fix
the computation and distinguish short term disk sleeps from other
uninterruptible sleeps and thus fix how /proc/loadavg is computed,
rather than hacking around with code like this.

> Maybe you can also add "ifdef CONFIG_PID_NS" into __unhash_process(),
> but this is minor too.

An #ifdef just leads to weird build failures that in weird rare
configurations. If we can hide it all away in a header fine, but
putting a bare #ifdef in the core of the code simply as a performance
optimization is ugly and a a major testing challenge. Keeping track of
all of the flying pieces with this patch has been tricky enough as it
is.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/