Re: [PATCH] pid_ns: Allow to get pid_for_children ns before child_reaper is created

From: Kirill Tkhai
Date: Thu Jun 08 2017 - 08:17:57 EST


ping

On 29.05.2017 13:49, Kirill Tkhai wrote:
> On 27.05.2017 14:01, Eric W. Biederman wrote:
>> Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> writes:
>>
>>> This patch prohibits pid allocation till child_reaper
>>> of pid namespace is set, and it makes possible and safe
>>> to get just unshared pid_ns from "/proc/[pid]/ns/pid_for_children"
>>> file. This may be useful to determine user_ns of such a created
>>> pid_ns, which is not possible now.
>>>
>>> It was prohibited till now, because the architecture of pid namespaces
>>> assumes child reaper is the firstly created process of the namespace,
>>> and it initializes pid_namespace::proc_mnt. Child reaper creation
>>> mustn't race with creation of another processes from this namespace,
>>> otherwise a process with pid > 1 may die before pid_namespace::proc_mnt
>>> is populated and it will get a null pointer dereference in proc_flush_task().
>>> Also, child reaper mustn't die before processes from the namespace.
>>
>> This patch introduces the possibility that two or more processes may
>> have the same pid namespace (with no processes) as pid_ns_for_children.
>>
>> Which means you can now have a race for the first pid in alloc_pid.
>> Making it indeterminant who allocates the init process. Which is not
>> acceptable.
>>
>> It is not acceptable on two grounds.
>> 1) It is a bogus user space semantic. Because userspace needs to
>> know who allocates init.
>> 2) It is horrible for maintenance becuase now the code has to be very
>> clever to deal with a case that no one cares about. Which is
>> a general formula for buggy code.
>
> We may disallow setns() if there is no child reaper created, and
> this solves all above issues. Please see v2 below, it has no problems
> you pointed.
>
> [PATCH v2]pid_ns: Allow to get pid_for_children ns before child_reaper is created
>
> This patch prohibits setns() on a pid namespace till its child_reaper
> is set, and it makes possible and safe to get just unshared pid_ns
> from "/proc/[pid]/ns/pid_for_children" file. This may be useful
> to determine user_ns of such a created pid_ns, which is not possible now.
>
> It was not possible till now, because the architecture of pid namespaces
> assumes child reaper is the first created process of the namespace,
> and it initializes pid_namespace::proc_mnt. Child reaper creation
> mustn't race with creation of another processes from this namespace,
> otherwise a process with pid > 1 may die before pid_namespace::proc_mnt
> is populated and it will get a null pointer dereference in proc_flush_task().
> Also, child reaper mustn't die before processes from the namespace.
>
> The patch prevents such races. It allows to setns() on a pid namespace
> only if ns->child_reaper is already set, and this guarantees, that
> only pid namespace creator may establish child reaper.
> So, we can safely allow to get "/proc/[pid]/ns/pid_for_children"
> since it's created, and to analyse it.
>
> v2: Don't race for child reaper creation.
>
> Signed-off-by: Kirill Tkhai <ktkhai@xxxxxxxxxxxxx>
> CC: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> CC: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
> CC: Oleg Nesterov <oleg@xxxxxxxxxx>
> CC: Andy Lutomirski <luto@xxxxxxxxxx>
> CC: Serge Hallyn <serge@xxxxxxxxxx>
> CC: Michal Hocko <mhocko@xxxxxxxx>
> CC: Andrei Vagin <avagin@xxxxxxxxxx>
> CC: Cyrill Gorcunov <gorcunov@xxxxxxxxxx>
> CC: Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx>
> CC: Ingo Molnar <mingo@xxxxxxxxxx>
> CC: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> ---
> kernel/pid_namespace.c | 18 +++++++++---------
> 1 file changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 74a5a7255b4d..5e7b3fd0d4c2 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -385,15 +385,6 @@ static struct ns_common *pidns_for_children_get(struct task_struct *task)
> }
> task_unlock(task);
>
> - if (ns) {
> - read_lock(&tasklist_lock);
> - if (!ns->child_reaper) {
> - put_pid_ns(ns);
> - ns = NULL;
> - }
> - read_unlock(&tasklist_lock);
> - }
> -
> return ns ? &ns->ns : NULL;
> }
>
> @@ -428,6 +419,15 @@ static int pidns_install(struct nsproxy *nsproxy, struct ns_common *ns)
> if (ancestor != active)
> return -EINVAL;
>
> + /*
> + * Disallow processes to use pid namespace till its
> + * creator makes child reaper. Otherwise, several
> + * processes race for that, and it's not clear who
> + * establishes init.
> + */
> + if (!new->child_reaper)
> + return -ESRCH;
> +
> put_pid_ns(nsproxy->pid_ns_for_children);
> nsproxy->pid_ns_for_children = get_pid_ns(new);
> return 0;
>