Re: [PATCH] nsproxy: attach to namespaces via pidfds

From: Michael Kerrisk (man-pages)
Date: Mon Apr 27 2020 - 16:07:05 EST


Hello Christian,

On 4/27/20 4:36 PM, Christian Brauner wrote:
> For quite a while we have been thinking about using pidfds to attach to
> namespaces.

(Sounds promising.)

> This patchset has existed for about a year already but we've
> wanted to wait to see how the general api would be received and adopted.
> Now that more and more programs in userspace have started using pidfds
> for process management it's time to send this one out.
>
> This patch makes it possible to use pidfds to attach to the namespaces
> of another process, i.e. they can be passed as the first argument to the
> setns() syscall. When only a single namespace type is specified the
> semantics are equivalent to passing an nsfd. That means
> setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However,
> when a pidfd is passed, multiple namespace flags can be specified in the
> second setns() argument and setns() will attach the caller to all the
> specified namespaces all at once or to none of them.

While I think I understand the intended semantics, the description
in the previous paragraph feels off. If this text lands in a commit
message (or a manual page), I think it needs fixing.

First, it seems odd to say that

"setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET)"

setns(nsfd, CLONE_NEWNET) means: fail if nsfd does not refer to a
network namespace.

setns(pidfd, CLONE_NEWNET) means: move into just the network
namespace of the process referred to by 'pidfd'.

I would not call those two things "equal", in a semantic sense.
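
To make the distinction concrete, here is a rough C sketch of the two
calls. The helper names are invented for illustration, the pidfd
variant assumes the semantics proposed in this patch, and the fallback
syscall number is an assumption for systems whose headers do not yet
define SYS_pidfd_open.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434  /* assumption: number used on most architectures */
#endif

/* Existing interface: the nstype argument is a check on the fd.
   setns() fails unless nsfd refers to a network namespace. */
int join_netns_via_nsfd(pid_t pid)
{
    char path[64];
    int nsfd, ret, saved;

    snprintf(path, sizeof(path), "/proc/%d/ns/net", (int) pid);
    nsfd = open(path, O_RDONLY | O_CLOEXEC);
    if (nsfd == -1)
        return -1;

    ret = setns(nsfd, CLONE_NEWNET);
    saved = errno;              /* close() may overwrite errno */
    close(nsfd);
    errno = saved;
    return ret;
}

/* Proposed interface: the nstype argument is a selector.
   It picks the network namespace out of all the namespaces
   of the process that pidfd refers to. */
int join_netns_via_pidfd(pid_t pid)
{
    int pidfd, ret, saved;

    pidfd = (int) syscall(SYS_pidfd_open, pid, 0);
    if (pidfd == -1)
        return -1;

    ret = setns(pidfd, CLONE_NEWNET);
    saved = errno;
    close(pidfd);
    errno = saved;
    return ret;
}
```

Both helpers end up in the same namespace on success, but in the
first the flag validates the fd, while in the second it selects
which namespace to join.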

And then:

> If 0 is specified
> together with a pidfd then setns() will interpret it the same way 0 is
> interpreted together with a nsfd argument, i.e. attach to any/all
> namespaces.

If I understand right, setns(pidfd, 0) would mean: move into
all of the same namespaces as the process referred to by 'pidfd'.

But setns(nsfd, 0) means: move into whatever kind of namespace
is referred to by 'nsfd'.

I would not say of these two cases that 0 is interpreted
in the same way.
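
For the nsfd case, "whatever kind of namespace" is a property of the
file descriptor itself, and it can be queried with the NS_GET_NSTYPE
ioctl described in ioctl_ns(2). A minimal sketch, where the helper
name nsfd_type() is made up for illustration:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nsfs.h>

/* Return the CLONE_NEW* constant for the namespace that 'path'
   refers to, or -1 on error. This is the type that
   setns(nsfd, 0) would join without any further check. */
int nsfd_type(const char *path)
{
    int fd, type, saved;

    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    type = ioctl(fd, NS_GET_NSTYPE);
    saved = errno;              /* close() may overwrite errno */
    close(fd);
    errno = saved;
    return type;
}
```

With an nsfd, a 0 nstype therefore means "skip the type check on this
one namespace"; a pidfd carries no single type that 0 could stand for.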

Hopefully I have not misunderstood.



> The obvious example where this is useful is a standard container
> manager interacting with a running container: pushing and pulling files
> or directories, injecting mounts, attaching/execing any kind of process,
> managing network devices all these operations require attaching to all
> or at least multiple namespaces at the same time. Given that nowadays
> most containers are spawned with all namespaces enabled we're currently
> looking at at least 14 syscalls, 7 to open the /proc/<pid>/ns/<ns>
> nsfds, another 7 to actually perform the namespace switch. With time
> namespaces we're looking at about 16 syscalls.
> (We could amortize the first 7 or 8 syscalls for opening the nsfds by
> stashing them in each container's monitor process but that would mean
> we need to send around those file descriptors through unix sockets
> everytime we want to interact with the container or keep on-disk
> state. Even in scenarios where a caller wants to join a particular
> namespace in a particular order callers still profit from batching
> other namespaces. That mostly applies to the user namespace but
> all container runtimes I found join the user namespace first no matter
> if it privileges or deprivileges the container.)
> With pidfds this becomes a single syscall no matter how many namespaces
> are supposed to be attached to.

That does seem like a win. Thanks for working on this!
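
For comparison, here is a rough sketch of the two approaches for the
seven namespace types mentioned above. The function names, the table
order and the fallback syscall number are my assumptions, and the
pidfd path relies on the semantics proposed in this patch. I pass an
explicit mask rather than 0 to keep the intent unambiguous.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434  /* assumption: number used on most architectures */
#endif

struct ns_entry { const char *name; int flag; };

/* User namespace first, as the quoted text says runtimes do. */
static const struct ns_entry ns_table[] = {
    { "user", CLONE_NEWUSER },  { "ipc", CLONE_NEWIPC },
    { "uts", CLONE_NEWUTS },    { "net", CLONE_NEWNET },
    { "pid", CLONE_NEWPID },    { "cgroup", CLONE_NEWCGROUP },
    { "mnt", CLONE_NEWNS },
};
#define NS_COUNT (sizeof(ns_table) / sizeof(ns_table[0]))

int all_ns_flags(void)
{
    int flags = 0;
    for (size_t i = 0; i < NS_COUNT; i++)
        flags |= ns_table[i].flag;
    return flags;
}

/* Today: one open() plus one setns() per namespace, 14 calls
   in total. A failure midway leaves the caller partly switched. */
int join_all_via_nsfds(pid_t pid)
{
    int fds[NS_COUNT];
    size_t opened = 0;
    int ret = 0, saved = 0;
    char path[64];

    for (size_t i = 0; i < NS_COUNT; i++) {
        snprintf(path, sizeof(path), "/proc/%d/ns/%s",
                 (int) pid, ns_table[i].name);
        fds[i] = open(path, O_RDONLY | O_CLOEXEC);
        if (fds[i] == -1) {
            ret = -1;
            saved = errno;
            break;
        }
        opened++;
    }

    for (size_t i = 0; ret == 0 && i < opened; i++) {
        if (setns(fds[i], ns_table[i].flag) == -1) {
            ret = -1;
            saved = errno;
        }
    }

    for (size_t i = 0; i < opened; i++)
        close(fds[i]);
    if (ret == -1)
        errno = saved;
    return ret;
}

/* Proposed: a single setns() call that attaches to all of the
   requested namespaces or to none of them. */
int join_all_via_pidfd(pid_t pid)
{
    int pidfd, ret, saved;

    pidfd = (int) syscall(SYS_pidfd_open, pid, 0);
    if (pidfd == -1)
        return -1;

    ret = setns(pidfd, all_ns_flags());
    saved = errno;
    close(pidfd);
    errno = saved;
    return ret;
}
```

Beyond the syscall count, the all-or-nothing behavior looks like the
bigger gain to me: the loop in the first function has no clean way to
undo a partial switch.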

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/