Re: [PATCH RFC] pidns: introduce syscall getvpid

From: Serge E. Hallyn
Date: Wed Sep 16 2015 - 12:31:33 EST


On Wed, Sep 16, 2015 at 09:49:02AM -0500, Eric W. Biederman wrote:
> "Serge E. Hallyn" <serge@xxxxxxxxxx> writes:
>
> > On Wed, Sep 16, 2015 at 10:37:33AM +0300, Konstantin Khlebnikov wrote:
> >> On 15.09.2015 20:41, Serge Hallyn wrote:
> >> >Quoting Stéphane Graber (stgraber@xxxxxxxxxx):
> >> >>On Tue, Sep 15, 2015 at 06:01:38PM +0300, Konstantin Khlebnikov wrote:
> >> >>>On 15.09.2015 17:27, Eric W. Biederman wrote:
> >> >>>>Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> writes:
> >> >>>>
> >> >>>>>pid_t getvpid(pid_t pid, pid_t source, pid_t target);
> >> >>>>>
> >> >>>>>This syscall converts pid from one pid-ns into pid in another pid-ns:
> >> >>>>>it takes @pid in namespace of @source task (zero for current) and
> >> >>>>>returns related pid in namespace of @target task (zero for current too).
> >> >>>>>If pid is unreachable from target pid-ns then it returns zero.
> >> >>>>
> >> >>>>This interface as presented is inherently racy. It would be better
> >> >>>>if source and target were file descriptors referring to the namespaces
> >> >>>>you wish to translate between.
> >> >>>
> >> >>>Yep, it's racy, as is any operation on non-child pids.
> >> >>>Even with file descriptors for source/target, the result will still be racy.
> >> >>>
> >> >>>>
> >> >>>>>Such conversion is required for interaction between processes from
> >> >>>>>different pid-namespaces. For example when system service talks with
> >> >>>>>client from isolated container via socket about task in container:
> >> >>>>
> >> >>>>Sockets are already supported. At least the metadata of sockets is.
> >> >>>>
> >> >>>>Maybe we need this but I am not convinced of its utility.
> >> >>>>
> >> >>>>What are you trying to do that motivates this?
> >> >>>
> >> >>>I'm working on a hierarchical container management system which
> >> >>>allows creating and controlling nested sub-containers from containers
> >> >>>( https://github.com/yandex/porto ). The main server runs on the host and
> >> >>>has to interact with all levels of nested namespaces. This syscall
> >> >>>makes some operations much easier: the server must remember only the pid
> >> >>>in the host pid namespace and convert it into the right vpid on demand.
> >> >>
> >> >>Note that as Eric said earlier, sending a PID inside a ucred through a
> >> >>unix socket will have the pid translated.
> >> >>
> >> >>So while your solution certainly should be faster, you can already achieve
> >> >>what you want today by doing:
> >> >>
> >> >>== Translate PID in container to PID in host
> >> >> - open a socket
> >> >> - setns to container's pidns
> >> >> - send ucred from that container containing the requested container PID
> >> >> - host sees the host PID
> >> >>
> >> >>== Translate PID on host to PID in container
> >> >> - open a socket
> >> >> - setns to container's pidns
> >> >> - send ucred from the host containing the requested host PID
> >> >> (send will fail if the host PID isn't part of that container)
> >> >> - container sees the container PID
> >> >
> >> >In addition, since commit e4bc332451 : /proc/PID/status: show all sets of pid according to ns
> >> >we now also have 'NSpid' etc in /proc/$$/status.
> >> >
> >>
> >> As far as I can see, this works well only for converting a host pid
> >> into a virtual one.
> >>
> >> The reverse conversion is troublesome: we have to scan all pids in the
> >> host procfs and somehow filter tasks from the container and its
> >> sub-pid-ns. Or am I missing something trivial?
> >
> > Ah, no that doesn't help with this.
> >
> > What Stéphane describes is what I've done in several projects.
> > Getting it right is however actually quite tricky. I'm not
> > convinced it's at the level of "since you can do (sweep hands)
> > all this, we don't need a simple syscall to do it."
> >
> > So I'd encourage you to resend using namespace inode fds for
> > source and target as Eric suggested. We still may decide that
> > the syscall isn't needed, but it's a trivial change to your
> > patch and removes that race. And I'm not convinced it's not
> > needed.
>
> At this point my primary concern is that a pattern that would need to be
> converting to and from pids quickly is potentially fundamentally racy to
> the point of broken.

The cgmanager GetTasks and GetTasksRecursive, and reading of the
lxcfs cgroup /tasks files, require converting every pid from the
cgmanager's namespace to the reading task's namespace.
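
For the host-to-container direction, the NSpid field mentioned earlier in
the thread is one way to do the lookup. Below is a rough Python sketch,
not what cgmanager or lxcfs actually do. The helper names
(nspids_from_status, innermost_pid) and the proc_root parameter are made up
for illustration, and it assumes a kernel new enough to report NSpid in
/proc/PID/status.

```python
import os


def nspids_from_status(status_text):
    """Return the NSpid values found in the text of a /proc/<pid>/status file.

    The leftmost value is the pid as seen by the procfs being read; later
    values are the pids in successively nested pid namespaces. The list is
    empty if the kernel does not report an NSpid line.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


def innermost_pid(pid, proc_root="/proc"):
    """Map a pid visible under proc_root to the task's innermost-namespace pid.

    Hypothetical helper: returns None when no NSpid line is available.
    Like any lookup by pid, this is racy if the task can exit and the
    pid can be reused in the meantime.
    """
    with open(os.path.join(proc_root, str(pid), "status")) as f:
        pids = nspids_from_status(f.read())
    return pids[-1] if pids else None
```

This only covers one direction. Going from a container pid back to a host
pid this way would still mean scanning every entry in the host procfs, which
is the problem Konstantin pointed out.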

> Especially with unix domain sockets passing and converting pids in a way
> that covers the common case.
>
> I am clearly missing some nuance of this use case.

lxcfs and cgmanager are imo proof that we *can* do without the new
syscall. However, the git history will show that there are some
complications, and the system load when a few systemds are starting
will show that it does take a performance toll on the host at some
point. Still, as I say, it's doable. The syscall implementation was
very simple, though.
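
To give an idea of the socket-based translation Stéphane outlined, here is
a minimal Python sketch of just the ucred exchange. The function names are
invented for this example. It leaves out the setns() step entirely: a real
translator would first have a helper enter the container's pid namespace
(setns on a pidns only takes effect for children, so a fork is needed) and
would need privilege to claim a pid other than its own. With both ends in
the same namespace, as here, the received pid is simply the one sent.

```python
import os
import socket
import struct

# Layout of struct ucred on Linux: pid, uid, gid (three C ints).
UCRED = struct.Struct("3i")


def make_cred_socketpair():
    """Return (rx, tx) unix datagram sockets; rx is set up to receive credentials."""
    rx, tx = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    rx.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)
    return rx, tx


def send_pid(sock, pid):
    """Send one byte plus an SCM_CREDENTIALS message claiming `pid`.

    The kernel rejects a pid other than the caller's own unless the
    caller is suitably privileged.
    """
    creds = UCRED.pack(pid, os.getuid(), os.getgid())
    sock.sendmsg([b"p"], [(socket.SOL_SOCKET, socket.SCM_CREDENTIALS, creds)])


def recv_pid(sock):
    """Receive one message and return the sender pid as seen by the receiver."""
    _data, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_SPACE(UCRED.size))
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_CREDENTIALS:
            pid, _uid, _gid = UCRED.unpack(cdata[:UCRED.size])
            return pid
    return None
```

Even this stripped-down version needs a socket pair, a control message and,
in the real case, a forked helper per namespace, which is roughly where the
complications and the load mentioned above come from.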

-serge