On Mon, Sep 21, 2015 at 10:49:39AM +0800, Chen Fan wrote:
> On 09/17/2015 12:31 AM, Serge E. Hallyn wrote:
>> On Wed, Sep 16, 2015 at 09:49:02AM -0500, Eric W. Biederman wrote:
>>> "Serge E. Hallyn" <serge@xxxxxxxxxx> writes:
>>>> On Wed, Sep 16, 2015 at 10:37:33AM +0300, Konstantin Khlebnikov wrote:
>>>>> On 15.09.2015 20:41, Serge Hallyn wrote:
>>>>>> Quoting Stéphane Graber (stgraber@xxxxxxxxxx):
>>>>>>> On Tue, Sep 15, 2015 at 06:01:38PM +0300, Konstantin Khlebnikov wrote:
>>>>>>>> On 15.09.2015 17:27, Eric W. Biederman wrote:
>>>>>>>>> Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> writes:
>>>>>>>>>> pid_t getvpid(pid_t pid, pid_t source, pid_t target);
>>>>>>>>>>
>>>>>>>>>> This syscall converts pid from one pid-ns into pid in another pid-ns:
>>>>>>>>>> it takes @pid in namespace of @source task (zero for current) and
>>>>>>>>>> returns related pid in namespace of @target task (zero for current too).
>>>>>>>>>> If pid is unreachable from target pid-ns then it returns zero.
>>>>>>>>>
>>>>>>>>> This interface as presented is inherently racy. It would be better
>>>>>>>>> if source and target were file descriptors referring to the namespaces
>>>>>>>>> you wish to translate between.
>>>>>>>>
>>>>>>>> Yep, it's racy. As well as any operation with non-child pids.
>>>>>>>> With file descriptors for source/target result will be racy anyway.
>>>>>>>>
>>>>>>>>>> Such conversion is required for interaction between processes from
>>>>>>>>>> different pid-namespaces. For example when system service talks with
>>>>>>>>>> client from isolated container via socket about task in container:
>>>>>>>>>
>>>>>>>>> Sockets are already supported. At least the metadata of sockets is.
>>>>>>>>>
>>>>>>>>> Maybe we need this but I am not convinced of its utility.
>>>>>>>>>
>>>>>>>>> What are you trying to do that motivates this?
>>>>>>>>
>>>>>>>> I'm working on a hierarchical container management system which
>>>>>>>> allows creating and controlling nested sub-containers from containers
>>>>>>>> ( https://github.com/yandex/porto ). The main server runs on the host and
>>>>>>>> has to interact with all levels of nested namespaces. This syscall
>>>>>>>> makes some operations much easier: the server must remember only the pid
>>>>>>>> in the host pid namespace and convert it into the right vpid on demand.
>>>>>>>
>>>>>>> Note that as Eric said earlier, sending a PID inside a ucred through a
>>>>>>> unix socket will have the pid translated.
>>>>>>>
>>>>>>> So while your solution certainly should be faster, you can already achieve
>>>>>>> what you want today by doing:
>>>>>>>
>>>>>>> == Translate PID in container to PID in host
>>>>>>> - open a socket
>>>>>>> - setns to container's pidns
>>>>>>> - send ucred from that container containing the requested container PID
>>>>>>> - host sees the host PID
>>>>>>>
>>>>>>> == Translate PID on host to PID in container
>>>>>>> - open a socket
>>>>>>> - setns to container's pidns
>>>>>>> - send ucred from the host containing the requested host PID
>>>>>>>   (send will fail if the host PID isn't part of that container)
>>>>>>> - container sees the container PID
>>>>>>
>>>>>> In addition, since commit e4bc332451 : /proc/PID/status: show all sets of pid according to ns
>>>>>> we now also have 'NSpid' etc in /proc/$$/status.
>>>>>
>>>>> As I see this works perfectly only for converting host pid into virtual.
>>>>> Backward conversion is troublesome: we have to scan all pids in host
>>>>> procfs and somehow filter tasks from container and its sub-pid-ns.
>>>>> Or I am missing something trivial?
>>>>
>>>> Ah, no that doesn't help with this.
>>>>
>>>> What Stéphane describes is what I've done in several projects.
>>>> Getting it right is however actually quite tricky. I'm not
>>>> convinced it's at the level of "since you can do (sweep hands)
>>>> all this, we don't need a simple syscall to do it."
>>>>
>>>> So I'd encourage you to resend using namespace inode fds for
>>>> source and target as Eric suggested. We still may decide that
>>>> the syscall isn't needed, but it's a trivial change to your
>>>> patch and removes that race. And I'm not convinced it's not
>>>> needed.
>>>
>>> At this point my primary concern is that a pattern that would need to be
>>> converting to and from pids quickly is potentially fundamentally racy to
>>> the point of broken.
>>
>> The cgmanager GetTasks and GetTasksRecursive, and reading of the
>> lxcfs cgroup /tasks files, require converting every pid from the
>> cgmanager's namespace to the reading task's namespace.
>>
>>> Especially with unix domain sockets passing and converting pids in a way
>>> that covers the common case.
>>>
>>> I am clearly missing some nuance of this use case.
>>
>> lxcfs and cgmanager are imo proof that we *can* do without the new
>> syscall. However, the git history will show that there are some
>> complications, and the system load when a few systemds are starting
>> will show that it does take a performance toll on the host at some
>> point. Still as I say it's doable. The syscall implementation was
>> very simple, though.
>
> Yes, a previous email discussed implementing this via a syscall or procfs:
> http://www.gossamer-threads.com/lists/linux/kernel/1971723?search_string=chen%20hanxiao;#1971723
> but it seems complicated to implement via procfs; the original discussion is at:
> http://www.gossamer-threads.com/lists/linux/kernel/2076440?search_string=chen%20hanxiao;#2076440

So please implement it, as Eric suggested, using the ns inode fds
instead of racy pid_t hints for namespaces.
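
The socket mechanics behind Stéphane's ucred recipe quoted above can be sketched roughly as follows. This is a minimal illustration under assumptions, not code from lxcfs or cgmanager: `send_pid` and `recv_pid` are hypothetical names, and the setns()/fork() step that puts one end of the socket into the container's pid namespace is left out. The kernel performs the translation: the receiver reads the pid as it appears in its own pid namespace.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc exposes struct ucred only with this */
#endif
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer large enough for one struct ucred, suitably aligned. */
union cred_ctl {
    char buf[CMSG_SPACE(sizeof(struct ucred))];
    struct cmsghdr align;
};

/*
 * Send one byte with `pid` attached as SCM_CREDENTIALS.
 * `pid` is interpreted in the sender's pid namespace; an unprivileged
 * sender may only pass its own pid. Returns 0 on success, -1 on error.
 */
int send_pid(int sock, pid_t pid)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    struct ucred cred = { .pid = pid, .uid = getuid(), .gid = getgid() };
    union cred_ctl ctl;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_CREDENTIALS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(cred));
    memcpy(CMSG_DATA(cmsg), &cred, sizeof(cred));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/*
 * Receive one such message and return the pid it carries, as seen
 * from the receiver's pid namespace. Returns -1 on error or if no
 * credentials were attached.
 */
pid_t recv_pid(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union cred_ctl ctl;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    struct ucred cred;
    int on = 1;

    /* Ask the kernel to deliver credentials on this socket. */
    if (setsockopt(sock, SOL_SOCKET, SO_PASSCRED, &on, sizeof(on)) < 0)
        return -1;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);
    if (recvmsg(sock, &msg, 0) < 0)
        return -1;

    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_CREDENTIALS) {
            memcpy(&cred, CMSG_DATA(cmsg), sizeof(cred));
            return cred.pid;
        }
    }
    return -1;
}
```

One detail that makes the full recipe "quite tricky", as Serge puts it: setns() into a pid namespace changes the namespace of subsequently created children, not of the caller, so the sending or receiving side has to be a forked child.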
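
The NSpid field mentioned earlier in the thread covers the host-to-container direction only, as Konstantin points out. A small sketch of reading it, assuming the caller has already loaded the text of /proc/<pid>/status into a string; `parse_nspid` is a hypothetical helper name. Per proc(5), the leftmost value is the pid in the namespace of the procfs mount and the following values belong to successively nested namespaces.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helper: extract the values of the "NSpid:" line from
 * the text of a /proc/<pid>/status file.
 * Stores at most `max` pids in out[], outermost namespace first.
 * Returns the number of pids stored, or -1 if no NSpid line is found.
 */
int parse_nspid(const char *status, pid_t *out, int max)
{
    const char *line = status;

    while (line && *line) {
        if (strncmp(line, "NSpid:", 6) == 0) {
            const char *p = line + 6;
            int n = 0;

            while (n < max) {
                char *end;
                long v;

                /* Skip separators but never run past this line. */
                while (*p == ' ' || *p == '\t')
                    p++;
                if (*p == '\n' || *p == '\0')
                    break;

                v = strtol(p, &end, 10);
                if (end == p)           /* not a number: stop */
                    break;
                out[n++] = (pid_t)v;
                p = end;
            }
            return n;
        }
        line = strchr(line, '\n');
        if (line)
            line++;                     /* move to the start of the next line */
    }
    return -1;
}
```

The reverse lookup (container pid to host pid) still requires scanning every task's status file, which is the cost the proposed syscall is meant to avoid.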