Re: (pspace,pid) vs true pid virtualization

From: Eric W. Biederman
Date: Thu Feb 16 2006 - 11:41:27 EST

Next message: Edgar Hucek: "ACPI Troubles with 2.6.16-rc3"
Previous message: Heiko Carstens: "[patch 3/3] s390: sys32_fstatat -> sys32_fstatat64"
In reply to: Serge E. Hallyn: "Re: (pspace,pid) vs true pid virtualization"
Next in thread: Serge E. Hallyn: "Re: (pspace,pid) vs true pid virtualization"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

"Serge E. Hallyn" <serue@xxxxxxxxxx> writes:

> Quoting Eric W. Biederman (ebiederm@xxxxxxxxxxxx):
>> "Serge E. Hallyn" <serue@xxxxxxxxxx> writes:
>> With respect to pids lets not get caught up in the implementation
>> details. Let's first get clear on what the semantics should be.
>>
>> - Should the first pid in a pid space have pid 1?
>>
>> - Should pid == 1 ignore signals, it doesn't have a handler for?
>>
>> - Should any children of pid 1 be allowed to live when pid == 1 is killed?
>
> But doesn't that depend on whether we use (pspace,pid) or vpids?

No. This is completely an implementation detail.

> If
> vpids, then init isn't really a problem, since from kernelspace
> processes in a comtainer stil have a global pid and global parent, and
> init knows them. If (pspace,pid), then we need a fakeinit bc the real
> init doesn't know about the processes in the container...

If you take the fakeinit you can't run a normal distro, because you are
not running init, but that is a slightly separate issue.

If (pspace, pid) we still may want to wait until all of the process die
before we report to the parent.

I choose the semantics that I figured best match what currently happens
now and would most likely make existing user space software work.

>> - Should a process have some sort of global (on the machine identifier)?
>
> I think to satisfy openvz existing customers this must be a yes. With
> vpid the answer is simple. With (pspace,pid), there are three anwers i've
> heard, namely
>
> 1. just use pspaceid, pid
> 2. make pspaceid small and use (pspaceid << SOMEBITS | pid)
> 3. use pid1/pid2/pid3 where pid1 is creator of pid and its
> pspace, etc...
>
> But the openvz guys also don't want userspace tool changes, making (2)
> the most likely option. Any other ideas?

Not really.

Part of the problem is with pids we don't have very many bits, and
the users of pids aren't interested in all to all communication by
pids.

There is a certain sanity in making all of the pids visible in a
single flat pid namespace. But I think there are implementation
issues.

>> - Should the pids in a pid space be visible from the outside?
>
> Again, the openvz guys say yes.

Please let them speak for themselves.

> I think it should be acceptable if a pidspace is visible in all it's
> ancestor pidspaces. I.e. if I create pspace2 and pspace3 from pid 234
> in pspace1, then pspace2 doesn't need to be able to address pspace3
> and vice versa.

A good rule. Now consider pspace 4 which is a child of pid 567
in pspace 3.

What should pspace 3 see?
What should pspace 1 see?

What happens when you migrate pspace 3 into a different pspace
on a different machine?

Is there a sane implementation for this?

My only real objection to this approach is that I cannot see a sane
and simple implementation.

>> - Should the parent of pid 1 be able to wait for it for it's
>> children?
>
> Yes.
>
>> - Is a completely disjoin pid space acceptable to anyone?
>
> To anyone? yes :)
>
> TO everyone, I don't think so.

Well I would love to hear from someone it is acceptable to.
That would be an interesting perspective.

>> - What should the parent of pid == 1 see?
>>
>> - Should a process not in the default pid space be able to create
>> another pid space?
>
> Yes.
>
> This is to support using pidspaces for vservers, and creating
> migrateable sub-pidspaces in each vserver.

Agreed.

Now this case is very interesting, because supporting it creates
interesting restrictions on the rest of the problem, and
unless I miss something this is where the OpenVZ implementation
currently falls down.

Which names does the intermediate pidspace (vserver) see the child
pidspace with?

Which names does the initial pidspace see the child pid space with?

>> - Should we be able to monitor a pid space from the outside?
>
> To some extent, yes.

Especially if you don't want to migrate your monitor tools
with your jobs...

>> - Should we be able to have processes enter a pid space?
>
> IMO that is crucial.

It seems to be. However I am not yet convinced we want to
implement this in the kernel. The in-kernel implementation is
easy but that may be dictating security policy. So this
needs a thorough review.

A side note here is that if we allow ptrace from the outside
on even just the init process this opens things up very much
like an enter would (with respect to security).

>> - Do we need to be able to be able to ptrace/kill individual processes
>> in a pid space, from the outside, and why?
>
> I think this is completely unnecessary so long as a process can enter a
> pidspace.

Which is likely a good compromise.

>> - After migration what identifiers should the tasks have?
>
> It must be possible to retain the same pids, at least from inside the
> container.
>
> So this is irrelevant, as the openvz approach can just virtualize the
> old pid, while (pspace, pid) will be able to create a new container and
> use the old pid values, which are then guaranteed to not be in use.

Actually this gets to the heart of my issue with the openvz vpid
implementation. Either I am blind or it does not address the pids
in a nested pidspace when talking about migration. I am not even
certain it currently allows for nested pid spaces.

If I migrate a vserver with a sub-pidspace and migrate that what do I
see?

>> If we can answer these kinds of questions we can likely focus in
>> on what the implementation should look like. So far I have not
>> seen a question that could not be implemented with a (pspace, pid)/pid
>> or a vpid/pid implementation.
>
> But you have, haven't you? Namely, how can openvz provide it's
> customers with a global view of all processes without putting 5 years of
> work into a new sysadmin interface?

Well I think we can reuse most of the old sysadmin interfaces yes.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Edgar Hucek: "ACPI Troubles with 2.6.16-rc3"
Previous message: Heiko Carstens: "[patch 3/3] s390: sys32_fstatat -> sys32_fstatat64"
In reply to: Serge E. Hallyn: "Re: (pspace,pid) vs true pid virtualization"
Next in thread: Serge E. Hallyn: "Re: (pspace,pid) vs true pid virtualization"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]