Re: [RFC] [PATCH 00/13] Introduce task_pid api

From: Hubertus Franke
Date: Tue Nov 15 2005 - 09:37:40 EST


Serge E. Hallyn wrote:
Quoting Paul Jackson (pj@xxxxxxx):


There have been a few suggestions going back and forth.
Let me address them all at once.

(A) why a vpid?

For transparent checkpointing. Vserver, for instance, has not implemented
checkpoint/restart yet, because without this concept it is not possible.
The moment you want transparent checkpointing, you need to deal with the fact
that the result of a getpid() may be held in a register (worst case), and upon
restart the system must provide the same pid on a different machine.
That immediately suggests pid range reservation... but see point (C) below.

(B) syscall interception and LD_PRELOAD:

In principle that is possible, but it leads to potentially inefficient code
and largely leaves the issue of pid space creation and migration on the table.
However, it makes clear that as long as the transformation or mapping between
virtual and real pids is kept consistent, this is quite a useful concept.

The question now is how deep into the kernel I have to drive it in order to
create an efficient implementation.

(C) Fixed PID range allocation:

That is completely unscalable and unnecessary:

First, PID range allocation at a global level (e.g. cluster level) requires some agent.
With PID_MAX ~ 2**22, a 32-bit architecture is left with only 512 pidspaces (the negative
range needs to be preserved, I think, so 31 usable bits minus 22 leaves 9 bits).
However, it is not unreasonable to assume that 512 different pidspaces per OS image is not
a restriction.
Hence, when a pidspace is migrated, it can simply be assigned a different pidspace id.
Then going with kernelpid = (pidspace_id << 22) | vpid is an efficient means to
map between virtual pidspace and physical pidspace and vice versa.
All that needs to be managed is local pidspace allocation.
The translations vpid <-> pid are very lightweight, as can be seen from the above
composition.
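
To make the composition concrete, here is a minimal userspace sketch in C. The
helper and macro names (vpid_to_kpid, kpid_to_vpid, kpid_to_pidspace, VPID_BITS,
MAX_PIDSPACES) are made up for illustration and are not taken from the patch set.
It assumes 22 bits of vpid and a sign bit that is always kept clear, as described
above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 22 bits of vpid, matching PID_MAX ~ 2**22 from the text. */
#define VPID_BITS      22
#define VPID_MASK      ((1 << VPID_BITS) - 1)
/* 31 usable bits (sign bit kept clear) minus 22 vpid bits leaves 9 bits. */
#define PIDSPACE_BITS  (31 - VPID_BITS)
#define MAX_PIDSPACES  (1 << PIDSPACE_BITS)

/* Hypothetical helper: compose a kernel pid from pidspace id and vpid. */
static inline int32_t vpid_to_kpid(int32_t pidspace_id, int32_t vpid)
{
    assert(pidspace_id >= 0 && pidspace_id < MAX_PIDSPACES);
    assert(vpid >= 0 && vpid <= VPID_MASK);
    return (int32_t)(((uint32_t)pidspace_id << VPID_BITS) | (uint32_t)vpid);
}

/* Hypothetical helper: the low 22 bits are the vpid seen inside the pidspace. */
static inline int32_t kpid_to_vpid(int32_t kpid)
{
    return kpid & VPID_MASK;
}

/* Hypothetical helper: the remaining high bits identify the pidspace. */
static inline int32_t kpid_to_pidspace(int32_t kpid)
{
    return kpid >> VPID_BITS;
}
```

Both directions are a shift and a mask, which is why the translation cost is
negligible compared to a table lookup.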

Take for example the vserver system. A local vserver agent could maintain the
pidspace allocation. On creation of a vserver it assigns the next available pidspace.
That pidspace id is internal to vserver and is not exported as a property of a vserver.
When a vserver is migrated to a different machine, a potentially different pidspace
is allocated, yet all the vpids remain the same.
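
A rough sketch of what such a local agent could look like, again in userspace C.
Everything here is hypothetical (struct pidspace_agent, pidspace_alloc,
pidspace_free, kernel_pid), and reserving id 0 for the default pid space is an
assumption of the sketch, not something decided above.

```c
#include <assert.h>
#include <stdbool.h>

#define VPID_BITS      22
#define MAX_PIDSPACES  512

/* Hypothetical per-machine agent state: which pidspace ids are in use. */
struct pidspace_agent {
    bool in_use[MAX_PIDSPACES];
};

/*
 * Hand out the lowest free pidspace id, or -1 if none is left.
 * Assumption: id 0 is reserved for the default pid space.
 */
static int pidspace_alloc(struct pidspace_agent *agent)
{
    for (int id = 1; id < MAX_PIDSPACES; id++) {
        if (!agent->in_use[id]) {
            agent->in_use[id] = true;
            return id;
        }
    }
    return -1;
}

/* Release an id, e.g. after the vserver has been migrated away. */
static void pidspace_free(struct pidspace_agent *agent, int id)
{
    assert(id > 0 && id < MAX_PIDSPACES);
    agent->in_use[id] = false;
}

/* Same composition as in (C): kernelpid = (pidspace_id << 22) | vpid. */
static int kernel_pid(int pidspace_id, int vpid)
{
    return (pidspace_id << VPID_BITS) | vpid;
}
```

On migration, the source machine would free its id and the target machine would
allocate whatever id is free there. The kernel pids differ between the two
machines, but masking off the pidspace bits yields the same vpid on both.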

(D) Cross compilation

I do all my work on s390, so that space is covered.

If I missed some of the issues that were raised let me know and we will try to address
those.

I am part of Serge's team and have been working on intercepting the various places
where virtual to real pid translations have to occur in the kernel.
It's still in pretty bad shape, but it boots for the default pid space (:- ).
Off the top of my head, I'd say there are about 40 places each where the translation
has to be done. Many are in the /proc fs, some in the signal handling.

I hope to have something to post by the end of the week that gives an idea of how we
think this could be realized.