Re: [RFC] [PATCH 0/7] Some basic vserver infrastructure

From: Eric W. Biederman
Date: Fri Mar 24 2006 - 15:27:58 EST


Kirill Korotaev <dev@xxxxx> writes:


>> I certainly have not. I do feel that developing this just from the
>> top down is the wrong way to do this. In some of the preliminary
>> patches we have found several pieces of code that we will have to
>> touch that is currently in need of a cleanup. That is why I have
>> been cleaning up /proc. sysctl is in need of similar treatment
>> but is in less bad shape.
> Eric, though I suggest to postpone proc and sysctl a bit, can you share
> me your vision of /proc and /sysctl virtualization a bit?
> A good way to handle them IMHO is to make fully virtual, i.e. each
> namespace should have an own set of sysctl or proc tree.

Roughly I agree. Some cases are easier than others. So let me take
just the sysvipc case as an example.

My thinking is to move the code that prints the sysvipc namespace
from fs/proc/generic.c (with all of its cool helpers) to
fs/proc/base.c.

So we wind up with:
/proc/<pid>/sysvipc/msg
/proc/<pid>/sysvipc/sem
/proc/<pid>/sysvipc/shm
/proc/sysvipc -> /proc/self/sysvipc
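
To make the layout above concrete, here is a small userspace sketch of
what following the /proc/sysvipc symlink would yield for a given reader.
It is not kernel code. The helper name resolve_sysvipc_path and its
signature are made up for illustration, and it only mimics the path
rewrite, assuming the symlink scheme shown above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a kernel function): rewrite a legacy path
 * such as "/proc/sysvipc/msg" into "/proc/<pid>/sysvipc/msg", which is
 * what the symlink /proc/sysvipc -> /proc/self/sysvipc would resolve
 * to for a reader with the given pid.
 *
 * Returns 0 on success, -1 if the path is not under /proc/sysvipc or
 * the output buffer is too small.
 */
int resolve_sysvipc_path(const char *path, int reader_pid,
                         char *out, size_t outlen)
{
    static const char prefix[] = "/proc/sysvipc";
    const size_t plen = sizeof(prefix) - 1;
    int n;

    assert(path != NULL && out != NULL);

    if (strncmp(path, prefix, plen) != 0)
        return -1;
    /* Accept only an exact match or a path component boundary. */
    if (path[plen] != '\0' && path[plen] != '/')
        return -1;

    n = snprintf(out, outlen, "/proc/%d/sysvipc%s", reader_pid, path + plen);
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    return 0;
}
```

The point is only that two readers in different namespaces ask for the
same name and land in different per process directories.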

For sysctl we add a method that fetches the address of
the variable, and perhaps a few other attributes.
That method is passed a task structure.

Then we can have per process instances of:
/proc/<pid>/sys/sem
/proc/<pid>/sys/shmall
/proc/<pid>/sys/shmmax
/proc/<pid>/sys/msgmax
/proc/<pid>/sys/msgmni
/proc/<pid>/sys/shmmni
And a symlink at:
/proc/sys that points to /proc/self/sys
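
A rough userspace sketch of the sysctl idea, to show the shape of it.
Every name here (task_sketch, ipc_ns_sketch, sysctl_entry_sketch,
sysctl_read_int) is hypothetical and stands in for kernel structures.
The only thing taken from the description above is that each entry
carries a method that is passed a task and returns the address of the
variable that task should see.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a per-namespace set of sysvipc limits. */
struct ipc_ns_sketch {
    int msgmax;
    int msgmni;
    int shmmni;
};

/* Hypothetical stand-in for a task structure. */
struct task_sketch {
    int pid;
    struct ipc_ns_sketch *ipc_ns;
};

/* A sysctl entry holds a lookup method instead of a fixed address. */
struct sysctl_entry_sketch {
    const char *name;
    int *(*lookup)(struct task_sketch *task);
};

int *lookup_msgmax(struct task_sketch *t) { return &t->ipc_ns->msgmax; }
int *lookup_msgmni(struct task_sketch *t) { return &t->ipc_ns->msgmni; }
int *lookup_shmmni(struct task_sketch *t) { return &t->ipc_ns->shmmni; }

const struct sysctl_entry_sketch ipc_sysctls[] = {
    { "msgmax", lookup_msgmax },
    { "msgmni", lookup_msgmni },
    { "shmmni", lookup_shmmni },
};

/*
 * Read the named value as seen by the given task.
 * Returns 0 on success, -1 if no entry has that name.
 */
int sysctl_read_int(struct task_sketch *task, const char *name, int *out)
{
    size_t i;

    assert(task != NULL && task->ipc_ns != NULL);

    for (i = 0; i < sizeof(ipc_sysctls) / sizeof(ipc_sysctls[0]); i++) {
        if (strcmp(ipc_sysctls[i].name, name) == 0) {
            *out = *ipc_sysctls[i].lookup(task);
            return 0;
        }
    }
    return -1;
}
```

With this shape, two tasks pointing at different namespaces read
different values through the same table, and tasks sharing a namespace
see each other's changes.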

Getting sysvipc to show up in a per process fashion is pretty
easy. Getting the entire sys hierarchy to show up per process
is a little harder, simply because I think doing it cleanly requires
helper functions that I don't have yet. I have already removed all of
the internal dependence on magic inode numbers, so completely removing
the hard coded inode numbers and putting sys under /proc/<pid> looks doable.

Does that sound like a reasonable model?

>> Part of it is that I have stopped to look more closely at what
>> other people are doing and to look at alternative implementations.
> If you need any help with it in OpenVZ, feel free to ask. We have
> broken-out patches for recent 2.6.16 kernel.


>> One interesting thing I have managed to do is by using ptrace I
>> have implemented enter for the existing filesystem namespaces without having
>> to modify the kernel. This at least says
>> that enter and debugging are two faces of the same coin.
> Hmmm, strange claim/conclusion... /dev/kmem allows to change namespaces
> also :) and even to obtain root privileges if needed... :)

True. However this is much less ugly than using /dev/kmem, and it is
much closer to what applications like user mode linux do. The primary
question in my mind was what the permission checks should be when
performing this kind of action. Using ptrace answered that.

So I now have a bounding box for what enter should be able to do
and what permissions it should take.

> Eric, let's not compare approaches with inches :)
> As you remember your PID namespaces doesn't suit us well... :(

More discussion when the time is right. But I believe I have solved
the fundamental incompatibility that we had. I asked you a question
to confirm that a while ago, but I have not heard anything back.

Eric