Re: [PATCH -mm 0/7] execns syscall and user namespace
From: Eric W. Biederman
Date: Wed Jul 12 2006 - 12:56:08 EST
Cedric Le Goater <clg@xxxxxxxxxx> writes:
> Hello !
>
> Hopefully, we will soon see each other at OLS. We need some synchronous
> interaction !
>
> Eric W. Biederman wrote:
>
>>>> Is it not possible to ensure what you are trying to ensure with
>>>> a good user space executable?
>>> unshare() is unsafe for some namespaces because namespaces can reference
>>> each other. For the ipc namespace, example are shm ids vs. vma, sem ids vs.
>>> semundos, msq vs. netlink sockets. for the user namespace, open files. So
>>> it seems reasonable to provide a way to unshare namespaces from a clean
>>> process context.
>>
>> It is perfectly legitimate to have a shared memory region memory mapped
>> from another namespace.
>
> then after unshare, a process can be in ipc namespace B with a shared
> memory segment from ipc namespace A without any id for this segment. this
> is not very consistent. the same process will also be able to modify the
> ipc namespace B without being in this namespace. ugly. It looks like an
> issue that should be solved.
>
> I think namespace should enforce strict isolation. nop ?
>
>> Yes sem ids versus semundos is an issue but it just requires you to unshare
>> one at the same time you unshare the other, or to simply clone a new
>> namespace.
>
> hmm, the semids from the ipc namespace are stored in task->sysv_sem. i would
> forbid the unshare/clone in that case or flush the semundos like in
> exit_sem(). but it's easier not to have any, like in a clean process image.
>
>> I'm not familiar with the msq vs netlink socket issue.
>
> mq_notify can use a netlink socket to send an event back user space.

Ok. That one is a mess, and I almost recall seeing that. A big
chunk of that is a general netlink socket problem. Getting enough
context in a netlink socket is a challenge because you can't use current.
I do think solving that is achievable though. Just very peculiar.

>> As for the user namespace vs open files. If we have any issues with open
>> files in any namespace that sounds like an implementation bug to me.
>
> user_struct does accounting on process, open files, locked memory, signals,
> etc. if you unshare such an object, you will need to unshare all others
> namespaces to be consistent. again having a clean process image is easier ...

I just don't see it. The accounting is about objects and the namespaces
are about names of those objects.

>> I'm not convinced the problems you are seeing are not implementation bugs.
>> For some things clone is still more general than unshare, and clone should
>> be considered the primary user interface, not unshare.
>
> agree on that, i might be focusing a bit too much on the unshare syscall.
> we should work on clone to make sure it has the required restrictions. The
> system is really interlinked and not all namespaces can be unshared standalone.

I completely agree that there are pieces that interlink.

>>> Now, if you try to do that from user space, you will call unshare() then
>>> execve(), which leaves plenty of room and time for nasty things to happen
>>> in between the 2 calls.
>>
>> I will look more closely but I think there is an important point being missed
>> somewhere. Pieces of the kernel interact in all sorts of weird and unexpected
>> ways. If we rely on ourselves always being in the right magic namespace for
>> things to work correctly we are setting ourselves up for trouble. If we know
>> a namespace implementation will work even when a process has access to entities
>> in multiple instances of that namespace we are in much better shape.
>
> having a clean process image is IMHO required for some namespaces :
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=113881171017330&w=2

That message is a terrible example. Unless you are thinking of something
farther down that thread. User space getting confused when it creates
a container is just an implementation of the container creation code.

Now I'm not certain what you mean by a clean process image, as there are always
left over pieces from the parent. Clone creates a new task_struct. exec replaces
the executable. They both keep files open.

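To make the last point concrete, here is a minimal user space sketch. It is
only an illustration, not code from the patch set: the helper name
fd_survives_exec() is made up, fork() stands in for clone(), and it assumes
Linux with /proc mounted and a "test" utility in PATH. It shows that a
descriptor opened by the parent is still open after fork() plus exec, unless
close-on-exec was requested.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a descriptor opened by the parent is
 * still open in the child after fork() + exec, 0 if it is gone, and -1 on
 * any error. Assumes /proc/self/fd is available.
 */
int fd_survives_exec(int set_cloexec)
{
    int fd = open("/dev/null", O_RDONLY);
    if (fd < 0)
        return -1;

    /* Optionally ask the kernel to close the descriptor at exec time. */
    if (set_cloexec && fcntl(fd, F_SETFD, FD_CLOEXEC) < 0) {
        close(fd);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }

    if (pid == 0) {
        /* Child: replace the executable, then let the new program check
         * whether the inherited descriptor number is still valid. */
        char path[64];
        snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
        execlp("test", "test", "-e", path, (char *)NULL);
        _exit(127); /* exec itself failed */
    }

    /* Parent: the child has its own copy, so ours can go. */
    close(fd);

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 0)
        return 1;   /* descriptor was still open after exec */
    if (WEXITSTATUS(status) == 1)
        return 0;   /* descriptor was closed across exec */
    return -1;
}
```

Called with 0 the helper should report that the descriptor survived, and
with 1 that it did not, so whatever the descriptor refers to travels with
the process unless something explicitly closes it.
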
Eric