Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

From: Rob Landley
Date: Wed Oct 08 2014 - 17:36:12 EST


On 10/08/14 14:31, Andy Lutomirski wrote:
> On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
> <ebiederm@xxxxxxxxxxxx> wrote:
>> Andy Lutomirski <luto@xxxxxxxxxxxxxx> writes:
>>>> Maybe we want to say that rootfs should not be used if we are going to
>>>> create containers...
>>
>> Today it is an assumption of the vfs that rootfs is mounted. With
>> rootfs mounted and pivot_root at the base of the mount stack you can
>> make as minimal of a set of mounts as the vfs allows.
>>
>> Removing rootfs from the vfs requires an audit of everything that
>> manipulates mounts. It is not remotely a local excercise.
>
> Would it be a less invasive audit to allow different mount namespaces
> to have different rootfses?

I.E. The same way different namespaces have different init tasks?

The abstraction containers has implemented here should be logically
consistent.

>>> Could we have an extra rootfs-like fs that is always completely empty,
>>> doesn't allow any writes, and can sit at the bottom of container
>>> namespace hierarchies? If so, and if we add a new syscall that's like
>>> pivot_root (or unshare) but prunes the hierarchy, then we could switch
>>> to that rootfs then.
>>
>> Or equally have something that guarantees that rootfs is empty and
>> read-only at the time the normal root filesystem is mounted. That is
>> certainly a much more localized change if we want to go there.
>>
>> I am half tempted to suggest that mount --move /some/path / be updated
>> to make the old / just go away (perhaps to be replaced with a read-only
>> empty rootfs). That gets us into figuring out if we break userspace
>> which is a big challenge.
>
> Hence my argument for a new syscall or entirely new operation.

I'm still waiting for somebody to explain to my why chroot() shouldn't
be changed to do this instead of adding a new syscall. (At least when
mount namespace support is enabled.)

> mount(2) and friends are way too multiplexed right now. I just found
> yet another security bug due to the insanely complicated semantics of
> the vfs syscalls. (Yes, a different one from the one yesterday.)

As the guy who rewrote busybox mount 3 times, and who just implemented a
brand new one (toybox) from scratch:

It's a bit fiddly, yes.

> A new operation kills several birds with one stone. It could look like:
>
> int mntns_change_root(int dfd, const char *path, int flags);
>
> return -EPERM if chrooted.

Really?

> Returns -EINVAL if path (relative to dfd) isn't a mountmount.

Requiring that chroot() only be called on mountpoints would break
existing semantics, which gets us back to new systemcall instead of
changing behavior of existing one.

If I recall, the first line of pushback against merging the openvz code
as is was "buckets of new syscalls". Pushback against adding a new
system call is understandable. Why can't we fix chroot() now that we
have the tools to do so?

> Otherwise it disconnects path from the existing
> hierarchy, attaches a permanently-empty read-only rootfs under it,
> makes it the root of the mntns, and does the root refs fixup. The old
> hierarchy gets thrown out.

We have a chroot() syscall. We don't use it for containers because it
doesn't do what we want. Does it currently do what _anybody_ wants?

> Systemd could use this, too.

While that's a strong argument against it, I'm willing to overlook it.

Rob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/