Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root

From: Serge Hallyn
Date: Wed Oct 08 2014 - 19:39:14 EST

Quoting Andy Lutomirski (luto@xxxxxxxxxxxxxx):
> On Wed, Oct 8, 2014 at 2:36 PM, Rob Landley <rob@xxxxxxxxxxx> wrote:
> > On 10/08/14 14:31, Andy Lutomirski wrote:
> >> On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
> >> <ebiederm@xxxxxxxxxxxx> wrote:
> >>> Andy Lutomirski <luto@xxxxxxxxxxxxxx> writes:
> >>>>> Maybe we want to say that rootfs should not be used if we are going to
> >>>>> create containers...
> >>>
> >>> Today it is an assumption of the vfs that rootfs is mounted. With
> >>> rootfs mounted and pivot_root at the base of the mount stack you can
> >>> make as minimal of a set of mounts as the vfs allows.
> >>>
> >>> Removing rootfs from the vfs requires an audit of everything that
> >>> manipulates mounts. It is not remotely a local exercise.
> >>
> >> Would it be a less invasive audit to allow different mount namespaces
> >> to have different rootfses?
> >
> > I.E. The same way different namespaces have different init tasks?
> >
> > The abstraction that containers have implemented here should be logically
> > consistent.
> >
> >>>> Could we have an extra rootfs-like fs that is always completely empty,
> >>>> doesn't allow any writes, and can sit at the bottom of container
> >>>> namespace hierarchies? If so, and if we add a new syscall that's like
> >>>> pivot_root (or unshare) but prunes the hierarchy, then we could switch
> >>>> to that rootfs then.
> >>>
> >>> Or equally have something that guarantees that rootfs is empty and
> >>> read-only at the time the normal root filesystem is mounted. That is
> >>> certainly a much more localized change if we want to go there.
> >>>
> >>> I am half tempted to suggest that mount --move /some/path / be updated
> >>> to make the old / just go away (perhaps to be replaced with a read-only
> >>> empty rootfs). That gets us into figuring out if we break userspace
> >>> which is a big challenge.
> >>
> >> Hence my argument for a new syscall or entirely new operation.
> >
> > I'm still waiting for somebody to explain to me why chroot() shouldn't
> > be changed to do this instead of adding a new syscall. (At least when
> > mount namespace support is enabled.)
> Because chroot has no effect on the namespace at all. If you fork and
> the child chroots, the parent isn't chrooted. And, more importantly
> for my example, if a process has its cwd as /foo, and then it forks
> and the child chroots, then the parent's ".." isn't changed as a result of
> the chroot.
> >
> >> mount(2) and friends are way too multiplexed right now. I just found
> >> yet another security bug due to the insanely complicated semantics of
> >> the vfs syscalls. (Yes, a different one from the one yesterday.)
> >
> > As the guy who rewrote busybox mount 3 times, and who just implemented a
> > brand new one (toybox) from scratch:
> >
> > It's a bit fiddly, yes.
> >
> >> A new operation kills several birds with one stone. It could look like:
> >>
> >> int mntns_change_root(int dfd, const char *path, int flags);
> >>
> >> return -EPERM if chrooted.
> >
> > Really?
> Now that CVE-2014-7970 is public: what the heck is pivot_root supposed
> to do if the caller is chrooted? The current behavior is obviously
> incorrect (it leaks memory), but it's not entirely clear to me what
> should happen. I think it should either be disallowed or should have
> well-defined semantics.
> For simplicity, if a new syscall for this is added, then I think that
> the caller-is-chrooted case should be disallowed. If someone needs it
> and can articulate what the semantics should be, then I have no
> problem with allowing it going forward.

It's not that I'd have a need for that, but rather if for some
reason I started out chrooted due to some bogus initramfs, I'd
prefer to not have to feel like a criminal and escape the chroot