Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

From: James Bottomley
Date: Mon May 16 2016 - 15:13:26 EST


On Sat, 2016-05-14 at 21:21 -0500, Eric W. Biederman wrote:
> James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> writes:
>
> > On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
>
> Just a couple of quick comments from a very high level design point.
>
> - I think a shiftfs is valuable in the same way that overlayfs is
> valuable.
>
> Especially in the Docker case where a lot of containers want a shared
> base image (for efficiency), but it is desirable to run those
> containers in different user namespaces for safety.
>
> - It is also the plan to make it possible to mount a filesystem where
> the uids and gids of that filesystem on disk do not have a one to one
> mapping to kernel uids and gids. 99% of the work has already been done,
> for all filesystems except XFS.

Can you elaborate a bit more on why we want to do this? I think only
having a single shift of uid_t to kuid_t across the kernel to user
boundary is a nice feature of user namespaces. Architecturally, it's
not such a big thing to do it as the data goes on to the disk as well,
but what's the use case for it?

> That said there are some significant issues to work through, before
> something like that can be enabled.
>
> * Handling of uids/gids on disk that don't map into a kuid/kgid.

So I think this is nicely handled in the capability checks in
generic_permission() (capable_wrt_inode_uidgid()). Is there a need to
make it more complex (and thus more error prone)?
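
For reference, the shape of that check is roughly the following. This
is a userspace model, not the kernel code: the model_* names are made
up for the sketch, and using one id range for both uids and gids is a
simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a user namespace: one contiguous range of
 * kernel-wide ids is visible inside it (assumption: same range for
 * uids and gids), plus a flag standing in for the capability check. */
struct model_ns {
    uint32_t lower_first; /* first kernel-wide id mapped into the ns */
    uint32_t count;       /* number of ids in the range */
    bool has_cap;         /* stand-in for ns_capable(ns, cap) */
};

/* The ids stored in an inode are kernel-wide ids. */
struct model_inode {
    uint32_t i_uid;
    uint32_t i_gid;
};

/* True if the kernel-wide id falls inside the namespace's range. */
bool model_id_has_mapping(const struct model_ns *ns, uint32_t kid)
{
    return kid >= ns->lower_first && kid - ns->lower_first < ns->count;
}

/* The capability only counts if both inode ids map into the namespace. */
bool model_capable_wrt_inode(const struct model_ns *ns,
                             const struct model_inode *inode)
{
    assert(ns && inode);
    return ns->has_cap &&
           model_id_has_mapping(ns, inode->i_uid) &&
           model_id_has_mapping(ns, inode->i_gid);
}
```

In this model an inode whose ids fall outside the range is simply not
covered by the capability, whatever has_cap says.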

> * Safety from poisoned filesystem images.

By poisoned FS image, you mean an image over whose internal data the
user has control? That is, the basic problem of how we give users
write access to data devices that they can then cause to be mounted as
filesystems?

> I have slowly been working with Seth Forshee on these issues as
> the last thing I want is to introduce more security bugs right now.
> Seth being a braver man than I am has already merged his changes into
> the Ubuntu kernel.
>
> Right now we are targeting fuse, because fuse is already designed to
> handle poisoned filesystem images. So to safely enable this kind of
> mapping for fuse is not a giant step.
>
> The big thing from my point of view is to get the VFS interfaces
> correct so that the VFS handles all of the weird cases that come up
> with uids and gids that don't map, and any other weird cases. Keeping
> the weird bits out of the filesystems.

If by VFS interfaces, you mean where we've already got the mapping
confined, absolutely.

> James I think you are missing the fact that all filesystems already
> have the make_kuid and make_kgid calls right where the data comes off
> disk,

I beg to differ: they certainly don't. The underlying filesystem
populates the inode in ->lookup with the data off the disk, which goes
into the inode as a kuid_t/kgid_t. It remains forever in the inode as
that. We convert it as it goes out of the kernel in the stat calls
(actually stat.c:cp_old/new_stat()).
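
As a rough illustration of the id shift being argued about, here is a
userspace model of a single mapping extent. It is a sketch under
assumptions, not kernel code: the model_* names are invented here, and
only the field names first/lower_first/count follow the kernel's
extent layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for kuid_t: a kernel-wide id wrapped in a
 * struct so it cannot be mixed up with a namespace-local uid. */
typedef struct { uint32_t val; } model_kuid_t;

#define MODEL_INVALID_UID ((uint32_t)-1)

/* One contiguous mapping: ids [first, first + count) inside the
 * namespace correspond to [lower_first, lower_first + count). */
struct model_uid_extent {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

/* Namespace-local uid -> kernel-wide id (invalid if unmapped). */
model_kuid_t model_make_kuid(const struct model_uid_extent *e, uint32_t uid)
{
    model_kuid_t k = { MODEL_INVALID_UID };

    assert(e->count > 0);
    if (uid >= e->first && uid - e->first < e->count)
        k.val = e->lower_first + (uid - e->first);
    return k;
}

/* Kernel-wide id -> namespace-local uid, as would be done on the way
 * out to userspace (invalid if the id has no mapping). */
uint32_t model_from_kuid(const struct model_uid_extent *e, model_kuid_t k)
{
    assert(e->count > 0);
    if (k.val >= e->lower_first && k.val - e->lower_first < e->count)
        return e->first + (k.val - e->lower_first);
    return MODEL_INVALID_UID;
}
```

The two functions are inverses over the mapped range, so the open
question in this thread is only where in the stack each one is called.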

> and the from_kuid and from_kgid calls right where the on-disk data
> is being created just before it goes on disk. Which means that the
> actual impact on filesystems of the translation is trivial.

Are you looking at a different tree from me? I'm actually just looking
at Linus git head.

James