Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

From: Eric W. Biederman
Date: Sat May 14 2016 - 22:33:19 EST


James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> writes:

> On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:

Just a couple of quick comments from a very high level design point.

- I think a shiftfs is valuable in the same way that overlayfs is
valuable.

Especially in the Docker case where a lot of containers want a shared
base image (for efficiency), but it is desirable to run those
containers in different user namespaces for safety.

- It is also the plan to make it possible to mount a filesystem where
the uids and gids of that filesystem on disk do not have a one to one
mapping to kernel uids and gids. 99% of the work has already been done,
for all filesystems except XFS.

That said there are some significant issues to work through, before
something like that can be enabled.

* Handling of uids/gids on disk that don't map into a kuid/kgid.
* Safety from poisoned filesystem images.

I have slowly been working with Seth Forshee on these issues as
the last thing I want is to introduce more security bugs right now.
Seth being a braver man than I am has already merged his changes into
the Ubuntu kernel.

Right now we are targeting fuse, because fuse is already designed to
handle poisoned filesystem images. So safely enabling this kind of
mapping for fuse is not a giant step.

The big thing from my point of view is to get the VFS interfaces
correct so that the VFS handles all of the weird cases that come up
with uids and gids that don't map, and any other weird cases, keeping
the weird bits out of the filesystems.

James, Djalal, I regret I have not been able to read through either of
your patches closely yet. From a high level view I believe there are
use cases for both approaches, and the use cases do not necessarily
overlap.

Djalal I think you are seeing the upsides and not the practical dangers
of poisoned filesystem images.

James I think you are missing the fact that all filesystems already have
the make_kuid and make_kgid calls right where the data comes off disk,
and the from_kuid and from_kgid calls right where the on-disk data is
being created just before it goes on disk. Which means that the actual
impact on filesystems of the translation is trivial.
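
To make the translation point above concrete, here is a minimal
userspace sketch in C of the round trip that make_kuid and from_kuid
perform. It is not kernel code: the model_* names, the single-extent
map, and the plain uint32_t ids are assumptions made for illustration,
whereas the kernel works with kuid_t and a struct user_namespace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's invalid id (assumption for this sketch). */
#define MODEL_INVALID_ID ((uint32_t)-1)

/*
 * One extent of an id map: `count` ids starting at `first` (the id as
 * stored on disk) map to ids starting at `lower_first` (kernel-wide).
 */
struct model_extent {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

/* Disk -> kernel direction, modelling what make_kuid() does. */
static uint32_t model_make_kuid(const struct model_extent *e, uint32_t uid)
{
    assert(e != NULL);
    if (uid >= e->first && uid - e->first < e->count)
        return e->lower_first + (uid - e->first);
    return MODEL_INVALID_ID;    /* the on-disk id has no mapping */
}

/* Kernel -> disk direction, modelling what from_kuid() does. */
static uint32_t model_from_kuid(const struct model_extent *e, uint32_t kuid)
{
    assert(e != NULL);
    if (kuid >= e->lower_first && kuid - e->lower_first < e->count)
        return e->first + (kuid - e->lower_first);
    return MODEL_INVALID_ID;    /* the kernel id cannot be stored */
}
```

With a map of { first = 0, lower_first = 100000, count = 65536 },
on-disk uid 1000 becomes 101000 in the kernel and translates back to
1000 on the way out, while on-disk uid 65536 falls outside the extent
and comes back as the invalid id.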

Where the actual impact on filesystems is much higher is the
infrastructure needed to ensure poisoned filesystem images do not cause
a kernel compromise. That extends to the filesystem testing and code
review process and beyond, and is more than just a kernel problem.
Hardening the attack surface on the disk side of filesystems is
difficult, especially without impacting filesystem performance.


So I don't think it makes sense to frame this as an either/or situation.
I think there is a need for both solutions.

Djalal if you could work with Seth I think that would be very useful. I
know I am dragging my heels there but I really hope I can dig in and get
everything reviewed and merged soonish.

James if you could see shiftfs with a different set of merits than what
Djalal is doing I think that would be useful, as it would allow
everyone to concentrate on getting the bugs out of their solutions.



That said I am not certain shiftfs makes sense without Seth's patches to
handle the weird cases at the VFS level. What do you do with uids and
gids that don't map? You can reinvent how to handle the strange cases
in shiftfs or we can work on solving this problem at the VFS level so
people don't have to go through the error-prone work of reinventing
solutions.
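
As an illustration of the kind of decision involved, here is a sketch
in userspace C of one conservative policy for ids that don't map. It is
an assumption made for this example, not a description of what Seth's
patches or shiftfs actually do: the sketch_* names and the choice of
-EACCES are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's invalid id (assumption for this sketch). */
#define SKETCH_INVALID_ID ((uint32_t)-1)

/* Hypothetical in-memory inode; only the owner ids matter here. */
struct sketch_inode {
    uint32_t kuid;
    uint32_t kgid;
};

/* True when either owner id failed to map when it was read from disk. */
static bool sketch_has_unmapped_id(const struct sketch_inode *inode)
{
    assert(inode != NULL);
    return inode->kuid == SKETCH_INVALID_ID ||
           inode->kgid == SKETCH_INVALID_ID;
}

/*
 * One possible policy: an inode with an unmapped owner is never
 * modified, so an id that cannot be represented is never written back.
 * Returns 0 when a write may proceed, or a negative errno.
 */
static int sketch_may_write(const struct sketch_inode *inode)
{
    return sketch_has_unmapped_id(inode) ? -EACCES : 0;
}
```

Making this decision once, in a single helper, is the point: every
filesystem gets the same answer instead of reinventing it.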


The big ugly nasty in all of this is that we are fundamentally dealing
with uids and gids which are security identifiers. Practically any bug
is exploitable and CVE worthy. So it makes sense to tread very
carefully. Even with care it can take months if not years to get
the number of bugs down to a level where you are not the favorite target
of people looking for exploitable kernel bugs.

Eric