Re: [PATCH v2 00/28] user_namespace: introduce fsid mappings

From: James Bottomley
Date: Mon Feb 17 2020 - 18:04:13 EST

On Mon, 2020-02-17 at 16:57 -0500, Stéphane Graber wrote:
> On Mon, Feb 17, 2020 at 4:12 PM James Bottomley <
> James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
> > [...]
> > > With this patch series we simply introduce the ability to create
> > > fsid mappings that are different from the id mappings of a user
> > > namespace. The whole feature set is placed under a config option
> > > that defaults to false.
> > >
> > > In the usual case of running an unprivileged container we will
> > > have set up an id mapping, e.g. 0 100000 100000. The on-disk
> > > mapping will correspond to this id mapping, i.e. all files which
> > > we want to appear as 0:0 inside the user namespace will be
> > > chowned to 100000:100000 on the host. This works, because
> > > whenever the kernel needs to do a filesystem access it will
> > > look up the corresponding uid and gid in the idmapping tables of
> > > the container.
> > > Now think about the case where we want to have an id mapping of 0
> > > 100000 100000 but an on-disk mapping of 0 300000 100000 which is
> > > needed to e.g. share a single on-disk mapping with multiple
> > > containers that all have different id mappings.
> > > This will be problematic. Whenever a filesystem access is
> > > requested, the kernel will now try to look up a mapping for 300000
> > > in the id mapping tables of the user namespace but since there is
> > > none the files will appear to be owned by the overflow id, i.e.
> > > usually 65534:65534 or nobody:nogroup.
> > >
> > > With fsid mappings we can solve this by writing an id mapping of
> > > 0 100000 100000 and an fsid mapping of 0 300000 100000. On
> > > filesystem access the kernel will now look up the mapping for
> > > 300000 in the fsid mapping tables of the user namespace. And
> > > since such a mapping exists, the corresponding files will have
> > > correct ownership.
> >
> > How do we parametrise this new fsid shift for the unprivileged use
> > case? For newuidmap/newgidmap, it's easy because each user gets a
> > dedicated range and everything "just works (tm)". However, for the
> > fsid mapping, assuming some newfsuid/newfsgid tool to help, that
> > tool has to know not only your allocated uid/gid chunk, but also
> > the offset map of the image. The former is easy, but the latter is
> > going to vary by the actual image ... well unless we standardise
> > some accepted shift for images and it simply becomes a known static
> > offset.
> >
> For unprivileged runtimes, I would expect images to be unshifted and
> be unpacked from within a userns.

For images whose resting format is an archive like tar, I concur.

> So your unprivileged user would be allowed a uid/gid range through
> /etc/subuid and /etc/subgid and allowed to use them through
> newuidmap/newgidmap. In that namespace, you can then pull
> and unpack any images/layers you may want and the resulting fs tree
> will look correct from within that namespace.
> All that is possible today and is how for example unprivileged LXC
> works right now.

I do have a counterexample, but it might be more esoteric: I use
unprivileged architecture emulation containers to maintain actual
physical system boot environments. These are stored as mountable disk
images, not as archives, so I do need a simple remapping. However, I
think this use case is simple: it's a back shift along my owned uid/gid
range, so tools for allowing unprivileged use can easily cope with it.
The fsid mapping is then either the identity or a shift back along the
existing user_ns mapping.
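
To make the numbers in the quoted cover letter concrete, here is a minimal
sketch of the lookup it describes. It is a toy model only, not the kernel
code and not the interface of the patch series: the names Extent,
parse_map and to_namespace are hypothetical, and the only values taken
from the thread are the two mappings and the overflow id 65534.

```python
from typing import List, NamedTuple

OVERFLOW_ID = 65534  # usual overflow uid/gid (nobody:nogroup), per the cover letter


class Extent(NamedTuple):
    """One mapping line: '<first id in ns> <first id outside> <count>'."""
    ns_first: int
    host_first: int
    count: int


def parse_map(text: str) -> List[Extent]:
    """Parse mapping lines such as '0 100000 100000' (hypothetical helper)."""
    return [Extent(*map(int, line.split())) for line in text.strip().splitlines()]


def to_namespace(extents: List[Extent], host_id: int) -> int:
    """Translate an id from outside the namespace into the namespace's view.

    Ids not covered by any extent are reported as the overflow id.
    """
    for e in extents:
        if e.host_first <= host_id < e.host_first + e.count:
            return e.ns_first + (host_id - e.host_first)
    return OVERFLOW_ID


id_map = parse_map("0 100000 100000")    # process ids
fsid_map = parse_map("0 300000 100000")  # on-disk ids

on_disk_owner = 300000  # a file chowned to 300000 on the host
print(to_namespace(id_map, on_disk_owner))    # id map only: unmapped
print(to_namespace(fsid_map, on_disk_owner))  # fsid map: owned by 0
```

With only the id map, the file owner 300000 falls outside every extent and
shows up as the overflow id; with the separate fsid map it translates to 0.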

> What this patchset then allows is for containers to have differing
> uid/gid maps while still being based off the same image or layers.
> In this scenario, you would carve a subset of your main uid/gid map
> for each container you run and run them in a child user namespace
> while setting up a fsuid/fsgid map such that their filesystem accesses
> do not follow their uid/gid map. This then results in proper
> isolation for processes, networks, ... as everything runs as
> different kuid/kgid but the VFS view will be the same in all
> containers.

Who owns the shifted range of the image ... all tenants or none?

> Shared storage between those otherwise isolated containers would also
> work just fine by simply bind-mounting the same path into two or more
> containers.
> Now one additional thing that would be safe for a setuid wrapper to
> allow would be for arbitrary mapping of any of the uid/gid that the
> user owns to be used within the fsuid/fsgid map. One potential use
> for this would be to create any number of user namespaces, each with
> their own mapping for uid 0 while still having all VFS access be
> mapped to the user that spawned them (say uid=1000, gid=1000).
> Note that in our case, the intended use for this is from a privileged
> runtime where our images would be unshifted as would be the container
> storage and any shared storage for containers. The security model
> effectively relies on properly configured filesystem permissions and
> mount namespaces such that the content of those paths can never be
> seen by anyone but root outside of those containers (and therefore
> avoids all the issues around setuid/setgid/fscaps).

Yes, I understand ... all orchestration systems are currently hugely
privileged. However, there is interest in getting them down to only
"slightly privileged".


> We will then be able to allocate distinct, random, ranges of 65536
> uids/gids (or more) for each container without ever having to do any
> uid/gid shifting at the filesystem layer or run into issues when
> having to set up shared storage between containers or attaching
> external storage volumes to those containers.