Re: [PATCH 00/24] user_namespace: introduce fsid mappings

From: Christian Brauner
Date: Wed Feb 12 2020 - 09:51:55 EST


On Tue, Feb 11, 2020 at 09:55:46PM +0100, Jann Horn via Containers wrote:
> On Tue, Feb 11, 2020 at 5:59 PM Christian Brauner
> <christian.brauner@xxxxxxxxxx> wrote:
> > This is the implementation of shiftfs which was cooked up during lunch at
> > Linux Plumbers 2019 the day after the container's microconference. The
> > idea is a design-stew from StÃphane, Aleksa, Eric, and myself. Back then
> > we all were quite busy with other work and couldn't really sit down and
> > implement it. But I took a few days last week to do this work, including
> > demos and performance testing.
> > This implementation does not require us to touch the vfs substantially
> > at all. Instead, we implement shiftfs via fsid mappings.
> > With this patch, it took me 20 mins to port both LXD and LXC to support
> > shiftfs via fsid mappings.
> >
> > For anyone wanting to play with this the branch can be pulled from:
> > https://github.com/brauner/linux/tree/fsid_mappings
> > https://gitlab.com/brauner/linux/-/tree/fsid_mappings
> > https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=fsid_mappings
> >
> > The main use case for shiftfs for us is in allowing shared writable
> > storage to multiple containers using non-overlapping id mappings.
> > In such a scenario you want the fsids to be valid and identical in both
> > containers for the shared mount. A demo for this exists in [3].
> > If you don't want to read on, go straight to the other demos below in
> > [1] and [2].
>
> I guess essentially this means that you want to have UID separation
> between containers to prevent the containers - or their owners - from
> interfering between each other, but for filesystem access, you don't
> want to isolate them from each other using DAC controls on the files
> and folders inside the containers' directory hierarchies, instead
> relying on mode-0700 parent directories to restrict access to the
> container owner? Or would you still have separate UIDs for e.g. the
> container's UID range 0-65535, and then map the shared UID range at
> 100000, or something like that?

Yes.
So if you look at the permissions right now for the directory under
which the rootfs for the container and other stuff resides we have
root@wittgenstein|/var/lib/lxd/storage-pools/zfs/containers
> perms *
d--x------ 100 alp1
d--x------ 100 f1
d--x------ 100 f2

We don't really share the rootfs between containers right now since we
treat them as standalone systems but with fsid mappings that's possible
too. Layer-sharing-centric runtimes very much will want something like
that.

>
> > People not as familiar with user namespaces might not be aware that fsid
> > mappings already exist. Right now, fsid mappings are always identical to
> > id mappings. Specifically, the kernel will lookup fsuids in the uid
> > mappings and fsgids in the gid mappings of the relevant user namespace.
>
> That's a bit like saying that a kernel without CONFIG_USER_NS still
> has user ID mappings, they just happen to be identity mappings. :P

If you have CONFIG_USER_NS=n then you have (as you're well aware)
[<0, 0>, <1,1>, ..., <n,n>] so yeah that's true and analyzing it like
that makes sense. :P

>
> > With this patch series we simply introduce the ability to create fsid
> > mappings that are different from the id mappings of a user namespace.
> >
> > In the usual case of running an unprivileged container we will have
> > setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will
> > correspond to this id mapping, i.e. all files which we want to appear as
> > 0:0 inside the user namespace will be chowned to 100000:100000 on the
> > host. This works, because whenever the kernel needs to do a filesystem
> > access it will lookup the corresponding uid and gid in the idmapping
> > tables of the container.
> > Now think about the case where we want to have an id mapping of 0 100000
> > 100000 but an on-disk mapping of 0 300000 100000 which is needed to e.g.
> > share a single on-disk mapping with multiple containers that all have
> > different id mappings.
> > This will be problematic. Whenever a filesystem access is requested, the
> > kernel will now try to lookup a mapping for 300000 in the id mapping
> > tables of the user namespace but since there is none the files will
> > appear to be owned by the overflow id, i.e. usually 65534:65534 or
> > nobody:nogroup.
> >
> > With fsid mappings we can solve this by writing an id mapping of 0
> > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> > access the kernel will now lookup the mapping for 300000 in the fsid
> > mapping tables of the user namespace. And since such a mapping exists,
> > the corresponding files will have correct ownership.
>
> Sorry to bring up something as disgusting as setuid execution, but:

No that's exactly what this needs. :)

> What happens when there's a setuid root file with ->i_uid==300000? I
> guess the only way to make that work inside the containers would be
> something like make_kuid(current_user_ns(),
> from_kfsuid(current_user_ns(), inode->i_uid)) in the setuid execve
> path?

What's the specific callpath you're thinking about?

So if you look at patch
https://lore.kernel.org/lkml/20200211165753.356508-16-christian.brauner@xxxxxxxxxx/
it does
- new->suid = new->fsuid = new->euid;
- new->sgid = new->fsgid = new->egid;
+ fsuid = from_kuid_munged(new->user_ns, new->euid);
+ kfsuid = make_kfsuid(new->user_ns, fsuid);
+ new->suid = new->euid;
+ new->fsuid = kfsuid;
+
+ fsgid = from_kgid_munged(new->user_ns, new->egid);
+ kfsgid = make_kfsgid(new->user_ns, fsgid);
+ new->sgid = new->egid;
+ new->fsgid = kfsgid;

One thing I definitely missed though in the setuid path is to adapt
fs/exec.c:bprm_fill_uid():

diff --git a/fs/exec.c b/fs/exec.c
index 74d88dab98dd..ad839934fdff 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1547,8 +1547,8 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
inode_unlock(inode);

/* We ignore suid/sgid if there are no mappings for them in the ns */
- if (!kuid_has_mapping(bprm->cred->user_ns, uid) ||
- !kgid_has_mapping(bprm->cred->user_ns, gid))
+ if (!kfsuid_has_mapping(bprm->cred->user_ns, uid) ||
+ !kfsgid_has_mapping(bprm->cred->user_ns, gid))
return;

if (mode & S_ISUID) {

>
> > A note on proc (and sys), the proc filesystem is special in sofar as it
> > only has a single superblock that is (currently but might be about to
> > change) visible in all user namespaces (same goes for sys). This means
> > it has special semantics in many ways, including how file ownership and
> > access works. The fsid mapping implementation does not alter how proc
> > (and sys) ownership works. proc and sys will both continue to lookup
> > filesystem access in id mapping tables.
>
> In your example, a process with namespaced UID set (0, 0, 0, 0) will
> have kernel UIDs (100000, 100000, 100000, 300000), right? And then if

Yes.

> I want to open /proc/$pid/personality of another process with the same
> UIDs, may_open() will call inode_permission() -> do_inode_permission()
> -> generic_permission() -> acl_permission_check(), which will compare
> current_fsuid() (which is 300000) against inode->i_uid. But
> inode->i_uid was filled by proc_pid_make_inode()->task_dump_owner(),
> which set inode->i_uid to 100000, right?

Yes. That should be fixable by something like below, I think. (And we can
probably shortcut this by adding a helper that does tell us whether there's
been any fsid mapping setup or not for this user namespace.)
static int acl_permission_check(struct inode *inode, int mask)
{
+ kuid_t kuid;
unsigned int mode = inode->i_mode;

- if (likely(uid_eq(current_fsuid(), inode->i_uid)))
+ if (!is_userns_visible(inode->i_sb->s_iflags)) {
+ kuid = inode->i_uid;
+ } else {
+ kuid = make_kuid(current_user_ns(),
+ from_kfsuid(current_user_ns(), inode->i_uid));
+ }
+
+ if (likely(uid_eq(current_fsuid(), kuid)))
mode >>= 6;
else {&& (mode & S_IRWXG)) {

>
> Also, e.g. __ptrace_may_access() uses cred->fsuid for a comparison
> with another task's real/effective/saved UID.

Right, you even introduced this check in 2015 iirc.
Both of your points make me think that it'd be easiest to introduce
cred->{kfsuid,kfsgid} and whenever an access decision on a
is_userns_visible() filesystem has to be made those will be used. This avoids
having to do on-the fly translations and ptrace_may_access() can just grow a
flag indicating what fscreds it's supposed to look at?

>
> [...]
> > # Demos
> > [1]: Create a container with different id and fsid mappings.
> > https://asciinema.org/a/300233
> > [2]: Create a container with id mappings but without fsid mappings.
> > https://asciinema.org/a/300234
> > [3]: Share storage between multiple containers with non-overlapping id
> > mappings.
> > https://asciinema.org/a/300235
>
> (I really dislike this asciinema thing; if you want to quickly glance
> through the output instead of reading at the same speed as it was
> typed, a simple pastebin works much better unless you absolutely have
> to show things that use stuff like ncurses UI.)

Hmkay, I went through the trouble of converting the asciinema output to
basic shell for all tree demos. :) I made them available as github gists.
So:
demo1: https://gist.github.com/brauner/8e1117720b3f9fab22e44c17f12184bf
demo2: https://gist.github.com/brauner/41a36026a9a1496af0095dce1545548e
demo3: https://gist.github.com/brauner/4586d6bc680a018bc8e1dd114a45592a