Re: [PATCH v4 0/2] NFS: Fix interaction between fs_context and user namespaces

From: Sargun Dhillon
Date: Wed Nov 11 2020 - 06:12:58 EST


On Tue, Nov 10, 2020 at 08:12:01PM +0000, Trond Myklebust wrote:
> On Tue, 2020-11-10 at 17:43 +0100, Alban Crequy wrote:
> > Hi,
> >
> > I tested the patches on top of 5.10.0-rc3+ and I could mount an NFS
> > share with a different user namespace. fsopen() is done in the
> > container namespaces (user, mnt and net namespaces) while fsconfig(),
> > fsmount() and move_mount() are done on the host namespaces. The mount
> > on the host is available in the container via mount propagation from
> > the host mount.
> >
> > With this, the files on the NFS server with uid 0 are available in
> > the container with uid 0. On the host, they are available with uid
> > 4294967294 (make_kuid(&init_user_ns, -2)).
> >
>
> Can someone please tell me what is broken with the _current_ design
> before we start trying to push "fixes" that clearly break it?

Currently, the mechanism for mounting nfs4 in a user namespace is as follows:

Parent: fork()
Child: setns(userns)
C: fsopen("nfs4") = 3
C->P: Send FD 3
P: fsconfig...
P: fsmount... (This is where the CAP_SYS_ADMIN check happens)
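
The "C->P: Send FD 3" step above does not say how the descriptor gets from
the child to the parent. One common way is SCM_RIGHTS over an AF_UNIX socket.
That choice, and the helper names send_fd() / recv_fd(), are assumptions for
illustration only and are not taken from the patches. The fsopen(), fsconfig()
and fsmount() calls themselves are left out, since they need a 5.2+ kernel and
the right privileges. A minimal sketch:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized for exactly one fd, aligned like a cmsghdr. */
union fd_ctrl {
    struct cmsghdr hdr;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Hypothetical helper: send one fd over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on failure. */
int send_fd(int sock, int fd)
{
    char data = 'F';    /* one byte of ordinary data carries the cmsg */
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd sent by send_fd().
 * Returns the new descriptor, or -1 on failure. */
int recv_fd(int sock)
{
    char data = 0;
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    if (msg.msg_flags & MSG_CTRUNC)     /* control data was cut short */
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL ||
        cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
    return fd;
}
```

In the flow above, the child would call send_fd() on the descriptor returned
by fsopen("nfs4") after setns(), and the parent would call recv_fd() and then
carry on with fsconfig() and fsmount() on the received descriptor.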


Right now, when you mount an NFS filesystem in a non-init user
namespace, and you have UIDs / GIDs on, the UIDs / GIDs which
are sent to the server are not the UIDs from the mounting namespace;
instead, they are the UIDs from the init user ns.

The reason for this is that you can call fsopen("nfs4") in the unprivileged
namespace, and that configures fs_context with all the right information for
that user namespace, but we currently require CAP_SYS_ADMIN in the init user
namespace to call fsmount. This means that the superblock's user namespace is
set "correctly" to the container, but there's absolutely no way for nfs4idmap
to consume an unprivileged user namespace.

This behaviour happens "the other way" as well, where the UID in the container
may be 0, but the corresponding kuid is 1000. When a response from an NFS
server comes in, we decode it according to the idmap userns[1]. The userns
used to create the idmap is captured at fsmount time, and not at fsopen
time. So, even if the filesystem is in the user namespace, and the server
responds with UID 0, it'll come up with an unmapped UID.

This is because we do:
Server UID 0 -> idmap make_kuid(init_user_ns, 0) -> VFS from_kuid(container_ns, 0) -> invalid uid
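
To make both directions concrete, here is a small userspace model of the
translation. It is only a sketch: the toy_* names, the extent table layout and
the example maps are all hypothetical, it assumes a single level of nesting
(the container's parent is the init user ns), and it is not the kernel's
make_kuid() / from_kuid() code. The container map sends uid 0 to kuid 1000, as
in the example above.

```c
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)       /* stands in for an unmapped id */

/* One uid_map-style extent: ids [inside, inside+count) in the namespace
 * correspond to kernel ids [outside, outside+count). */
struct extent { uint32_t inside, outside, count; };
struct toy_userns { const struct extent *ext; size_t n; };

/* Hypothetical example maps: identity for init, 0 -> 1000 for the container. */
static const struct extent toy_init_ext[] = { { 0, 0, 4294967295u } };
static const struct extent toy_container_ext[] = { { 0, 1000, 1 } };
const struct toy_userns toy_init_ns = { toy_init_ext, 1 };
const struct toy_userns toy_container_ns = { toy_container_ext, 1 };

/* Namespace-local id -> kernel id (the role make_kuid() plays). */
uint32_t toy_make_kuid(const struct toy_userns *ns, uint32_t uid)
{
    for (size_t i = 0; i < ns->n; i++) {
        const struct extent *e = &ns->ext[i];
        if (uid >= e->inside && uid - e->inside < e->count)
            return e->outside + (uid - e->inside);
    }
    return INVALID_ID;
}

/* Kernel id -> namespace-local id (the role from_kuid() plays). */
uint32_t toy_from_kuid(const struct toy_userns *ns, uint32_t kuid)
{
    for (size_t i = 0; i < ns->n; i++) {
        const struct extent *e = &ns->ext[i];
        if (kuid >= e->outside && kuid - e->outside < e->count)
            return e->inside + (kuid - e->outside);
    }
    return INVALID_ID;
}

/* Reply path: wire uid -> kuid via the idmap's ns, then shown to viewer_ns. */
uint32_t decode_wire_uid(const struct toy_userns *idmap_ns,
                         const struct toy_userns *viewer_ns,
                         uint32_t wire_uid)
{
    uint32_t kuid = toy_make_kuid(idmap_ns, wire_uid);

    if (kuid == INVALID_ID)
        return INVALID_ID;
    return toy_from_kuid(viewer_ns, kuid);
}

/* Request path: caller's uid -> kuid, then put on the wire via the idmap's ns. */
uint32_t encode_wire_uid(const struct toy_userns *idmap_ns,
                         const struct toy_userns *caller_ns,
                         uint32_t caller_uid)
{
    uint32_t kuid = toy_make_kuid(caller_ns, caller_uid);

    if (kuid == INVALID_ID)
        return INVALID_ID;
    return toy_from_kuid(idmap_ns, kuid);
}
```

In this model, decode_wire_uid(&toy_init_ns, &toy_container_ns, 0) yields
INVALID_ID, which is the chain shown above, and
encode_wire_uid(&toy_init_ns, &toy_container_ns, 0) puts 1000 on the wire
rather than 0. Passing &toy_container_ns as the idmap namespace gives 0 in
both directions.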

This is broken behaviour, in my humble opinion, as it makes it impossible to
use NFSv4 (and v3 for that matter) out of the box with unprivileged user
namespaces. At least in our environment, using usernames / GSS isn't an option,
so we have to rely on UIDs being set correctly [at least from the container's
perspective].


>
> The current design assumes that the user namespace being used is the one where
> the mount itself is performed. That means that the uids and gids or usernames
> and groupnames that go on the wire match the uids and gids of the container in
> which the mount occurred.
>

Right now, NFS does not have the ability for the fsmount() call to be
called in an unprivileged user namespace. We can change that behaviour
elsewhere if we want, but it's orthogonal to this.

> The assumption is that the server has authenticated that client as
> belonging to a domain that it recognises (either through strong
> RPCSEC_GSS/krb5 authentication, or through weaker matching of IP
> addresses to a list of acceptable clients).
>
I added a rejection for upcalls because upcalls can happen in the init
namespaces. We can drop that restriction from the nfs4 patch if you'd like. I
*believe* (and I'm a little out of my depth) that the request-key
handler gets called with the *network namespace* of the NFS mount,
but the userns is a privileged one, allowing for potential hazards.

The reason I added that block there is that I didn't imagine anyone was running
NFS in an unprivileged user namespace, and relying on upcalls (potentially into
privileged namespaces) in order to do authz.


> If you go ahead and change the user namespace on the client without
> going through the mount process again to mount a different super block
> with a different user namespace, then you will now get the exact same
> behaviour as if you do that with any other filesystem.

Not exactly, because other filesystems *only* use the s_user_ns for conversion
of UIDs, whereas NFS uses the current_cred() acquired at mount time, which
doesn't match s_user_ns, leading to this behaviour:

1. Mistranslated UIDs in encoding RPCs
2. The UID / GID exposed to VFS do not match the user ns.

>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx
>
>
-Thanks,
Sargun

[1]: https://elixir.bootlin.com/linux/v5.9.8/source/fs/nfs/nfs4idmap.c#L782
[2]: https://elixir.bootlin.com/linux/v5.9.8/source/fs/nfs/nfs4client.c#L1154