Re: [PATCH v4 0/2] NFS: Fix interaction between fs_context and user namespaces
From: Trond Myklebust
Date: Wed Nov 11 2020 - 15:03:25 EST
On Wed, 2020-11-11 at 18:57 +0000, Sargun Dhillon wrote:
> On Wed, Nov 11, 2020 at 02:38:11PM +0000, Trond Myklebust wrote:
> > On Wed, 2020-11-11 at 11:12 +0000, Sargun Dhillon wrote:
> > > On Tue, Nov 10, 2020 at 08:12:01PM +0000, Trond Myklebust wrote:
> > > > On Tue, 2020-11-10 at 17:43 +0100, Alban Crequy wrote:
> > > > > Hi,
> > > > >
> > > > > I tested the patches on top of 5.10.0-rc3+ and I could mount an
> > > > > NFS share with a different user namespace. fsopen() is done in
> > > > > the container namespaces (user, mnt and net namespaces) while
> > > > > fsconfig(), fsmount() and move_mount() are done on the host
> > > > > namespaces. The mount on the host is available in the container
> > > > > via mount propagation from the host mount.
> > > > >
> > > > > With this, the files on the NFS server with uid 0 are available
> > > > > in the container with uid 0. On the host, they are available
> > > > > with uid 4294967294 (make_kuid(&init_user_ns, -2)).
> > > > >
> > > >
> > > > Can someone please tell me what is broken with the _current_
> > > > design before we start trying to push "fixes" that clearly break
> > > > it?
> > > Currently the mechanism of mounting nfs4 in a user namespace is as
> > > follows:
> > >
> > > Parent: fork()
> > > Child: setns(userns)
> > > C: fsopen("nfs4") = 3
> > > C->P: Send FD 3
> > > P: FSConfig...
> > > P: fsmount... (This is where the CAP_SYS_ADMIN check happens)
> > >
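For reference, a minimal single-process sketch of that sequence (the SCM_RIGHTS fd passing between child and parent is elided; the server address, export, options and mount point below are placeholders, and it assumes the new mount API syscall numbers from 5.2+ headers):

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <linux/mount.h>

static int mount_nfs4_from_userns(int userns_fd)
{
        int fsfd, mfd;

        /* "Child" side: enter the container's user namespace first */
        if (setns(userns_fd, CLONE_NEWUSER) < 0)
                return -1;

        /* fsopen() in the container's userns; the fs_context is created here */
        fsfd = syscall(__NR_fsopen, "nfs4", 0);
        if (fsfd < 0)
                return -1;

        /* ... fsfd would normally be passed back to the privileged parent ... */

        /* "Parent" side: configure and create; the options are illustrative */
        syscall(__NR_fsconfig, fsfd, FSCONFIG_SET_STRING, "source",
                "192.0.2.1:/export", 0);
        syscall(__NR_fsconfig, fsfd, FSCONFIG_SET_STRING, "addr",
                "192.0.2.1", 0);
        syscall(__NR_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

        /* fsmount() is where the CAP_SYS_ADMIN check happens */
        mfd = syscall(__NR_fsmount, fsfd, 0, 0);
        if (mfd < 0)
                return -1;

        return syscall(__NR_move_mount, mfd, "", AT_FDCWD, "/mnt",
                       MOVE_MOUNT_F_EMPTY_PATH);
}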
> > >
> > > Right now, when you mount an NFS filesystem in a non-init user
> > > namespace, and you have UIDs / GIDs on, the UIDs / GIDs which are
> > > sent to the server are not the UIDs from the mounting namespace;
> > > instead they are the UIDs from the init user ns.
> > >
> > > The reason for this is that you can call fsopen("nfs4") in the
> > > unprivileged namespace, and that configures fs_context with all the
> > > right information for that user namespace, but we currently require
> > > CAP_SYS_ADMIN in the init user namespace to call fsmount. This means
> > > that the superblock's user namespace is set "correctly" to the
> > > container, but there's absolutely no way for nfs4uidmap to consume
> > > an unprivileged user namespace.
> > >
> > > This behaviour happens "the other way" as well, where the UID in
> > > the container may be 0, but the corresponding kuid is 1000. When a
> > > response from an NFS server comes in we decode it according to the
> > > idmap userns[1]. The userns used to create the idmap is generated
> > > at fsmount time, and not at fsopen time. So, even if the filesystem
> > > is in the user namespace, and the server responds with UID 0, it'll
> > > come up with an unmapped UID.
> > >
> > > This is because we do:
> > >
> > >   Server UID 0 -> idmap make_kuid(init_user_ns, 0)
> > >                -> VFS from_kuid(container_ns, 0) -> invalid uid
> > >
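Spelled out as an illustrative (not actual kernel) helper, assuming the idmap decode stays pinned to init_user_ns while the caller sits in a container user namespace:

#include <linux/uidgid.h>
#include <linux/user_namespace.h>

/* Illustrative only: why a server uid of 0 surfaces as an invalid id */
static uid_t example_decode(struct user_namespace *container_ns,
                            uid_t server_uid)
{
        /* idmap side: decoded against init_user_ns today */
        kuid_t kuid = make_kuid(&init_user_ns, server_uid); /* 0 -> kuid 0 */

        /* VFS side: translated back for the container's userns;
         * with no mapping for kuid 0 this returns (uid_t)-1 */
        return from_kuid(container_ns, kuid);
}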
> > > This is broken behaviour, in my humble opinion, as it makes it
> > > impossible to use NFSv4 (and v3 for that matter) out of the box
> > > with unprivileged user namespaces. At least in our environment,
> > > using usernames / GSS isn't an option, so we have to rely on UIDs
> > > being set correctly [at least from the container's perspective].
> > >
> >
> > The current code for setting server->cred was developed independently
> > of fsopen() (and predates it actually). I'm fine with the change to
> > have server->cred be the cred of the user that called fsopen(). That's
> > in line with what we used to do for sys_mount().
> >
> Just curious, without FS_USERNS, how were you mounting NFSv4 in an
> unprivileged user ns?
The code was originally developed on a 5.1 kernel. So all my testing
has been with ordinary sys_mount() calls in a container that had
CAP_SYS_ADMIN privileges.
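(For comparison, that older path is just a plain mount(2) issued from inside a container holding CAP_SYS_ADMIN, roughly along these lines; the server address and options are placeholders:)

#include <sys/mount.h>

/* Illustrative only: the kernel does not resolve hostnames, so the
 * resolved server address is passed via addr= as mount.nfs would do. */
static int old_style_nfs4_mount(void)
{
        return mount("192.0.2.1:/export", "/mnt", "nfs4", 0,
                     "vers=4.2,addr=192.0.2.1");
}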
> > However all the other stuff to throw errors when the user namespace
> > is not init_user_ns introduces massive regressions.
> >
>
> I can remove that and respin the patch. How do you feel about that? I
> would still like to keep the log lines though, because it is a uapi
> change. I am worried that someone might exercise this path with GSS and
> allow for upcalls into the main namespaces by accident -- or be
> confused about why they're seeing upcalls "in a different namespace".
>
> Are you okay with picking up ("NFS: NFSv2/NFSv3: Use cred from
> fs_context during mount") without any changes?
Why do we need the dprintk()s? It seems to me that either they should
be reporting something that the user needs to know (in which case they
should be real printk()s) or they are telling us something that we
should already know. To me they seem to fit more in the latter
category.
>
> I can respin ("NFSv4: Refactor NFS to use user namespaces") without:
>
>         /*
>          * nfs4idmap is not fully isolated by user namespaces. It is
>          * currently only network namespace aware. If upcalls never
>          * happen, we do not need to worry as nfs_client instances
>          * aren't shared between user namespaces.
>          */
>         if (idmap_userns(server->nfs_client->cl_idmap) != &init_user_ns &&
>             !(server->caps & NFS_CAP_UIDGID_NOMAP)) {
>                 error = -EINVAL;
>                 errorf(fc, "Mount credentials are from non init user namespace and ID mapping is enabled. This is not allowed.");
>                 goto error;
>         }
>
> (and making it so we can call idmap_userns)
>
Yes. That would be acceptable. Again, though, I'd like to see the
dprintk()s gone.
> > > >
> > > > The current design assumes that the user namespace being used is
> > > > the one where the mount itself is performed. That means that the
> > > > uids and gids or usernames and groupnames that go on the wire
> > > > match the uids and gids of the container in which the mount
> > > > occurred.
> > > >
> > >
> > > Right now, NFS does not allow the fsmount() call to be made in an
> > > unprivileged user namespace. We can change that behaviour elsewhere
> > > if we want, but it's orthogonal to this.
> > >
> > > > The assumption is that the server has authenticated that client
> > > > as belonging to a domain that it recognises (either through
> > > > strong RPCSEC_GSS/krb5 authentication, or through weaker matching
> > > > of IP addresses to a list of acceptable clients).
> > > >
> > > I added a rejection for upcalls because upcalls can happen in the
> > > init namespaces. We can drop that restriction from the nfs4 patch
> > > if you'd like. I *believe* (and I'm not a little out of my depth)
> > > that the request-key handler gets called with the *network
> > > namespace* of the NFS mount, but the userns is a privileged one,
> > > allowing for potential hazards.
> > >
> >
> > The idmapper already rejects upcalls to the keyring
> > '/sbin/request-key' utility if you're running with your own user
> > namespace.
> >
> > Quite frankly, switching to using the keyring was a mistake which I'd
> > undo if I could. Aside from not supporting containers, it is horribly
> > slow due to requiring a full process startup/teardown for every
> > upcall, so it scales poorly to large numbers of identities
> > (particularly with an operation like readdir() in which you're doing
> > serial upcalls).
> >
> > However nothing stops you from using the old NFSv4 idmapper daemon
> > (a.k.a. rpc.idmapd) in the context of the container that called
> > fsopen() so that it can translate identities correctly using whatever
> > userspace tools (ldap, sssd, winbind...) that the container has
> > configured.
> >
>
> 1. We see this as a potential security risk [this being upcalls] into
> the unconfined portion of the system. Although I'm sure that the
> userspace handlers are written perfectly well, it allows for
> information leakage to occur.
>
> 2. Is there a way to do this for NFSv3?
>
> 3. Can rpc.idmapd get the user namespace that the call is from (and is
> the keyring per-userns?). In general, I think that this change follows
> the principle of least surprise.
>
> > > The reason I added that block there is that I didn't imagine anyone
> > > was running NFS in an unprivileged user namespace, and relying on
> > > upcalls (potentially into privileged namespaces) in order to do
> > > authz.
> > >
> > >
> > > > If you go ahead and change the user namespace on the client
> > > > without going through the mount process again to mount a
> > > > different super block with a different user namespace, then you
> > > > will now get the exact same behaviour as if you do that with any
> > > > other filesystem.
> > >
> > > Not exactly, because other filesystems *only* use the s_user_ns for
> > > conversion of UIDs, whereas NFS uses the current_cred() acquired at
> > > mount time, which doesn't match s_user_ns, leading to this
> > > behaviour:
> > >
> > > 1. Mistranslated UIDs in encoding RPCs
> > > 2. The UID / GID exposed to VFS do not match the user ns.
> > >
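As an illustrative contrast only (the NFS-side helper below is simplified and not the real decode path): a generic filesystem interprets an on-disk uid against sb->s_user_ns, while the NFS owner coming off the wire goes through the idmap, whose user namespace is derived from the cred captured at mount time:

#include <linux/fs.h>
#include <linux/uidgid.h>
#include <linux/user_namespace.h>

/* Generic filesystem: a raw uid from disk is interpreted in the
 * superblock's user namespace (this is what i_uid_write() does). */
static void generic_set_owner(struct inode *inode, uid_t raw_uid)
{
        inode->i_uid = make_kuid(inode->i_sb->s_user_ns, raw_uid);
}

/* NFS, simplified: the uid decoded from the wire goes through the
 * idmap, whose user namespace comes from the mount-time cred and
 * need not match s_user_ns. */
static void nfs_set_owner(struct inode *inode,
                          struct user_namespace *idmap_ns, uid_t wire_uid)
{
        inode->i_uid = make_kuid(idmap_ns, wire_uid);
}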
> > >
> >
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx