Re: [PATCH v4 0/2] NFS: Fix interaction between fs_context and user namespaces

From: Trond Myklebust
Date: Wed Nov 11 2020 - 09:40:20 EST

Next message: Stanimir Varbanov: "[PATCH v2 4/8] venus: helpers: Calculate properly compressed buffer size"
Previous message: Ben Hutchings: "Re: [PATCH 4.19 19/71] btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()"
In reply to: Sargun Dhillon: "Re: [PATCH v4 0/2] NFS: Fix interaction between fs_context and user namespaces"
Next in thread: Sargun Dhillon: "Re: [PATCH v4 0/2] NFS: Fix interaction between fs_context and user namespaces"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, 2020-11-11 at 11:12 +0000, Sargun Dhillon wrote:
> On Tue, Nov 10, 2020 at 08:12:01PM +0000, Trond Myklebust wrote:
> > On Tue, 2020-11-10 at 17:43 +0100, Alban Crequy wrote:
> > > Hi,
> > >
> > > I tested the patches on top of 5.10.0-rc3+ and I could mount an
> > > NFS
> > > share with a different user namespace. fsopen() is done in the
> > > container namespaces (user, mnt and net namespaces) while
> > > fsconfig(),
> > > fsmount() and move_mount() are done on the host namespaces. The
> > > mount
> > > on the host is available in the container via mount propagation
> > > from
> > > the host mount.
> > >
> > > With this, the files on the NFS server with uid 0 are available
> > > in
> > > the
> > > container with uid 0. On the host, they are available with uid
> > > 4294967294 (make_kuid(&init_user_ns, -2)).
> > >
> >
> > Can someone please tell me what is broken with the _current_ design
> > before we start trying to push "fixes" that clearly break it?
> Currently the mechanism of mounting nfs4 in a user namespace is as
> follows:
>
> Parent: fork()
> Child: setns(userns)
> C: fsopen("nfs4") = 3
> C->P: Send FD 3
> P: FSConfig...
> P: fsmount... (This is where the CAP_SYS_ADMIN check happens))
>
>
> Right now, when you mount an NFS filesystem in a non-init user
> namespace, and you have UIDs / GIDs on, the UIDs / GIDs which
> are sent to the server are not the UIDs from the mounting namespace,
> instead they are the UIDs from the init user ns.
>
> The reason for this is that you can call fsopen("nfs4") in the
> unprivileged
> namespace, and that configures fs_context with all the right
> information for
> that user namespace, but we currently require CAP_SYS_ADMIN in the
> init user
> namespace to call fsmount. This means that the superblock's user
> namespace is
> set "correctly" to the container, but there's absolutely no way
> nfs4uidmap
> to consume an unprivileged user namespace.
>
> This behaviour happens "the other way" as well, where the UID in the
> container
> may be 0, but the corresponding kuid is 1000. When a response from an
> NFS
> server comes in we decode it according to the idmap userns[1]. The
> userns
> used to get create idmap is generated at fsmount time, and not as
> fsopen
> time. So, even if the filesystem is in the user namespace, and the
> server
> responds with UID 0, it'll come up with an unmapped UID.
>
> This is because we do
> Server UID 0 -> idmap make_kuid(init_user_ns, 0) -> VFS
> from_kuid(container_ns, 0) -> invalid uid
>
> This is broken behaviour, in my humble opinion as is it makes it
> impossible to
> use NFSv4 (and v3 for that matter) out of the box with unprivileged
> user
> namespaces. At least in our environment, using usernames / GSS isn't
> an option,
> so we have to rely on UIDs being set correctly [at least from the
> container's
> perspective].
>

The current code for setting server->cred was developed independently
of fsopen() (and predates it actually). I'm fine with the change to
have server->cred be the cred of the user that called fsopen(). That's
in line with what we used to do for sys_mount().

However all the other stuff to throw errors when the user namespace is
not init_user_ns introduces massive regressions.

> >
> > The current design assumes that the user namespace being used is
> > the one where
> > the mount itself is performed. That means that the uids and gids or
> > usernames
> > and groupnames that go on the wire match the uids and gids of the
> > container in
> > which the mount occurred.
> >
>
> Right now, NFS does not have the ability for the fsmount() call to be
> called in an unprivileged user namespace. We can change that
> behaviour
> elsewhere if we want, but it's orthogonal to this.
>
> > The assumption is that the server has authenticated that client as
> > belonging to a domain that it recognises (either through strong
> > RPCSEC_GSS/krb5 authentication, or through weaker matching of IP
> > addresses to a list of acceptable clients).
> >
> I added a rejection for upcalls because upcalls can happen in the
> init
> namespaces. We can drop that restriction from the nfs4 patch if you'd
> like. I
> *believe* (and I'm not a little out of my depth) that the request-key
> handler gets called with the *network namespace* of the NFS mount,
> but the userns is a privileged one, allowing for potential hazards.
>

The idmapper already rejects upcalls to the keyring '/sbin/request-key'
utility if you're running with your own user namespace.

Quite frankly, switching to using the keyring was a mistake which I'd
undo if I could. Aside from not supporting containers, it is horribly
slow due to requiring a full process startup/teardown for every upcall,
so it scales poorly to large numbers of identities (particularly with
an operation like readdir() in which you're doing serial upcalls).

However nothing stops you from using the old NFSv4 idmapper daemon
(a.k.a. rpc.idmapd) in the context of the container that called
fsopen() so that it can translate identities correctly using whatever
userspace tools (ldap, sssd, winbind...) that the container has
configured.

> The reason I added that block there is that I didn't imagine anyone
> was running
> NFS in an unprivileged user namespace, and relying on upcalls
> (potentially into
> privileged namespaces) in order to do authz.
>
>
> > If you go ahead and change the user namespace on the client without
> > going through the mount process again to mount a different super
> > block
> > with a different user namespace, then you will now get the exact
> > same
> > behaviour as if you do that with any other filesystem.
>
> Not exactly, because other filesystems *only* use the s_user_ns for
> conversion
> of UIDs, whereas NFS uses the currend_cred() acquired at mount time,
> which
> doesn't match s_user_ns, leading to this behaviour.
>
> 1. Mistranslated UIDs in encoding RPCs
> 2. The UID / GID exposed to VFS do not match the user ns.
>
> >
> > --
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > trond.myklebust@xxxxxxxxxxxxxxx
> >
> >
> -Thanks,
> Sargun
>
> [1]:
> https://elixir.bootlin.com/linux/v5.9.8/source/fs/nfs/nfs4idmap.c#L782
> [2]:
> https://elixir.bootlin.com/linux/v5.9.8/source/fs/nfs/nfs4client.c#L1154

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx

Next message: Stanimir Varbanov: "[PATCH v2 4/8] venus: helpers: Calculate properly compressed buffer size"
Previous message: Ben Hutchings: "Re: [PATCH 4.19 19/71] btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()"
In reply to: Sargun Dhillon: "Re: [PATCH v4 0/2] NFS: Fix interaction between fs_context and user namespaces"
Next in thread: Sargun Dhillon: "Re: [PATCH v4 0/2] NFS: Fix interaction between fs_context and user namespaces"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]