Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

From: J. Bruce Fields
Date: Fri Feb 20 2015 - 12:26:06 EST


On Fri, Feb 20, 2015 at 05:33:25PM +0800, Ian Kent wrote:
> On Wed, 2015-02-18 at 20:31 -0500, J. Bruce Fields wrote:
> > On Thu, Feb 19, 2015 at 08:39:01AM +0800, Ian Kent wrote:
> > > On Wed, 2015-02-18 at 15:59 -0500, J. Bruce Fields wrote:
> > > > On Wed, Feb 18, 2015 at 12:31:32PM -0500, J. Bruce Fields wrote:
> > > > > On Wed, Feb 18, 2015 at 12:06:20PM -0500, J. Bruce Fields wrote:
> > > > > > On Fri, Feb 06, 2015 at 09:47:25AM +0800, Ian Kent wrote:
> > > > > > > On Thu, 2015-02-05 at 15:14 +0000, David Howells wrote:
> > > > > > > >
> > > > > > > > > + /* If running within a container use the container namespace */
> > > > > > > > > + if (current->nsproxy->net_ns != &init_net)
> > > > > > > >
> > > > > > > > Is that a viable check? Is it possible to have a container that shares
> > > > > > > > networking details?
> > > > > > >
> > > > > > > That's up for discussion.
> > > > > > >
> > > > > > > I thought about it and concluded that the check is probably not
> > > > > > > sufficient for any of the cases.
> > > > > > >
> > > > > > > I left it like that because I'm not sure exactly what the use cases are,
> > > > > > > hoping it would promote discussion, and here we are.
> > > > > > >
> > > > > > > I also think the current container environments don't necessarily
> > > > > > > share a net namespace with the root init net namespace, because
> > > > > > > they are containers, ;)
> > > > > > >
> > > > > > > TBH I haven't looked at the user space container creation code but I
> > > > > > > expect it could be done that way if it was needed, so the answer is yes
> > > > > > > and no, ;)
> > > > > > >
> > > > > > > The questions then are: do we need to check anything else, what
> > > > > > > environment should the callback use in the different cases, and what
> > > > > > > other cases might break if we change it?
> > > > > > >
> > > > > > > For example, should the fs namespace also be checked for all of these
> > > > > > > cases, since we're executing a callback, or is whatever it's set to in
> > > > > > > the container always what's required for locating the executable?
> > > > > >
> > > > > > What would be the disadvantage of setting UMH_USE_NS unconditionally
> > > > > > here?
> > > > >
> > > > > In the nfs idmapping case, the mapping is per-nfs_client.
> > > > >
> > > > > Can nfs_idmap_new be the one that calls umh_get_init_task, with the
> > > > > corresponding put done in nfs_idmap_delete, or is there some reason that
> > > > > doesn't work?
> > > >
> > > > It's confusing sorting out possible use cases, but I think both of these
> > > > are reasonable:
> > > >
> > > > - mount an nfs filesystem from the host, then spawn containers
> > > > that all use that filesystem.
> > > > - mount an nfs filesystem from within a container.
> > > >
> > > > Your approach might work for the second, but in the first case we'll end
> > > > up with idmappers from multiple containers all trying to do the
> > > > idmapping for the shared filesystem.
> > >
> > > These patches are examples for context.
> > >
> > > Working out whether to run in a namespace or not was always going to be
> > > difficult, specifically for the case you point out. Maybe we can make
> > > use of some other information, perhaps namespace information in the
> > > super block, or something else; or perhaps we will need to add some
> > > information for this. I'm not sure yet. We'll need to work together on
> > > that.
> > >
> > > TBH, I'm not that focused on the use cases because the base
> > > implementation is still undergoing significant change, although I
> > > believe the use of a flag to request namespace execution is a good
> > > approach. That probably won't change.
> >
> > The flag requests that we use the container of the currently executing
> > task. In neither the nfs idmapper nor the nfsd state-recovery case is
> > that the correct choice.
>
> There's a bit more to do, but I've done the changes arising from Oleg's
> comments, so I've turned my attention to the recent discussion here.

Happy to take a look at the revised series and take a shot at the nfs
cases if it would help.

> Please understand that I added the example usage patches largely to get
> this sort of comment because I was pretty sure there would be more to it
> than what was in the patches.

Understood.

> I still don't fully understand the requirements, so let's talk about
> them, starting with the former case above.
>
> I think you're right that something needs to be saved in order to be
> able to make a decision at access time in a container. That might be as
> simple as working out whether nn->umh_flags needs UMH_USE_NS at a
> different time, say at mount time. Taking a reference to something like
> the net namespace probably isn't necessary.
>
> But I don't understand the relationship between the net namespace (e.g.
> is there a dependency between rpc requests and what the helper needs to
> do), the user in a container, and the idmapper operation. Can you help
> with that, please?

Take the case where you mount an nfs filesystem and then spawn
containers that use it.

The on-the-wire nfs protocol uses names like user@domain to represent
file owners. The kernel of course needs an integer it can stick in
i_uid. So the upcall here is to ask userspace how to map the
on-the-wire name to a local uid.
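For illustration, the lookup is roughly this shape (a sketch only,
modeled on the id_resolver key type in fs/nfs/idmap.c, not the actual
code; IDMAP_DESC_MAX and idmap_read_uid_payload() are made-up names):

/* Sketch: resolve an on-the-wire "user@domain" name to a uid by
 * upcalling to userspace with request_key().  The key description is
 * "uid:<name>"; /sbin/request-key runs the idmapper, which
 * instantiates the key with the numeric id. */
static int name_to_uid_sketch(const char *name, uid_t *uid)
{
        char desc[IDMAP_DESC_MAX];      /* hypothetical bound */
        struct key *rkey;
        int ret;

        if (snprintf(desc, sizeof(desc), "uid:%s", name) >= sizeof(desc))
                return -EINVAL;

        rkey = request_key(&key_type_id_resolver, desc, "");
        if (IS_ERR(rkey))
                return PTR_ERR(rkey);

        /* the real code reads the id out of the key payload */
        ret = idmap_read_uid_payload(rkey, uid);        /* hypothetical */
        key_put(rkey);
        return ret;
}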

The result is stored in inodes that will be shared between all the
containers. So the mapping has to be consistent across all of them.

So, if some user in a container does a stat() on this filesystem, we
can't do the mapping in that user's container. We have to use some
context that's the same for every user of the filesystem.

So, as I said, maybe:

- call umh_get_init_task from nfs_idmap_new, store the result in
a new idmap->idmap_init_task
- pass idmap_init_task to some request_key variant somehow in
nfs_idmap_request_key.
- clean up in nfs_idmap_delete

nfs_idmap_new is called at mount time, so you'll be getting the init
task of the container that the mount was originally done in. Sounds
like the least surprising result to me from the user's point of view.
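Roughly, as a sketch of the idea only (umh_get_init_task() is from your
series; idmap_init_task and request_key_with_task() are made-up names
for whatever the final interface ends up looking like):

/* in nfs_idmap_new(), after the existing setup: */
        /* capture the init task of the mount-time container */
        idmap->idmap_init_task = umh_get_init_task();

/* in nfs_idmap_request_key(), roughly: */
        rkey = request_key_with_task(&key_type_id_resolver, desc, "",
                                     idmap->idmap_init_task);

/* and in nfs_idmap_delete(), before the existing teardown: */
        put_task_struct(idmap->idmap_init_task);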

I don't know if the task is actually the right thing to take the
reference on, or if it's OK to hold that reference for the life of the
filesystem. I think the important point is that the request_key upcalls
need to all happen consistently in one context for a given filesystem,
and that context should probably be derived from the mount's somehow.

> The case of nfsd state-recovery might be similar but you'll need to help
> me out a bit with that too.

Each network namespace can have its own virtual nfs server. Servers can
be started and stopped independently per network namespace. We decide
which server should handle an incoming rpc by looking at the network
namespace associated with the socket that it arrived over.
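(That dispatch is the usual per-net pattern; roughly:

        /* find the per-net nfsd state for the socket an rpc arrived on */
        struct net *net = sock_net(sk);
        struct nfsd_net *nn = net_generic(net, nfsd_net_id);

so each virtual server's state hangs off the socket's namespace.)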

A server is started by the rpc.nfsd command writing a value into a magic
file somewhere. Part of the code that handles that write ends up
calling nfsd4_umh_cltrack_create, which is our chance to capture
anything from the rpc.nfsd process. So grabbing the right init task
there might be the thing to do?

The actual call_usermodehelper() calls happen later as part of
processing rpcs. At that point you're running in a kernel thread, and I
assume umh_get_init_task() is always just going to return the global
init.

I think you'd need to add an nfsd4_umh_tracking_ops.exit to do the
cleanup on nfsd shutdown.
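Sketched out (nn->cltrack_init_task is a hypothetical per-net field,
and umh_get_init_task() is from your series):

/* in the path reached from rpc.nfsd's write, per the above: */
        nn->cltrack_init_task = umh_get_init_task();

/* new .exit op, wired into the umh tracking ops, to drop the
 * reference at nfsd shutdown: */
static void nfsd4_umh_cltrack_exit(struct net *net)
{
        struct nfsd_net *nn = net_generic(net, nfsd_net_id);

        if (nn->cltrack_init_task) {
                put_task_struct(nn->cltrack_init_task);
                nn->cltrack_init_task = NULL;
        }
}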

Anyway, the important point again is that the process making the
call_usermodehelper() call doesn't provide useful context. The upcalls
need to happen consistently in one context for a given virtual nfs
server, and that context should probably be derived from rpc.nfsd's
somehow.

--b.