Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

From: J. Bruce Fields
Date: Mon Feb 23 2015 - 09:52:42 EST

On Sat, Feb 21, 2015 at 11:58:58AM +0800, Ian Kent wrote:
> On Fri, 2015-02-20 at 14:05 -0500, J. Bruce Fields wrote:
> > On Fri, Feb 20, 2015 at 12:07:15PM -0600, Eric W. Biederman wrote:
> > > "J. Bruce Fields" <bfields@xxxxxxxxxxxx> writes:
> > >
> > > > On Fri, Feb 20, 2015 at 05:33:25PM +0800, Ian Kent wrote:
> > >
> > > >> The case of nfsd state-recovery might be similar but you'll need to help
> > > >> me out a bit with that too.
> > > >
> > > > Each network namespace can have its own virtual nfs server. Servers can
> > > > be started and stopped independently per network namespace. We decide
> > > > which server should handle an incoming rpc by looking at the network
> > > > namespace associated with the socket that it arrived over.
> > > >
> > > > A server is started by the rpc.nfsd command writing a value into a magic
> > > > file somewhere.
> > >
> > > nit. Unless I am completely turned around that file is on the nfsd
> > > filesystem, that lives in fs/nfsd/nfs.c.
> > >
> > > So I believe this really is a case of figuring out what we want the
> > > semantics to be for mount and propagating the information down from
> > > mount to where we call the user mode helpers.
> >
> > Oops, I agree. So when I said:
> >
> > The upcalls need to happen consistently in one context for a
> > given virtual nfs server, and that context should probably be
> > derived from rpc.nfsd's somehow.
> >
> > Instead of "rpc.nfsd's", I think I should have said "the mounter of
> > the nfsd filesystem".
> >
> > Which is already how we choose a net namespace: nfsd_mount and
> > nfsd_fill_super store the current net namespace in s_fs_info. (And then
> > grep for "netns" to see the places where that's used.)
>
> This is going to be mostly a restatement of what's already been said,
> partly for me to refer back to later and partly to clarify and confirm
> what I need to do, so prepare to be bored.
>
> As a result of Oleg's recommendations and comments, the next version of
> the series will take a reference to an nsproxy and a user namespace
> (from the init process of the calling task, while it's still a child of
> that task); it won't carry around task structs. There are still a couple
> of questions with this so it's not quite there yet.
>
> We'll have to wait and see if what I've done is enough to remedy Oleg's
> concerns too. LOL, and then there's the question of how much I'll need
> to do to get it to actually work.
>
> The other difference is obtaining the context (now nsproxy and user
> namespace) has been taken entirely within the usermode helper. I think
> that's a good thing from the calling process isolation requirement. That
> may need to change again based on the discussion here.
>
> Now we're starting to look at actual usage it's worth keeping in mind
> that how to execute within required namespaces has to be sound before we
> tackle use cases that have requirements over this fundamental
> functionality.
>
> There are a couple of things to think about.
>
> One thing that's needed is how to work out if the UMH_USE_NS is needed
> and another is how to provide persistent usage of particular
> namespaces across containers. The latter probably will relate to the
> origin of the file system (which looks like it will be identified at
> mount time).
>
> The first case is when the mount originates in the root init namespace
> and most of the time (if not all the time) the UMH_USE_NS doesn't need
> to be set and the helper should run in the root init namespace.

The helper always runs in the original mount's container. Sometimes
that container is the init container, yes, but I don't see what value
there is in setting a flag in that one case.

> That should work for mount propagation as well with mounts bound into a
> container.
>
> Is this also true for automounted mounts at mount point crossing? Or
> perhaps I should ask, should automounted NFS mounts inherit the property
> from their parent mount?

Yes. If we instead ran separate helpers in each container, then the
superblocks would also have to be separate (so that one container can't
poison cached values used by another). So the containers would all end
up with entirely separate superblocks for the submounts.

That seems inefficient at least, and I don't think it's what an admin
would expect as the default behavior.

> The second case is when the mount originates in another namespace,
> possibly a container. TBH I haven't thought too much about mounts that
> originate from namespaces created by unshare(1) or other source yet. I'm
> hoping that will just work once this is done, ;)

So, one container mounts and spawns a "subcontainer" which continues to
use that filesystem? Yes, I think helpers should continue to run in the
container of the original mount, I don't see any tricky exception here.
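
To make the rule concrete, here is a minimal userspace sketch, not
kernel code. Every name in it (helper_ctx, model_sb, model_mount,
model_upcall_ctx, model_umount) is hypothetical and made up for
illustration. It only models the idea discussed above: take a reference
to the mounter's context at mount time, use that context for every
upcall regardless of who triggers it, and drop the reference when the
mount goes away.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the saved context (e.g. nsproxy + user ns). */
struct helper_ctx {
	int id;        /* which container this is; illustration only */
	int refcount;  /* simplified, non-atomic reference count */
};

/* Hypothetical stand-in for a superblock. */
struct model_sb {
	struct helper_ctx *ctx;  /* recorded once, at mount time */
};

static struct helper_ctx *ctx_get(struct helper_ctx *c)
{
	c->refcount++;
	return c;
}

static void ctx_put(struct helper_ctx *c)
{
	assert(c->refcount > 0);
	c->refcount--;
}

/* Mount: remember the mounter's context and pin it. */
static void model_mount(struct model_sb *sb, struct helper_ctx *mounter)
{
	sb->ctx = ctx_get(mounter);
}

/*
 * Upcall: return the context the helper would run in. The caller's own
 * context is deliberately ignored; only the original mount's counts.
 */
static struct helper_ctx *model_upcall_ctx(const struct model_sb *sb,
					   const struct helper_ctx *caller)
{
	(void)caller;
	return sb->ctx;
}

/* Unmount: cleanup is tied to freeing the object that stores the context. */
static void model_umount(struct model_sb *sb)
{
	ctx_put(sb->ctx);
	sb->ctx = NULL;
}
```

With this model, a mount made in one container and later used from a
subcontainer (or from the init container) still resolves to the
mounter's context, so there is no special case and no flag to set for
the init container.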

> The last time I tried binding NFS mounts from one container into another
> it didn't work,

I'm not sure what you mean by "binding NFS mounts from one container
into another". What exactly didn't work?


> but if we assume that will work at some point then, as
> Bruce points out, we need to provide the ability to record the
> namespaces to be used for subsequent "in namespace" execution while
> maintaining caller isolation (ie. derived from the caller's init
> process).
>
> I've been aware of the need for persistence for a while now and I've
> been thinking about how to do it but I don't have a clear plan quite
> yet. Bruce, having noticed this, has described details about the
> environment I have to work with so that's a start. I need the thoughts
> of others on this too.
>
> As a result I'm not sure yet if this persistence can be integrated into
> the current implementation or if additional calls will be needed to set
> and clear the namespace information while maintaining the needed
> isolation.
>
> As Bruce says, perhaps the namespace information should be saved as
> properties of a mount or perhaps it should be a list keyed by some
> handle, the handle being the saved property. I'm not sure yet but the
> latter might be unnecessary complication and overhead. The cleanup of the
> namespace information upon summary termination of processes could be a
> bit difficult, but perhaps it will be as simple as making it a function
> of freeing of the object it's stored in (in the cases we have so far
> that would be the mount).
>
> So, yes, I've still got a fair way to go yet, ;)
>
> Ian