Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

From: Benjamin Coddington
Date: Mon Feb 23 2015 - 20:22:27 EST



On Tue, 24 Feb 2015, Ian Kent wrote:

> On Mon, 2015-02-23 at 09:52 -0500, J. Bruce Fields wrote:
> > On Sat, Feb 21, 2015 at 11:58:58AM +0800, Ian Kent wrote:
> > > On Fri, 2015-02-20 at 14:05 -0500, J. Bruce Fields wrote:
> > > > On Fri, Feb 20, 2015 at 12:07:15PM -0600, Eric W. Biederman wrote:
> > > > > "J. Bruce Fields" <bfields@xxxxxxxxxxxx> writes:
> > > > >
> > > > > > On Fri, Feb 20, 2015 at 05:33:25PM +0800, Ian Kent wrote:
> > > > >
> > > > > >> The case of nfsd state-recovery might be similar but you'll need to help
> > > > > >> me out a bit with that too.
> > > > > >
> > > > > > Each network namespace can have its own virtual nfs server. Servers can
> > > > > > be started and stopped independently per network namespace. We decide
> > > > > > which server should handle an incoming rpc by looking at the network
> > > > > > namespace associated with the socket that it arrived over.
> > > > > >
> > > > > > A server is started by the rpc.nfsd command writing a value into a magic
> > > > > > file somewhere.
> > > > >
> > > > > nit. Unless I am completely turned around that file is on the nfsd
> > > > > filesystem, that lives in fs/nfsd/nfs.c.
> > > > >
> > > > > So I bevelive this really is a case of figuring out what we want the
> > > > > semantics to be for mount and propogating the information down from
> > > > > mount to where we call the user mode helpers.
> > > >
> > > > Oops, I agree. So when I said:
> > > >
> > > > The upcalls need to happen consistently in one context for a
> > > > given virtual nfs server, and that context should probably be
> > > > derived from rpc.nfsd's somehow.
> > > >
> > > > Instead of "rpc.nfsd's", I think I should have said "the mounter of
> > > > the nfsd filesystem".
> > > >
> > > > Which is already how we choose a net namespace: nfsd_mount and
> > > > nfsd_fill_super store the current net namespace in s_fs_info. (And then
> > > > grep for "netns" to see the places where that's used.)
> > >
> > > This is going to be mostly a restatement of what's already been said,
> > > partly for me to refer back to later and partly to clarify and confirm
> > > what I need to do, so prepare to be bored.
> > >
> > > As a result of Oleg's recommendations and comments, the next version of
> > > the series will take a reference to an nsproxy and a user namespace
> > > (from the init process of the calling task, while it's still a child of
> > > that task), it won't carry around task structs. There are still a couple
> > > of questions with this so it's not quite there yet.
> > >
> > > We'll have to wait and see if what I've done is enough to remedy Oleg's
> > > concerns too. LOL, and then there's the question of how much I'll need
> > > to do to get it to actually work.
> > >
> > > The other difference is obtaining the context (now nsproxy and user
> > > namspace) has been taken entirely within the usermode helper. I think
> > > that's a good thing from the calling process isolation requirement. That
> > > may need to change again based on the discussion here.
> > >
> > > Now we're starting to look at actual usage it's worth keeping in mind
> > > that how to execute within required namespaces has to be sound before we
> > > tackle use cases that have requirements over this fundamental
> > > functionality.
> > >
> > > There are a couple of things to think about.
> > >
> > > One thing that's needed is how to work out if the UMH_USE_NS is needed
> > > and another is how to provide provide persistent usage of particular
> > > namespaces across containers. The later probably will relate to the
> > > origin of the file system (which looks like it will be identified at
> > > mount time).
> > >
> > > The first case is when the mount originates in the root init namespace
> > > and most of the time (if not all the time) the UMH_USE_NS doesn't need
> > > to be set and the helper should run in the root init namspace.
> >
> > The helper always runs in the original mount's container. Sometimes
> > that container is the init container, yes, but I don't see what value
> > there is in setting a flag in that one case.
>
> Yep. that's pretty much what I meant.
>
> >
> > > That
> > > should work for mount propagation as well with mounts bound into a
> > > container.
> > >
> > > Is this also true for automounted mounts at mount point crossing? Or
> > > perhaps I should ask, should automounted NFS mounts inherit the property
> > > from their parent mount?
> >
> > Yes. If we run separate helpers in each container, then the superblocks
> > should also be separate (so that one container can't poison cached
> > values used by another). So the containers would all end up with
> > entirely separate superblocks for the submounts.
>
> That's almost what I was thinking.
>
> The question relates to a mount for which the namespace proxy would have
> been set at mount time in a container and then bound into another
> container (in Docker by using the "--volumes-from <name>"). I believe
> the namespace information from the original mount should always be used
> when calling a usermode helper. This might not be a sensible question
> now but I think it needs to be considered.
>
> >
> > That seems inefficient at least, and I don't think it's what an admin
> > would expect as the default behavior.
>
> LOL, but the best way to manage this is to set the namespace information
> at mount time (as Eric mentioned long ago) and use that everywhere. It's
> consistent and it provides a way for a process with appropriate
> privilege to specify the namespace information.
>
> >
> > > The second case is when the mount originates in another namespace,
> > > possibly a container. TBH I haven't thought too much about mounts that
> > > originate from namespaces created by unshare(1) or other source yet. I'm
> > > hoping that will just work once this is done, ;)
> >
> > So, one container mounts and spawns a "subcontainer" which continues to
> > use that filesystem? Yes, I think helpers should continue to run in the
> > container of the original mount, I don't see any tricky exception here.
>
> That's what I think should happen too.
>
> >
> > > The last time I tried binding NFS mounts from one container into another
> > > it didn't work,
> >
> > I'm not sure what you mean by "binding NFS mounts from one container
> > into another". What exactly didn't work?
>
> It's the volumes-from Docker option I'm thinking of.
> I'm not sure now if my statement is accurate. I'll need to test it
> again. I thought I had but what didn't work with the volumes-from might
> have been autofs not NFS mounts.
>
> Anyway, I'm going to need to provide a way for clients to say "calculate
> the namespace information and give me an identifier so it can be used
> everywhere for this mount" which amounts to maintaining a list of the
> namespace objects.

That sounds a lot closer to some of the work I've been doing to see if I can
come up with a way to solve the "where's the namespace I need?" problem.

I agree with Greg's very early comments that the easiest way to determine
which namespace context a process should use is to keep it as a copy of
the task -- and the place that copy should be done is fork(). The
problem was where to keep that information and how to make it reusable.

I've been hacking out a keyrings-based "key-agent" service that is basically
a special type of key (like a keyring). A key_agent type roughly
corresponds to a particular type of upcall user, such as the idmapper. A
key_agent_type is registered, and that registration ties a particular
key_type to that key_agent. When a process calls request_key() for that
key_type instead of using the helper to execute /sbin/request-key the
process' keyrings are searched for a key_agent. If a key_agent isn't found,
the key_agent provider is then asked to provide an existing one based on
some rules (is there an existing key_agent running in a different namespace
that we might want to use for this purpose -- for example: is there there
one already running in the namespace where the mount occurred). If so, it
is linked to the calling process' keyrings and then used for the upcall. If
not, then the calling process itself is forked/execve-ed into a new
persistent key_agent that is installed on the calling process' keyrings just
like a key, and with the same lifetime and GC expectations of a key.

A key_agent is a user-space process waiting for a realtime signal to process a
particular key and provide the requested key information that can be
installed back onto the calling process' keyrings.

Basically, this approach allows a particular user of a keyrings-based upcall
to specify their own rules about how to provide a namespace context for a
calling process. It does, however, require extra work to create a specific
key_agent_type for each individual key_type that might want to use this
mechanism.

I've been waiting to have a bit more of a proof-of-concept before bringing
this approach into the discussion. However, it looks like it may be
important to allow particular users of the upcall their own rules about
which namespace contexts they might want to use. This approach could
provide that flexibility.

Ben



> I'm not sure yet if I should undo some of what I've done recently or
> leave it for users who need a straight "execute me now within the
> current namespace".
>
> >
> > --b.
> >
> > > but if we assume that will work at some point then, as
> > > Bruce points out, we need to provide the ability to record the
> > > namespaces to be used for subsequent "in namespace" execution while
> > > maintaining caller isolation (ie. derived from the callers init
> > > process).
> > >
> > > I've been aware of the need for persistence for a while now and I've
> > > been thinking about how to do it but I don't have a clear plan quite
> > > yet. Bruce, having noticed this, has described details about the
> > > environment I have to work with so that's a start. I need the thoughts
> > > of others on this too.
> > >
> > > As a result I'm not sure yet if this persistence can be integrated into
> > > the current implementation or if additional calls will be needed to set
> > > and clear the namespace information while maintaining the needed
> > > isolation.
> > >
> > > As Bruce says, perhaps the namespace information should be saved as
> > > properties of a mount or perhaps it should be a list keyed by some
> > > handle, the handle being the saved property. I'm not sure yet but the
> > > later might be unnecessary complication and overhead. The cleanup of the
> > > namespace information upon summary termination of processes could be a
> > > bit difficult, but perhaps it will be as simple as making it a function
> > > of freeing of the object it's stored in (in the cases we have so far
> > > that would be the mount).
> > >
> > > So, yes, I've still got a fair way to go yet, ;)
> > >
> > > Ian
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/