Re: [PATCH 0/5 RFC] Add an interface to discover relationships between namespaces

From: Andrew Vagin
Date: Tue Aug 02 2016 - 05:49:22 EST


On Fri, Jul 29, 2016 at 01:05:48PM -0500, Eric W. Biederman wrote:
> "Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes:
>
> > Hi Eric,
> >
> > On 07/28/2016 02:56 PM, Eric W. Biederman wrote:
> >> "Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes:
> >>
> >>> On 07/26/2016 10:39 PM, Andrew Vagin wrote:
> >>>> On Tue, Jul 26, 2016 at 09:17:31PM +0200, Michael Kerrisk (man-pages) wrote:
> >>
> >>>> If we want to compare two file descriptors of the current process,
> >>>> it is one of cases for which kcmp can be used. We can call kcmp to
> >>>> compare two namespaces which are opened in other processes.
> >>>
> >>> Is there really a use case there? I assume we're talking about the
> >>> scenario where a process in one namespace opens a /proc/PID/ns/*
> >>> file descriptor and passes that FD to another process via a UNIX
> >>> domain socket. Is that correct?
> >>>
> >>> So, supposing that we want to build a map of the relationships
> >>> between namespaces using the proposed kcmp() API, and there are
> >>> say N namespaces? Does this mean we make (N * (N-1) / 2) calls
> >>> to kcmp()?
> >>
> >> Potentially. The numbers are small enough O(N^2) isn't fatal.
> >
> > Define "small", please.
> >
> > O(N^2) makes me nervous about what other use cases lurk out
> > there that may get bitten by this.
>
> Worst case for N (One namespace per thread) is about 60k.
> A typical heavy use case may be 1000 namespaces of any type.
> So we are talking about O(N^2) that rarely happens and should be done in
> a couple of seconds.
>
> >> Where kcmp shines is that it allows migration to happen, inode numbers
> >> to change (which they very much will today), and things to still work.
> >
> >
> >> We can keep it O(Nlog(N)) by taking advantage of not just the equality
> >> but the ordering relationship. Although Ugh.
> >
> > Yes, that sounds pretty ugly...
>
> Actually having thought about this a little more if kcmp returns an
> ordering by inode and migration preserves the relative order of
> the inodes (which should just be a creation order) it should be quite
> solvable.
>
> Switch from an order by inode number to an order by object creation
> time, and guarantee that all creations have an order (which with
> task_list_lock we practically already have) and it should be even easier
> to create. (A 64bit nanosecond resolution timestamp is good for 584
> years of uptime). A 64bit number that increments each time an object is
> created should have an even better lifespan.
>
> I don't know if we can find a way to give that guarantee for other kcmp
> comparisons but it is worth a thought.
>
> >> One disadvantage of
> >> kcmp currently is that the way the ordering relationship is defined
> >> the order is not preserved over migration :(
> >
> > So, does kcmp() fully solve the problem(s) at hand? It sounds like
> > not, if I understand your last point correctly.
>
> There are 3 possibilities I see for migration in migration, ordered
> in order of implementation difficulty.
> 1) Have a clear signal that migration happened and a nested migration
> needs to restart.
> 2) Use kcmp so that only the relative order needs to be preserved.
> 3) Preserve the device number and inode numbers.
>
> At a practical level I think (2) may actually in net be the simplest.
> It requires a little more care to implement and you have to opt in,
> but it should not require any rolling back of activity (merely careful
> ordering of object creation).
>
> I definitely like kcmp knowing how to compare things by inode
> (aka st_dev, st_ino) because then even if you have to restart
> the comparisons after a migration the exact details you are comparing
> are hidden and so it is easier to support and harder to get wrong.
>
> I can imagine how to preserve inode numbers by creating a new instance
> of nsfs and using the old inode numbers upon restore. I don't
> currently see how we could possibly preserve st_dev over migration short of
> a device number namespace.

I think we can avoid comparing st_dev if we compare the inode numbers
of the parent user namespaces.

Namespaces look like a tree in which user namespaces are directories and
other namespaces are files.

A namespace can be described by a path in this imaginary file system,
which looks like /userns1/userns2/XXXns.

In this case we need to guarantee unique names inside each directory and
that they will not change over migration.
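
To make the path idea concrete, here is a minimal userspace sketch. It is
only a model: struct ns_node and ns_path() are hypothetical names invented
for illustration, not kernel structures or an existing API, and the "name"
field stands in for whatever per-directory identifier (for example an inode
number) would be kept stable over migration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory model of a namespace; not a kernel structure. */
struct ns_node {
	const char *name;		/* unique inside the owning user namespace */
	const struct ns_node *owner;	/* owning user namespace, NULL at the root */
};

/*
 * Write a path such as "/userns1/userns2/XXXns" into buf.
 * Returns the path length, or -1 if buf is too small or a name
 * contains '/', which would make the path ambiguous.
 */
static int ns_path(const struct ns_node *ns, char *buf, size_t size)
{
	int len = 0;
	int n;

	assert(ns != NULL && buf != NULL);

	if (strchr(ns->name, '/'))
		return -1;

	/* Emit the owning user namespaces first, root to leaf. */
	if (ns->owner) {
		len = ns_path(ns->owner, buf, size);
		if (len < 0)
			return -1;
	}

	n = snprintf(buf + len, size - (size_t)len, "/%s", ns->name);
	if (n < 0 || (size_t)n >= size - (size_t)len)
		return -1;

	return len + n;
}
```

Two namespaces are then the same if and only if their paths are equal, and
st_dev never enters the comparison.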

>
> So if we are going to continue with making device numbers be a legacy
> attribute applications should not care about, we need a way to compare
> things without looking at st_dev. Which brings us back to kcmp.
>
> Hmm. Hotplugging a disk and plugging it back in will likely change the
> device number and give the same kind of challenge with st_dev (although
> you can't keep a file descriptor open across that kind of event). So
> certainly a hotplug event on a device should be enough to say don't care
> about the device number.
>
> Eric
>
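
As for the O(N log(N)) idea quoted earlier, a rough sketch of how userspace
could use an ordering comparison instead of N * (N-1) / 2 pairwise checks is
below. Everything here is an assumption for illustration: struct ns_handle,
ns_cmp() and count_distinct() are hypothetical, and the "seq" field merely
stands in for the result of an ordering-aware kcmp(), which is only being
discussed in this thread. Note also that the C standard does not promise
any particular complexity for qsort(), although common implementations
stay close to N log(N) comparisons.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical handle; seq models a creation-order counter. */
struct ns_handle {
	int fd;
	unsigned long long seq;
};

/* Counts how many comparisons (would-be kcmp() calls) were made. */
static unsigned long cmp_calls;

/* Stand-in for an ordering kcmp(): returns <0, 0 or >0. */
static int ns_cmp(const void *a, const void *b)
{
	const struct ns_handle *x = a;
	const struct ns_handle *y = b;

	cmp_calls++;
	return (x->seq > y->seq) - (x->seq < y->seq);
}

/*
 * Sort the handles with the ordering comparison, then count distinct
 * namespaces by comparing each element with its neighbour only.
 */
static size_t count_distinct(struct ns_handle *h, size_t n)
{
	size_t i, distinct;

	assert(h != NULL || n == 0);

	if (n == 0)
		return 0;

	qsort(h, n, sizeof(*h), ns_cmp);

	distinct = 1;
	for (i = 1; i < n; i++)
		if (ns_cmp(&h[i - 1], &h[i]) != 0)
			distinct++;

	return distinct;
}
```

After the sort, handles that refer to the same namespace are adjacent, so
one extra linear pass is enough to group them.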