Re: Feature request, "create on mount" to create mount point directory on mount, implied remove on unmount

From: jon
Date: Sun Jul 05 2015 - 19:36:36 EST


On Sun, 2015-07-05 at 18:39 +0100, Al Viro wrote:
> On Sun, Jul 05, 2015 at 04:46:50PM +0100, jon wrote:
>
> > I should have titled it "Feature request from a simple minded user"
> >
> > I have not the slightest idea what you are talking about.
> >
> > When I learnt *nix it did not have "name spaces" in reference to process
> > tables. I understand the theory of VMs a bit; the model in my mind is
> > that each "machine", be that one kernel on a real processor or a VM
> > instance, has "a process table", "a file descriptor table" and so on -
> > anything more is beyond my current level of knowledge.
>
> File descriptor table isn't something system-wide - it belongs to a process...
Ok, true... I guess it is not DOS or CP/M ;-)

>
> Containers are basically glorified process groups.
>
> Anyway, the underlying model hasn't changed much since _way_ back; each
> thread of execution is a virtual machine of its own, with actual CPUs
> switched between those.
Ok, not sure I quite follow. What do you mean by "virtual machine"?
My understanding was that a true VM has a hypervisor, and I thought it
also required some extra processor instructions to do an "outer"
context switch (plus some memory fiddling to fake up unique address
spaces), while the scheduler of the operating system inside the VM does
the "inner" context switch (i.e. push/pop all on an Intel-style CPU).
Not all architectures have any VM capability.
Are you talking only about kernels on Intel with SMP enabled?

> Each of them has memory, ports (== file descriptors) and traps
> (== signal handlers). The main primitives are:
> clone() (== rfork() in other branches; plain fork() is just the most
> common case) - create a copy of the virtual machine, in a state
> identical to that of the caller except for the different return values
> given to child and parent.
> exit() - terminate the virtual machine.
> execve() - load a new program.
Ok, I think I follow that.
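
To check I follow, the fork()/execve()/exit() trio would look something
like this in plain C (my sketch, not from your mail; /bin/echo is just
an example program):

/* fork() is the common case of clone(): copy the "virtual machine".
 * execve() loads a new program into it; exit() terminates it. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = fork();		/* copy of the caller's VM */
	if (pid < 0) {
		perror("fork");
		return 1;
	}
	if (pid == 0) {			/* child sees return value 0 */
		char *argv[] = { "echo", "hello from the child", NULL };
		execve("/bin/echo", argv, NULL); /* load a new program */
		perror("execve");	/* only reached on failure */
		_exit(127);		/* terminate this VM */
	}
	waitpid(pid, NULL, 0);		/* parent got the child's pid */
	return 0;
}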

> Parts of those virtual machines can be shared - e.g. you can have a
> descriptor table that is not just identical to the parent's at the time
> of clone(), but actually shared with it, so e.g. an open() in the child
> makes the resulting descriptor visible to the parent as well.
Ok, I follow you. I often don't need anything more complex than fork(),
and when I thread I use pthreads, so I have not dug around into what is
actually happening at the kernel level. I was not aware that the parent
could see file descriptors created by the child - is this always true,
or only true if the parent and child are explicitly a shared-memory
process?
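
If I am reading the man pages right, that sharing is opt-in via the
CLONE_FILES flag to clone(); plain fork() only copies the table. A
rough, untested sketch of what I think you mean (/etc/hostname is just
an example file):

/* clone() with CLONE_FILES shares the descriptor table, so an open()
 * done by the child lands in the table the parent sees too. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[64 * 1024];

static int child_fn(void *arg)
{
	/* This descriptor appears in the shared table. */
	return open("/etc/hostname", O_RDONLY) < 0;
}

int main(void)
{
	pid_t pid = clone(child_fn, child_stack + sizeof(child_stack),
			  CLONE_FILES | SIGCHLD, NULL);
	if (pid < 0) {
		perror("clone");
		return 1;
	}
	waitpid(pid, NULL, 0);
	/* The child's fd now sits in our table as well (e.g. visible
	 * under /proc/self/fd), because the table is one and the same. */
	return 0;
}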

> Or you can have memory (address space) shared,
> so that something like mmap() in the parent would affect the memory
> mappings of the child, etc. Which components are shared and which are
> copied is selected by the clone() flags argument.
OK.
I have used that to create parent/child processes with shared memory,
but I did cut&paste the initial code from a googled example rather than
apply any true skill ;-)
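
For what it's worth, the googled example was something along these
lines - a MAP_SHARED anonymous mapping set up before the fork() (my
reconstruction, not the original code):

/* A MAP_SHARED | MAP_ANONYMOUS mapping stays shared across fork(),
 * so the child's write is visible to the parent. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
			   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (shared == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	*shared = 0;
	if (fork() == 0) {		/* child */
		*shared = 42;		/* write through the mapping */
		_exit(0);
	}
	wait(NULL);
	printf("parent sees %d\n", *shared);	/* prints 42 */
	return 0;
}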

> unshare() allows switching to a private copy of chosen components
> - e.g. you might say "from now on, I want my file descriptor table to be
> private". In e.g. Plan 9 that's expressed via rfork() as well.
unshare() is new to me but I see the logic.
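
From the unshare(2) man page it looks pleasingly simple - something
like this, if I have it right:

/* After unshare(CLONE_FILES) this process has a private copy of its
 * descriptor table; fds opened from here on are no longer visible to
 * whoever we previously shared the table with. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	if (unshare(CLONE_FILES) < 0) {
		perror("unshare");
		return 1;
	}
	/* ... an open() here affects only our private table ... */
	return 0;
}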


> Less obvious components include the current directory and root. Normally,
> these are not shared; a chdir() done in the child won't affect the parent
> and vice versa. You could ask for them to be shared, though - for a
> multithreaded program that can be convenient.
OK.
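
So by default something like this holds, if I understand you (sketch;
/tmp is just an example directory):

/* The cwd is copied at fork(), so the child's chdir() does not move
 * the parent. Passing CLONE_FS to clone() would share it instead. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	char buf[256];
	if (fork() == 0) {
		chdir("/tmp");		/* affects only the child's copy */
		_exit(0);
	}
	wait(NULL);
	printf("parent still in %s\n", getcwd(buf, sizeof(buf)));
	return 0;
}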

>
> Different processes might see different parts of the mount tree since v7
> introduced chroot(2). Namespaces simply allow having a *forest* -
> different groups of processes seeing different mount trees in that
> forest. The same filesystem may be mounted in many places, and the same
> directory might be a mountpoint in an instance visible to one process
> and not a mountpoint in an instance visible to another (or a mountpoint
> with something entirely different mounted in an instance visible to
> somebody else).
Ok, I follow that. I have used chroot, but only very sparingly, and I
have never (to my knowledge) used a machine with the same filesystem
mounted onto multiple mount points, so I had not considered that.
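
If I have understood the forest idea, getting a private mount tree is
just this (untested sketch, needs root; the tmpfs on /mnt is only an
example):

/* Give this process a private copy of the mount tree; a filesystem
 * mounted here is invisible to processes in the original namespace. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	if (unshare(CLONE_NEWNS) < 0) {
		perror("unshare");
		return 1;
	}
	/* Keep mount events from propagating back to the old tree. */
	if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0) {
		perror("mount MS_PRIVATE");
		return 1;
	}
	/* Visible only in our copy of the mount tree. */
	if (mount("none", "/mnt", "tmpfs", 0, NULL) < 0) {
		perror("mount tmpfs");
		return 1;
	}
	return 0;
}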

> The mount tree is yet another component; the difference is that normally
> it *is* shared on clone(), rather than being copied. I.e. a mount() done
> by the child affects the mount tree visible to the parent. But you can
> still ask for a new private copy of the mount tree via clone() or
> unshare(). When the last process sharing that mount tree exits, it gets
> dissolved, the same as every file descriptor in a descriptor table gets
> closed when the last thread sharing that descriptor table exits (or asks
> for an unshared copy of the descriptor table, e.g. as a side effect of
> execve()). Just as close() does not necessarily close the open file the
> descriptor is connected to (that happens only when all descriptors
> connected to a given open file are closed), umount() does not
> necessarily shut the filesystem down; that happens only if it's not
> mounted elsewhere.
Ok, I follow that :-) But logically it must be done with two functions
or handlers or something, so I would assume that my proposed "remove
mount directory" would simply hang off whatever call truly discards the
filesystem from the kernel.

I thought the code for my feature might need to generate a warning if
the mount point has files in it (i.e. the rmdir fails on unmount) or if
the mount point lives in some read-only part of the directory tree. I
figured a few lines of code and a couple of kernel warnings would be
enough. I get from your explanation that things are a little more
complex than I maybe thought.
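
For what it's worth, the behaviour I was asking for can at least be
faked in userspace in a few lines (sketch; device, path and fs type
are just examples):

/* "Create on mount, remove on unmount" done in userspace: make the
 * mount point before mounting, remove it after unmounting, and warn
 * when the rmdir fails (leftover files, read-only parent, etc.). */
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	const char *mpt = "/mnt/usbdrive";	/* example path */

	if (mkdir(mpt, 0755) < 0)		/* "create on mount" */
		perror("mkdir (continuing anyway)");
	if (mount("/dev/sdb1", mpt, "vfat", 0, NULL) < 0) {
		perror("mount");
		rmdir(mpt);			/* clean up our directory */
		return 1;
	}
	/* ... use the filesystem ... */
	if (umount(mpt) < 0)
		perror("umount");
	else if (rmdir(mpt) < 0)		/* "remove on unmount" */
		perror("rmdir: mount point not removed");
	return 0;
}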

> With something like Plan 9 that would be pretty much all you need for
> isolating process groups into separate environments - just give each
> the set of filesystems they should be seeing and be done with that.
> We, unfortunately, can't drop certain FPOS APIs (starting with sockets,
> with their "network interfaces are magical sets of named objects, names
> are not expressed as pathnames, access control and visibility completely
> ad-hoc, ditto for listing and renaming" shite), so we get more
> state components ;-/ Which leads to e.g. "network namespace" and similar
> complications; that crap should've been dealt with in _filesystem_
> namespace, but Occam's Razor be damned, we need to support every
> misdesigned interface that got there, no matter how many entities it
> breeds and how convoluted the result becomes... In principle, though,
> it's still the same model - only with more components to be possibly
> shared.
OK, thanks for the explanation. I have never looked at Plan 9; I put it
in the same camp as Hurd - something that is interesting in theory but
that I will probably never live to see running on anything I use ;-)



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/