Re: [RFC PATCH 0/2] Loop device psuedo filesystem
From: Michael H. Warfield
Date: Wed May 28 2014 - 13:45:31 EST

On Wed, 2014-05-28 at 09:10 -0700, Andy Lutomirski wrote:
> On Wed, May 28, 2014 at 12:32 AM, Seth Forshee
> <seth.forshee@xxxxxxxxxxxxx> wrote:
> > On Tue, May 27, 2014 at 03:19:15PM -0700, Andy Lutomirski wrote:
> >> On Tue, May 27, 2014 at 2:58 PM, Seth Forshee
> >> <seth.forshee@xxxxxxxxxxxxx> wrote:
> >> > I'm posting these patches in response to the ongoing discussion of loop
> >> > devices in containers at [1].
> >> >
> >> > The patches implement a pseudo filesystem for loop devices, which will
> >> > allow use of loop devices in containers using standard utilities. Under
> >> > normal use a loopfs mount will initially contain a single device node
> >> > for loop-control which can be used to request and release loop devices.
> >> > Any devices allocated via this node will automatically appear in that
> >> > loopfs mount (and in devtmpfs) but not in any other loopfs mounts.
> >> > CAP_SYS_ADMIN in the userns of the process which performed the mount is
> >> > allowed to perform privileged loop ioctls on these devices.
> >> >
> >> > Alternatively, loopfs can be mounted with the hostmount option, intended
> >> > for mounting /dev/loop in the host. This is the default mount for any
> >> > devices not created via loop-control in a loopfs mount (e.g. devices
> >> > created during driver init, devices created via /dev/loop-control, etc).
> >> > This is only available to system-wide CAP_SYS_ADMIN.
> >> >
> >> > I still have some testing to do on these patches, but they work at
> >> > minimum for simple use cases. It's possible to use an unmodified losetup
> >> > if it's new enough to know about loop-control, with a couple of caveats:
> >> >
> >> > * /dev/loop-control must be symlinked to /dev/loop/loop-control
> >> > * In some cases losetup attempts to use /dev/loopN when the device node
> >> > is at /dev/loop/N. For example, 'losetup -f disk.img' fails.
> >> >
> >> > Device nodes for loop partitions are not created in loopfs. These
> >> > devices are created by the generic block layer, and the loop driver has
> >> > no way of knowing when they are created, so some kind of hook into the
> >> > driver will be needed to support this.
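
The request/release flow through loop-control described in the cover letter above can be sketched in userspace C. This is a minimal sketch, not code from the patches: LOOP_CTL_GET_FREE and LOOP_CTL_REMOVE are the existing loop-control ioctls from <linux/loop.h>, while the helper names (loop_request_free, loop_release, loopfs_node_path) are hypothetical, and the <mount>/<N> naming is an assumption taken from the /dev/loop/N example above.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/loop.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Ask a loop-control node for a free loop device index.
 * Returns the index on success, -1 on failure (errno is preserved). */
int loop_request_free(const char *control_path)
{
    int fd = open(control_path, O_RDWR);
    if (fd < 0)
        return -1;

    int index = ioctl(fd, LOOP_CTL_GET_FREE);
    int saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return index < 0 ? -1 : index;
}

/* Release loop device `index` through the same control node.
 * Returns 0 on success, -1 on failure (errno is preserved). */
int loop_release(const char *control_path, int index)
{
    int fd = open(control_path, O_RDWR);
    if (fd < 0)
        return -1;

    int ret = ioctl(fd, LOOP_CTL_REMOVE, (long)index);
    int saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return ret < 0 ? -1 : 0;
}

/* Build the node path inside a loopfs mount, e.g. "/dev/loop" + 3 gives
 * "/dev/loop/3". The "<mount>/<N>" layout is an assumption based on the
 * cover letter's example, not a documented interface.
 * Returns 0 on success, -1 if the arguments are invalid or buf is too small. */
int loopfs_node_path(char *buf, size_t len, const char *mount_dir, int index)
{
    assert(buf != NULL);
    if (mount_dir == NULL || strlen(mount_dir) == 0 || index < 0)
        return -1;

    int n = snprintf(buf, len, "%s/%d", mount_dir, index);
    if (n < 0 || (size_t)n >= len)
        return -1;
    return 0;
}
```

With loopfs mounted at /dev/loop, calling loop_request_free("/dev/loop/loop-control") and then loopfs_node_path(buf, sizeof buf, "/dev/loop", n) would give the path to hand to tools that otherwise assume /dev/loopN.
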
> >>
> >> This is entertaining and a bit terrifying :)
> >>
> >> ISTM that what you've done is to create a way for per-userns devices
> >> to live in a special filesystem and for userns containers to
> >> instantiate those devices by offloading all the hard work to the
> >> kernel.
> >>
> >> What if we generalized this?
> >>
> >> For example, we could add a concept of ephemeral devices. An
> >> ephemeral device is a device that can be referenced by an inode with a
> >> guarantee that the inode will *never* accidentally point to a
> >> different device [1]. Then we add a concept of the userns that owns a
> >> struct device.
> >>
> >> To make this safe, we'll need to make sure that old host udev will not
> >> see non-init-userns devices, ever. This is easy enough to do, but
> >> doing it elegantly might take some design work.
> >
> > To do this wouldn't we need a generic way to know which namespace a
> > device goes with? Greg has clearly stated that he doesn't want to do
> > this.
>
> This is IMO silly. If Greg doesn't want any kind of namespaces in the
> device core, then sticking considerably more complicated namespaces
> into the *loop* driver is just absurd.

Maybe so, maybe no, but it is what it is. Greg K-H has been very clear
and emphatic on this topic. He made it clear at LinuxPlumbers in NOLA
last year and he made it clear in this thread. He did admit to some use
cases which several of us presented and he did say he would be open to
patches in this limited case, which is what Seth is presenting. This is
working within the confines he has defined. We'll take what we can get.

> >> To make this useful, we'll need a way for things inside user
> >> namespaces to create the device nodes. I can imagine at least three
> >> ways to make this work.
> >>
> >> a) Allow mknod on a tmpfs created by a particular userns to succeed if
> >> the target struct device is owned by that userns or a child and if
> >> the caller is ns_capable(CAP_MKNOD).
> >> b) Create a new filesystem that has some special ioctl or whatever to do it.
> >> c) Have real per-user-ns devtmpfs.
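
The rule proposed in option (a) can be modelled in a few lines of C. This is a userspace sketch under stated assumptions, not kernel code: struct model_userns and struct model_device are hypothetical stand-ins for the kernel's user namespace and struct device, and the caller_has_cap_mknod flag stands in for the ns_capable() check mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a user namespace: only the parent link matters here. */
struct model_userns {
    const struct model_userns *parent; /* NULL for the initial namespace */
};

/* Hypothetical model of a device that records the namespace owning it. */
struct model_device {
    const struct model_userns *owner;
};

/* True if `ns` is `ancestor` itself or one of its descendants. */
bool userns_is_same_or_child(const struct model_userns *ns,
                             const struct model_userns *ancestor)
{
    for (; ns != NULL; ns = ns->parent) {
        if (ns == ancestor)
            return true;
    }
    return false;
}

/* Option (a): mknod on a tmpfs created by `fs_ns` succeeds only if the
 * target device is owned by `fs_ns` or a child of it, and the caller
 * holds CAP_MKNOD (modelled here as a plain boolean). */
bool model_may_mknod(const struct model_userns *fs_ns,
                     const struct model_device *dev,
                     bool caller_has_cap_mknod)
{
    assert(fs_ns != NULL && dev != NULL);
    return caller_has_cap_mknod &&
           userns_is_same_or_child(dev->owner, fs_ns);
}
```

In this model a container can create nodes for its own devices and for those of nested containers, but never for a device owned by the host namespace, which is the property the host udev concern above depends on.
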
> >>
> >> Now, to get loop working in a userns, we need a way for the userns (or
> >> the host!) to create a new loop-control device owned by that userns
> >> and we need to tweak the loop driver to make the created loop devices
> >> be owned by the userns.
> >
> > The patches I posted previously more or less did this using per-ns
> > devtmpfs, aside from the ephemeral part. The feedback was "just do it in
> > loop," so I sent these to facilitate discussing this option with
> > something concrete. I personally still like the per-ns devtmpfs
> > approach, but that's been nacked.
>
> The ephemeral part might not be needed using devtmpfs if devtmpfs can
> guarantee that the device nodes go away if the device goes away. I
> don't know whether it can make that guarantee.
>
> > (a) might be interesting, but I'd expect the same objections to be
> > raised as for (c). And it seems to me that (b) is just an alternate
> > interface for (a).
>
> True.
>
> >> (Note: I'm deliberately ignoring the fact that just doing this for
> >> loop seems to be almost entirely useless right now: you still can't
> >> mount the things.)
> >
> > You could also argue that it's useless to be able to mount things if you
> > have no block device on which to mount them. We have to start somewhere.
>
> True.
>
> But if we take this particular route, then I can imagine a real mess
> when someone wants to mount a non-loop device, and we get stuck on how
> to expose the device node. Sigh.

Then we deal with that horse when we have to make him sing. One way or
the other, we're trying to move forward.

> --Andy

Regards,
Mike
--
Michael H. Warfield (AI4NB) | (770) 978-7061 | mhw@xxxxxxxxxxxx
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!