Re: [PATCH 0/8] loopfs

From: Stéphane Graber
Date: Wed Apr 08 2020 - 12:41:59 EST


On Wed, Apr 8, 2020 at 12:24 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
>
> On Wed, Apr 8, 2020 at 5:23 PM Christian Brauner
> <christian.brauner@xxxxxxxxxx> wrote:
> > One of the use-cases for loopfs is to allow to dynamically allocate loop
> > devices in sandboxed workloads without exposing /dev or
> > /dev/loop-control to the workload in question and without having to
> > implement a complex and also racy protocol to send around file
> > descriptors for loop devices. With loopfs each mount is a new instance,
> > i.e. loop devices created in one loopfs instance are independent of any
> > loop devices created in another loopfs instance. This allows
> > sufficiently privileged tools to have their own private stash of loop
> > device instances. Dmitry has expressed his desire to use this for
> > syzkaller in a private discussion. And various parties that want to use
> > it are Cced here too.
> >
> > In addition, the loopfs filesystem can be mounted by user namespace root
> > and is thus suitable for use in containers. Combined with syscall
> > interception this makes it possible to securely delegate mounting of
> > images on loop devices, i.e. when a user calls mount -o loop <image>
> > <mountpoint> it will be possible to completely setup the loop device.
> > The final mount syscall to actually perform the mount will be handled
> > through syscall interception and be performed by a sufficiently
> > privileged process. Syscall interception is already supported through a
> > new seccomp feature we implemented in [1] and extended in [2] and is
> > actively used in production workloads. The additional loopfs work will
> > be used there and in various other workloads too. You'll find a short
> > illustration how this works with syscall interception below in [4].
>
> Would that privileged process then allow you to mount your filesystem
> images with things like ext4? As far as I know, the filesystem
> maintainers don't generally consider "untrusted filesystem image" to
> be a strongly enforced security boundary; and worse, if an attacker
> has access to a loop device from which something like ext4 is mounted,
> things like "struct ext4_dir_entry_2" will effectively be in shared
> memory, and an attacker can trivially bypass e.g.
> ext4_check_dir_entry(). At the moment, that's not a huge problem (for
> anything other than kernel lockdown) because only root normally has
> access to loop devices.
>
> Ubuntu carries an out-of-tree patch that afaik blocks the shared
> memory thing: <https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/eoan/commit?id=4bc428fdf5500b7366313f166b7c9c50ee43f2c4>
>
> But even with that patch, I'm not super excited about exposing
> filesystem image parsing attack surface to containers unless you run
> the filesystem in a sandboxed environment (at which point you don't
> need a loop device anymore either).

So in general we certainly agree that you should never let someone you
wouldn't trust with root on the host mount real kernel filesystems
through syscall interception.

But that's not all that our syscall interception logic can do. We have
support for rewriting a normal filesystem mount attempt to instead use
an available FUSE implementation. As far as the user is concerned,
they ran "mount /dev/sdaX /mnt" and got that ext4 filesystem mounted
on /mnt as requested, except that the container manager intercepted
the mount attempt and instead spawned fuse2fs for that mount. This
requires absolutely no change to the software the user is running.

loopfs, with that interception mode, will also let us handle all cases
where a loop device would be used, similarly without needing any change
to the software being run. If a piece of software runs the command
"mount -o loop blah.img /mnt", the "mount" command will set up a loop
device as it normally would (doing so through loopfs) and will then
call the "mount" syscall, which will get intercepted and redirected to
a FUSE implementation if so configured, resulting in the expected
filesystem being mounted for the user.
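
For illustration only, the loop setup step that "mount -o loop" performs
before it reaches the "mount" syscall looks roughly like the sketch
below. The ioctl numbers are the ones from <linux/loop.h>; the attach_loop
helper and its loop_dir parameter are hypothetical names, and the idea
that a loopfs mount exposes "loop-control" and "loopN" entries the same
way /dev does is an assumption, not something confirmed by this thread.

```python
import fcntl
import os

# ioctl request numbers from <linux/loop.h>
LOOP_SET_FD = 0x4C00        # bind a backing file to a loop device
LOOP_CTL_GET_FREE = 0x4C82  # ask loop-control for a free device index


def attach_loop(image_path, loop_dir="/dev"):
    """Bind image_path to a free loop device and return the device path.

    loop_dir is hypothetical: "/dev" on a normal host, or (assumed) the
    mount point of a loopfs instance inside a container.
    """
    ctl_fd = os.open(os.path.join(loop_dir, "loop-control"),
                     os.O_RDWR | os.O_CLOEXEC)
    try:
        # The free index comes back as the ioctl return value.
        index = fcntl.ioctl(ctl_fd, LOOP_CTL_GET_FREE)
    finally:
        os.close(ctl_fd)

    dev_path = os.path.join(loop_dir, "loop%d" % index)
    img_fd = os.open(image_path, os.O_RDWR | os.O_CLOEXEC)
    try:
        dev_fd = os.open(dev_path, os.O_RDWR | os.O_CLOEXEC)
        try:
            # The binding outlives these descriptors until LOOP_CLR_FD.
            fcntl.ioctl(dev_fd, LOOP_SET_FD, img_fd)
        finally:
            os.close(dev_fd)
    finally:
        os.close(img_fd)
    return dev_path
```

Only after this step does the tool call mount(2) on the returned device,
and that call is the one the container manager gets to intercept.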

LXD with syscall interception offers both straight-up privileged
mounting using the kernel filesystem and mounting through a FUSE-based
implementation. This is configurable on a per-filesystem and
per-container basis.
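
A minimal sketch of what such a per-filesystem, per-container decision
could look like is below. It is not LXD's code: the function name, the
config layout, and the rule that a FUSE mapping wins over the kernel
allow-list are all assumptions made for the example. Only the
ext4-to-fuse2fs pairing comes from the text above.

```python
def choose_mount_handler(container_cfg, fstype):
    """Decide how an intercepted mount of `fstype` should be served.

    container_cfg is a hypothetical per-container dict with two keys:
      "fuse":   maps a filesystem type to a FUSE helper binary
      "kernel": set of filesystem types allowed as real kernel mounts
    Returns ("fuse", helper), ("kernel", fstype) or ("deny", None).
    """
    fuse_helpers = container_cfg.get("fuse", {})
    kernel_allowed = container_cfg.get("kernel", set())

    # Assumption: prefer the FUSE helper when both are configured, since
    # it keeps image parsing out of the kernel.
    if fstype in fuse_helpers:
        return ("fuse", fuse_helpers[fstype])
    if fstype in kernel_allowed:
        return ("kernel", fstype)
    return ("deny", None)


# Hypothetical containers: one untrusted, one trusted with root-equivalent.
untrusted = {"fuse": {"ext4": "fuse2fs"}}
trusted = {"kernel": {"ext4"}}
```

With these two example configs, an ext4 mount in the untrusted container
is handed to fuse2fs, the same mount in the trusted container goes to the
kernel, and any filesystem type listed in neither is refused.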

I hope that clarifies what we're doing here :)

Stéphane