Re: [RFC] Expose request_module via syscall

From: Christian Brauner
Date: Wed Sep 22 2021 - 08:25:33 EST


On Mon, Sep 20, 2021 at 11:36:47AM -0700, Andy Lutomirski wrote:
> On Mon, Sep 20, 2021 at 11:16 AM Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote:
> >
> > On Mon, Sep 20, 2021 at 04:51:19PM +0200, Thomas Weißschuh wrote:
>
> > > > Do you mean it literally invokes /sbin/modprobe? If so, hooking this
> > > > at /sbin/modprobe and calling out to the container manager seems like
> > > > a decent solution.
> > >
> > > Yes it does. Thanks for the idea, I'll see how this works out.
> >
> > Would documentation guiding you in that way have helped? If so
> > I welcome a patch that does just that.
>
> If someone wants to make this classy, we should probably have the
> container counterpart of a standardized paravirt interface. There
> should be a way for a container to, in a runtime-agnostic way, issue
> requests to its manager, and requesting a module by (name, Linux
> kernel version for which that name makes sense) seems like an
> excellent use of such an interface.

I always thought of this in terms of the two ways we currently do this:

1. Caller-transparent container manager requests.
This is the seccomp notifier: we transparently handle syscalls,
including intercepting init_module(), where we parse the module to be
loaded out of the container's syscall args and, if it is allow-listed,
load it for the container; otherwise we either continue the syscall and
let it fail, or fail it directly through the seccomp return value. (A
rough sketch of the filter setup follows after the second point.)

2. A process in the container explicitly calling out to the container
manager.
One example of how this happens is systemd-nspawn, via dbus messages
between systemd in the container and systemd outside the container to
e.g. allocate a new terminal in the container (kinda insecure, but
that's another issue) or other stuff.
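
For reference, the kernel side of (1) is SECCOMP_RET_USER_NOTIF. A
rough sketch of how the in-container filter gets installed could look
like this (the helper name is mine, and a real filter would also pin
the architecture and cover finit_module()):

#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Forward init_module() to a user-space supervisor instead of
 * executing it in the kernel. */
static int install_notify_filter(void)
{
	struct sock_filter filter[] = {
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_init_module, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	/*
	 * The returned fd goes to the container manager, which poll()s
	 * it and uses SECCOMP_IOCTL_NOTIF_RECV/SECCOMP_IOCTL_NOTIF_SEND
	 * to read the syscall args, load the module if allow-listed and
	 * inject the return value.
	 */
	return syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
		       SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}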

So what was your idea: would it be something like a device file that
could be exposed to the container, through which it writes requests to
the container manager? What would be the advantage over just
standardizing a socket protocol, which is what we do, for example (it
doesn't do module loading of course, as we handle that differently):

## Container to host communication
LXD sets up a socket at `/dev/lxd/sock` which root in the container can
use to communicate with LXD on the host.

In LXD, this feature is implemented through a /dev/lxd/sock node which
is created and set up for all LXD instances.

This file is a Unix socket which processes inside the instance can
connect to. It's multi-threaded so multiple clients can be connected at
the same time.

Implementation details
LXD on the host binds /var/lib/lxd/devlxd/sock and starts listening for
new connections on it.

This socket is then exposed into every single instance started by LXD at
/dev/lxd/sock.

The single socket is required so we can exceed 4096 instances;
otherwise LXD would have to bind a different socket for every instance,
quickly reaching the FD limit.
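
On the host side that is really just a single unix listener; roughly
(path and helper name here are only for illustration, the real
implementation is in Go):

#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* One host-side listener; every instance then gets it exposed as
 * /dev/lxd/sock. */
static int bind_devlxd(const char *path)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	int fd;

	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	unlink(path);
	strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(fd, SOMAXCONN) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}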

Authentication
Queries on /dev/lxd/sock will only return information related to the
requesting instance. To figure out where a request comes from, LXD will
extract the initial socket ucred and compare that to the list of
instances it manages.
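
That is the usual SO_PEERCRED dance on the accepted connection
(sketch; the helper name is mine):

#define _GNU_SOURCE /* for struct ucred */
#include <sys/socket.h>

/* After accept(): ask the kernel who is on the other end so the
 * manager can match the pid against the instances it spawned. */
static int peer_of(int connfd, struct ucred *cred)
{
	socklen_t len = sizeof(*cred);

	return getsockopt(connfd, SOL_SOCKET, SO_PEERCRED, cred, &len);
}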

Protocol
The protocol on /dev/lxd/sock is plain-text HTTP with JSON messaging, so
very similar to the local version of the LXD protocol.

Unlike the main LXD API, there is no background operation and no
authentication support in the /dev/lxd/sock API.
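
So from inside an instance it's nothing more than plain HTTP over a
unix socket, e.g. curl --unix-socket /dev/lxd/sock http://lxd/1.0 from
a shell, or by hand (the endpoint path here is just for illustration):

#include <stdio.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX,
				    .sun_path = "/dev/lxd/sock" };
	const char req[] =
		"GET /1.0 HTTP/1.1\r\nHost: lxd\r\nConnection: close\r\n\r\n";
	char buf[4096];
	ssize_t n;
	int fd;

	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		return 1;

	/* Plain-text HTTP request out, JSON comes back on the same fd. */
	write(fd, req, sizeof(req) - 1);
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, n, stdout);
	close(fd);
	return 0;
}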

Christian