[PATCH 1/2] binfmt_misc: cleanup on filesystem umount

From: Christian Brauner
Date: Thu Oct 28 2021 - 06:32:22 EST


From: Christian Brauner <christian.brauner@xxxxxxxxxx>

Currently, registering a new binary type pins the binfmt_misc
filesystem. Specifically, this means that as long as there is at least
one binary type registered the binfmt_misc filesystem survives all
umounts, i.e. the superblock is not destroyed. Meaning that a umount
followed by another mount will end up with the same superblock and the
same binary type handlers. This is a behavior we tend to discourage for
any new filesystems (apart from a few special filesystems such as e.g.
configfs or debugfs). A umount operation without the filesystem being
pinned - by e.g. someone holding a file descriptor to an open file -
should usually result in the destruction of the superblock and all
associated resources. This makes introspection easier and leads to
clearly defined, simple and clean semantics. An administrator can rely
on the fact that a umount will guarantee a clean slate making it
possible to reinitialize a filesystem. Right now all binary types would
need to be explicitly deleted before that can happen.

This allows us to remove the heavy-handed calls to simple_pin_fs() and
simple_release_fs() when creating and deleting binary types. This in
turn allows us to replace the current brittle pinning mechanism abusing
dget() which has caused a range of bugs judging from prior fixes in [2]
and [3]. The additional dget() in load_misc_binary() pins the dentry but
only does so for the sake to prevent ->evict_inode() from freeing the
node when a user removes the binary type and kill_node() is run. Which
would mean ->interpreter and ->interp_file would be freed causing a UAF.

This isn't really nicely documented nor is it very clean because it
relies on simple_pin_fs() pinning the filesystem as long as at least one
binary type exists. Otherwise it would cause load_misc_binary() to hold
on to a dentry belonging to a superblock that has been shutdown.
Replace that implicit pinning with a clean and simple per-node refcount
and get rid of the ugly dget() pinning. A similar mechanism exists for
e.g. binderfs (cf. [4]). All the cleanup work can now be done in
->evict_inode().

In a follow-up patch we will make it possible to use binfmt_misc in
sandboxes. We will use the cleaner semantics where a umount for the
filesystem will cause the superblock and all resources to be
deallocated. In preparation for this apply the same semantics to the
initial binfmt_misc mount. Note, that this is a user-visible change and
as such a uapi change but one that we can reasonably risk. We've
discussed this in earlier versions of this patchset (cf. [1]).

The main user and provider of binfmt_misc is systemd. Systemd provides
binfmt_misc via autofs since it is configurable as a kernel module and
is used by a few exotic packages and users. As such a binfmt_misc mount
is triggered when /proc/sys/fs/binfmt_misc is accessed and is only
provided on demand. Other autofs on demand filesystems include EFI ESP
which systemd umounts if the mountpoint stays idle for a certain amount
of time. This doesn't apply to the binfmt_misc autofs mount which isn't
touched once it is mounted meaning this change can't accidently wipe
binary type handlers without someone having explicitly unmounted
binfmt_misc. After speaking to systemd folks they don't expect this
change to affect them.

In line with our general policy, if we see a regression for systemd or
other users with this change we will switch back to the old behavior for
the initial binfmt_misc mount and have binary types pin the filesystem
again. But while we touch this code let's take the chance and let's
improve on the status quo.

[1]: https://lore.kernel.org/r/20191216091220.465626-2-laurent@xxxxxxxxx
[2]: commit 43a4f2619038 ("exec: binfmt_misc: fix race between load_misc_binary() and kill_node()"
[3]: commit 83f918274e4b ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()")
[4]: commit f0fe2c0f050d ("binder: prevent UAF for binderfs devices II")
Cc: Sargun Dhillon <sargun@xxxxxxxxx>
Cc: Serge Hallyn <serge@xxxxxxxxxx>
Cc: Jann Horn <jannh@xxxxxxxxxx>
Cc: Henning Schild <henning.schild@xxxxxxxxxxx>
Cc: Andrei Vagin <avagin@xxxxxxxxx>
Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
Cc: Laurent Vivier <laurent@xxxxxxxxx>
Cc: linux-fsdevel@xxxxxxxxxxxxxxx
Signed-off-by: Christian Brauner <christian.brauner@xxxxxxxxxx>
---
fs/binfmt_misc.c | 56 +++++++++++++++++++++++++++++++-----------------
1 file changed, 36 insertions(+), 20 deletions(-)

diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index e1eae7ea823a..5a9d5e44c750 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -60,12 +60,11 @@ typedef struct {
char *name;
struct dentry *dentry;
struct file *interp_file;
+ refcount_t ref;
} Node;

static DEFINE_RWLOCK(entries_lock);
static struct file_system_type bm_fs_type;
-static struct vfsmount *bm_mnt;
-static int entry_count;

/*
* Max length of the register string. Determined by:
@@ -126,6 +125,16 @@ static Node *check_file(struct linux_binprm *bprm)
return NULL;
}

+/* Free node if we are sure load_misc_binary() is done with it. */
+static void put_node(Node *e)
+{
+ if (refcount_dec_and_test(&e->ref)) {
+ if (e->flags & MISC_FMT_OPEN_FILE)
+ filp_close(e->interp_file, NULL);
+ kfree(e);
+ }
+}
+
/*
* the loader itself
*/
@@ -142,8 +151,9 @@ static int load_misc_binary(struct linux_binprm *bprm)
/* to keep locking time low, we copy the interpreter string */
read_lock(&entries_lock);
fmt = check_file(bprm);
+ /* Make sure the node isn't freed behind our back. */
if (fmt)
- dget(fmt->dentry);
+ refcount_inc(&fmt->ref);
read_unlock(&entries_lock);
if (!fmt)
return retval;
@@ -198,7 +208,16 @@ static int load_misc_binary(struct linux_binprm *bprm)

retval = 0;
ret:
- dput(fmt->dentry);
+
+ /*
+ * If we actually put the node here all concurrent calls to
+ * load_misc_binary() will have finished. We also know
+ * that for the refcount to be zero ->evict_inode() must have removed
+ * the node to be deleted from the list. All that is left for us is to
+ * close and free.
+ */
+ put_node(fmt);
+
return retval;
}

@@ -557,26 +576,29 @@ static void bm_evict_inode(struct inode *inode)
{
Node *e = inode->i_private;

- if (e && e->flags & MISC_FMT_OPEN_FILE)
- filp_close(e->interp_file, NULL);
-
clear_inode(inode);
- kfree(e);
+
+ if (e) {
+ write_lock(&entries_lock);
+ list_del_init(&e->list);
+ write_unlock(&entries_lock);
+ put_node(e);
+ }
}

static void kill_node(Node *e)
{
struct dentry *dentry;

- write_lock(&entries_lock);
- list_del_init(&e->list);
- write_unlock(&entries_lock);
-
+ /*
+ * It's fine to unconditionally drop the dentry since ->evict_inode()
+ * will check the refcount before freeing the node and so it can't go
+ * away behind load_misc_binary()'s back.
+ */
dentry = e->dentry;
drop_nlink(d_inode(dentry));
d_drop(dentry);
dput(dentry);
- simple_release_fs(&bm_mnt, &entry_count);
}

/* /<entry> */
@@ -683,13 +705,7 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
if (!inode)
goto out2;

- err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
- if (err) {
- iput(inode);
- inode = NULL;
- goto out2;
- }
-
+ refcount_set(&e->ref, 1);
e->dentry = dget(dentry);
inode->i_private = e;
inode->i_fop = &bm_entry_operations;

base-commit: 3906fe9bb7f1a2c8667ae54e967dc8690824f4ea
--
2.30.2