Re: [RFC PATCH v2 8/9] namespace: add sb_revalidate_bindmounts helper

From: Christian Brauner
Date: Mon Apr 03 2023 - 17:14:34 EST


On Mon, Apr 03, 2023 at 04:45:16PM +0200, Alexander Mikhalitsyn wrote:
> Useful if for some reason bindmounts root dentries get invalidated
> but it's needed to revalidate existing bindmounts without remounting.
>
> Cc: Miklos Szeredi <mszeredi@xxxxxxxxxx>
> Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
> Cc: Amir Goldstein <amir73il@xxxxxxxxx>
> Cc: Stéphane Graber <stgraber@xxxxxxxxxx>
> Cc: Seth Forshee <sforshee@xxxxxxxxxx>
> Cc: Christian Brauner <brauner@xxxxxxxxxx>
> Cc: Andrei Vagin <avagin@xxxxxxxxx>
> Cc: Pavel Tikhomirov <ptikhomirov@xxxxxxxxxxxxx>
> Cc: Bernd Schubert <bschubert@xxxxxxx>
> Cc: linux-fsdevel@xxxxxxxxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> Cc: criu@xxxxxxxxxx
> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@xxxxxxxxxxxxx>
> ---
> fs/namespace.c | 90 +++++++++++++++++++++++++++++++++++
> include/linux/mnt_namespace.h | 3 ++
> 2 files changed, 93 insertions(+)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index bc0f15257b49..b74d00d6abb0 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -568,6 +568,96 @@ static int mnt_make_readonly(struct mount *mnt)
> return ret;
> }
>
> +struct bind_mount_list_item {
> + struct list_head list;
> + struct vfsmount *mnt;
> +};
> +
> +/*
> + * sb_revalidate_bindmounts - Relookup/reset bindmounts root dentries
> + *
> + * Useful if for some reason bindmount root dentries get invalidated
> + * but it's needed to revalidate existing bindmounts without remounting.
> + */
> +int sb_revalidate_bindmounts(struct super_block *sb)

It's difficult to not strongly dislike this patchset on the basis of the
need for a function like this alone. And I just have to say it even if I
normally try not to do this: This whole function is bonkers in my opinion.

But leaving that aside for a second. This really needs detailed
explanations on locking, assumptions, and an explanation what you're
doing here. This looks crazy to me and definitely isn't fit to be a
generic helper in this form.

> +{
> + struct mount *mnt;
> + struct bind_mount_list_item *bmnt, *next;
> + int err = 0;
> + struct vfsmount *root_mnt = NULL;
> + LIST_HEAD(mnt_to_update);
> + char *buf;
> +
> + buf = (char *) __get_free_page(GFP_KERNEL);
> + if (!buf)
> + return -ENOMEM;
> +
> + lock_mount_hash();
> + list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
> + /* we only want to touch bindmounts */
> + if (mnt->mnt.mnt_root == sb->s_root) {
> + if (!root_mnt)
> + root_mnt = mntget(&mnt->mnt);
> +
> + continue;
> + }
> +
> + bmnt = kzalloc(sizeof(struct bind_mount_list_item), GFP_NOWAIT | __GFP_NOWARN);

Allocating under lock_mount_hash() even if doable with this flag
combination should be avoided at all costs imho. That just seems hacky.

> + if (!bmnt) {
> + err = -ENOMEM;
> + goto exit;

You're exiting the function with lock_mount_hash() held...

> + }
> +
> + bmnt->mnt = mntget(&mnt->mnt);
> + list_add_tail(&bmnt->list, &mnt_to_update);
> + }
> + unlock_mount_hash();

You've dropped unlock_mount_hash() and the function doesn't hold
namespace_lock() and isn't documented as requiring the caller to hold
it. And the later patch that uses this doesn't afaict (although I
haven't looked at any of the fuse specific stuff). So this is open to
all sorts of races with mount and unmount now. This needs an explanation
why that doesn't matter.

> +
> + /* TODO: get rid of this limitation */

Confused about this comment.

> + if (!root_mnt) {
> + err = -ENOENT;
> + goto exit;
> + }
> +
> + list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {

No one else can access that list so list_for_each_entry_safe() seems
pointless.

> + struct vfsmount *cur_mnt = bmnt->mnt;
> + struct path path;
> + struct dentry *old_root;
> + char *p;
> +
> + p = dentry_path(cur_mnt->mnt_root, buf, PAGE_SIZE);
> + if (IS_ERR(p))
> + goto exit;
> +
> + /* TODO: are these lookup flags fully safe and correct? */
> + err = vfs_path_lookup(root_mnt->mnt_root, root_mnt,
> + p, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT|LOOKUP_REVAL, &path);
> + if (err)
> + goto exit;
> +
> + /* replace bindmount root dentry */
> + lock_mount_hash();
> + old_root = cur_mnt->mnt_root;
> + cur_mnt->mnt_root = dget(path.dentry);

mnt->mnt_root isn't protected by lock_mount_hash(). It's invariant after
it has been set and a lot of code just assumes that it's stable.

Top of my hat, since you're not holding namespace_lock() mount
propagation can be going on concurrently so propagate_one() can check if
(!is_subdir(mp->m_dentry, m->mnt.mnt_root)) while you're happily
updating it. A lot of code could actually be touching mnt->mnt_root
while you're updating it.

There's probably a lot more issues with this I'm just not seeing without
spending more time on this. But NAK on this as it stands. Sorry.

> + dput(old_root);
> + unlock_mount_hash();
> +
> + path_put(&path);
> + }
> +
> +exit:
> + free_page((unsigned long) buf);
> + mntput(root_mnt);
> + list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {
> + list_del(&bmnt->list);
> + mntput(bmnt->mnt);
> + kfree(bmnt);
> + }
> +
> + return err;
> +}
> +EXPORT_SYMBOL_GPL(sb_revalidate_bindmounts);