Re: [PATCH] fix writing to the filesystem after unmount

From: Christian Brauner
Date: Wed Sep 06 2023 - 10:27:37 EST


On Wed, Sep 06, 2023 at 03:26:21PM +0200, Mikulas Patocka wrote:
> lvm may suspend any logical volume anytime. If lvm suspend races with
> unmount, it may be possible that the kernel writes to the filesystem after
> unmount successfully returned. The problem can be demonstrated with this
> script:
>
> #!/bin/sh -ex
> modprobe brd rd_size=4194304
> vgcreate vg /dev/ram0
> lvcreate -L 16M -n lv vg
> mkfs.ext4 /dev/vg/lv
> mount -t ext4 /dev/vg/lv /mnt/test
> dmsetup suspend /dev/vg/lv
> (sleep 1; dmsetup resume /dev/vg/lv) &
> umount /mnt/test
> md5sum /dev/vg/lv
> md5sum /dev/vg/lv
> dmsetup remove_all
> rmmod brd
>
> The script unmounts the filesystem and runs md5sum twice, the result is
> that these two invocations return different hash.
>
> What happens:
> * dmsetup suspend calls freeze_bdev, that goes to freeze_super and it
> increments sb->s_active
> * then we unmount the filesystem, we go to cleanup_mnt, cleanup_mnt calls
> deactivate_super, deactivate_super sees that sb->s_active is 2, so it
> decreases it to 1 and does nothing - the umount syscall returns
> successfully
> * now we have a mounted filesystem despite the fact that umount returned

That can happen for any number of reasons. Other codepaths might very
well still hold active references to the superblock. The same thing can
happen when you have your filesystem pinned in another mount namespace.

If you really want to be absolutely sure that the superblock is
destroyed you must use a mechanism like fanotify which allows you to get
notified on superblock destruction.

> @@ -1251,7 +1251,7 @@ static void cleanup_mnt(struct mount *mn
> }
> fsnotify_vfsmount_delete(&mnt->mnt);
> dput(mnt->mnt.mnt_root);
> - deactivate_super(mnt->mnt.mnt_sb);
> + wait_and_deactivate_super(mnt->mnt.mnt_sb);

Your patch means that we hang on any umount when the filesystem is
frozen. IOW, you'd also hang on any umount of a bind-mount. IOW, every
single container making use of this filesystems via bind-mounts would
hang on umount and shutdown.

You'd effectively build a deadlock trap for userspace when the
filesystem is frozen. And nothing can make progress until that thing is
thawed. Umount can't block if the block device is frozen.

> mnt_free_id(mnt);
> call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
> }
> Index: linux-2.6/fs/super.c
> ===================================================================
> --- linux-2.6.orig/fs/super.c 2023-09-05 21:09:16.000000000 +0200
> +++ linux-2.6/fs/super.c 2023-09-06 09:52:20.000000000 +0200
> @@ -36,6 +36,7 @@
> #include <linux/lockdep.h>
> #include <linux/user_namespace.h>
> #include <linux/fs_context.h>
> +#include <linux/delay.h>
> #include <uapi/linux/mount.h>
> #include "internal.h"
>
> @@ -365,6 +366,25 @@ void deactivate_super(struct super_block
> EXPORT_SYMBOL(deactivate_super);
>
> /**
> + * wait_and_deactivate_super - wait for unfreeze and drop a reference
> + * @s: superblock to deactivate
> + *
> + * Variant of deactivate_super(), except that we wait if the filesystem is
> + * frozen. This is required on unmount, to make sure that the filesystem is
> + * really unmounted when this function returns.
> + */
> +void wait_and_deactivate_super(struct super_block *s)
> +{
> + down_write(&s->s_umount);
> + while (s->s_writers.frozen != SB_UNFROZEN) {
> + up_write(&s->s_umount);
> + msleep(1);
> + down_write(&s->s_umount);
> + }
> + deactivate_locked_super(s);

That msleep(1) alone is a pretty nasty hack. We should definitely not
spin in code like this. That superblock could stay frozen for a long
time without s_umount held. So this is spinning.

Even if we wanted to do this it would need to use a similar wait
mechanism for the filesystem to be thawed like we do in
thaw_super_locked().