Re: [PATCH] erofs: fix lockdep false positives on initializing erofs_pseudo_mnt

From: Gao Xiang
Date: Thu Mar 07 2024 - 04:20:48 EST


Hi Christian,

On 2024/3/7 17:17, Christian Brauner wrote:
On Thu, Mar 07, 2024 at 12:18:52PM +0800, Gao Xiang wrote:
Hi,

(try to +Cc Christian and Al here...)

On 2024/3/7 11:41, Jingbo Xu wrote:
Hi Baokun,

Thanks for catching this!


On 3/7/24 10:52 AM, Gao Xiang wrote:
Hi Baokun,

On 2024/3/7 10:44, Baokun Li wrote:
Lockdep reported the following issue when mounting erofs with a
domain_id:

============================================
WARNING: possible recursive locking detected
6.8.0-rc7-xfstests #521 Not tainted
--------------------------------------------
mount/396 is trying to acquire lock:
ffff907a8aaaa0e0 (&type->s_umount_key#50/1){+.+.}-{3:3},
                        at: alloc_super+0xe3/0x3d0

but task is already holding lock:
ffff907a8aaa90e0 (&type->s_umount_key#50/1){+.+.}-{3:3},
                        at: alloc_super+0xe3/0x3d0

other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&type->s_umount_key#50/1);
   lock(&type->s_umount_key#50/1);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

2 locks held by mount/396:
  #0: ffff907a8aaa90e0 (&type->s_umount_key#50/1){+.+.}-{3:3},
            at: alloc_super+0xe3/0x3d0
  #1: ffffffffc00e6f28 (erofs_domain_list_lock){+.+.}-{3:3},
            at: erofs_fscache_register_fs+0x3d/0x270 [erofs]

stack backtrace:
CPU: 1 PID: 396 Comm: mount Not tainted 6.8.0-rc7-xfstests #521
Call Trace:
  <TASK>
  dump_stack_lvl+0x64/0xb0
  validate_chain+0x5c4/0xa00
  __lock_acquire+0x6a9/0xd50
  lock_acquire+0xcd/0x2b0
  down_write_nested+0x45/0xd0
  alloc_super+0xe3/0x3d0
  sget_fc+0x62/0x2f0
  vfs_get_super+0x21/0x90
  vfs_get_tree+0x2c/0xf0
  fc_mount+0x12/0x40
  vfs_kern_mount.part.0+0x75/0x90
  kern_mount+0x24/0x40
  erofs_fscache_register_fs+0x1ef/0x270 [erofs]
  erofs_fc_fill_super+0x213/0x380 [erofs]

This is because the file_system_type of both erofs and the pseudo-mount
point of domain_id is erofs_fs_type, so two successive calls to
alloc_super() are considered to be using the same lock and trigger the
warning above.

Therefore add a nodev file_system_type named erofs_anon_fs_type to
silence this complaint. In addition, to reduce code coupling, refactor
out the erofs_anon_init_fs_context() and erofs_kill_pseudo_sb() functions
and move the erofs_pseudo_mnt related code to fscache.c.

Signed-off-by: Baokun Li <libaokun1@xxxxxxxxxx>

IMHO, in the beginning, I'd like to avoid introducing another fs type
for erofs to share (meta)data between filesystems since it will cause
churn, could we use some alternative way to resolve this?

Yeah as Gao Xiang said, this is initially intended to avoid introducing
anothoer file_system_type, say erofs_anon_fs_type.

What we need is actually a method of allocating anonymous inode as a
sentinel identifying each blob. There is indeed a global mount, i.e.
anon_inode_mnt, for allocating anonymous inode/file specifically. At
the time the share domain feature is introduced, there's only one
anonymous inode, i.e. anon_inode_inode, and all the allocated anonymous
files are bound to this single anon_inode_inode. Thus we decided to
implement a erofs internal pseudo mount for this usage.

But I noticed that we can now allocate unique anonymous inodes from
anon_inode_mnt since commit e7e832c ("fs: add LSM-supporting anon-inode
interface"), though the new interface is initially for LSM usage.

Yes, as summary, EROFS now maintains a bunch of anon inodes among
all different filesystem instances, so that like

blob sharing or
page cache sharing across filesystems can be done.

In brief, I think the following patch is a good idea but it
hasn't been landed until now:
https://lore.kernel.org/r/20210309155348.974875-3-hch@xxxxxx

Other than that, is it a good idea to introduce another fs type
(like erofs_anon_fs_type) for such usage?

It depends. If you're allocating a lot of inodes then having a separate
filesystem type for erofs makes sense. If it's just a few then it
probably doesn't matter. If you need custom inode operations for these
anonymous inodes then it also makes sense to have a separate filesystem
type.

Yeah, I think some time this year we will finish a formal
page cache sharing design and implementation for both bdev
and fscache mode.

So a separate filesystem type seems more reasonable in the
future, thanks for your confirmation!

Thanks,
Gao Xiang