Re: question about bd_inode hashing against device_add() // Re: [PATCH 03/11] block: call bdev_add later in device_add_disk

From: Gao Xiang

Date: Fri Oct 31 2025 - 08:25:56 EST




On 2025/10/31 20:23, Gao Xiang wrote:


On 2025/10/31 18:12, Gao Xiang wrote:
Hi Greg,

On 2025/10/31 17:58, Greg Kroah-Hartman wrote:
On Fri, Oct 31, 2025 at 05:54:10PM +0800, Gao Xiang wrote:


On 2025/10/31 17:45, Christoph Hellwig wrote:

...

But why does the device node
get created earlier?  My assumption was that it would only be
created by the KOBJ_ADD uevent.  Adding the device model maintainers
as my little dig through the core drivers/base/ code doesn't find
anything to the contrary, but maybe I don't fully understand it.

AFAIK, device_add() is what triggers devtmpfs file
creation, and the issue can be observed when frequently
hotplugging devices in a VM and mounting them.  Currently
I don't have a time slot to build an easy reproducer,
but I think it's a real issue anyway.

As I say above, that's not normal, and you have to be root to do this,
I just spent some time reproducing this with dynamic loop devices,
and it's actually easy if an msleep() is inserted artificially;
the diff is below:

diff --git a/block/bdev.c b/block/bdev.c
index 810707cca970..a4273b5ad456 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -821,7 +821,7 @@ struct block_device *blkdev_get_no_open(dev_t dev, bool autoload)
     struct inode *inode;

     inode = ilookup(blockdev_superblock, dev);
-    if (!inode && autoload && IS_ENABLED(CONFIG_BLOCK_LEGACY_AUTOLOAD)) {
+    if (0) {
         blk_request_module(dev);
         inode = ilookup(blockdev_superblock, dev);
         if (inode)
diff --git a/block/genhd.c b/block/genhd.c
index 9bbc38d12792..3c9116fdc1ce 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -428,6 +428,8 @@ static void add_disk_final(struct gendisk *disk)
     set_bit(GD_ADDED, &disk->state);
 }

+#include <linux/delay.h>
+
 static int __add_disk(struct device *parent, struct gendisk *disk,
               const struct attribute_group **groups,
               struct fwnode_handle *fwnode)
@@ -497,6 +499,9 @@ static int __add_disk(struct device *parent, struct gendisk *disk,
     if (ret)
         goto out_free_ext_minor;

+    if (disk->major == LOOP_MAJOR)
+        msleep(2500);           // delay 2.5s for all loops
+
     ret = disk_alloc_events(disk);
     if (ret)
         goto out_device_del;


(Note that I masked off CONFIG_BLOCK_LEGACY_AUTOLOAD
 for cleaner ftrace below.)

and then

# uname -a  (patched 6.18-rc1 kernel)

```
Linux 7e5b4b5f5181 6.18.0-rc1-dirty #25 SMP PREEMPT_DYNAMIC Fri Oct 31 19:52:10 CST 2025 x86_64 GNU/Linux
```

# truncate -s 1g test.img; mkfs.ext4 -F test.img;
# losetup /dev/loop999 test.img & sleep 1; ls -l /dev/loop999; strace mount -t ext4 /dev/loop999 mnt 2>&1 | grep fsconfig

It shows

```
brw------- 1 root root 7, 999 Oct 31 20:06 /dev/loop999
fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/loop999", 0) = 0
fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = -1 ENXIO (No such device or address)  // unexpected
```

then

# losetup /dev/loop996 test.img & sleep 1; stat /dev/loop996; trace-cmd record -p function_graph mount -t ext4 /dev/loop996 mnt &> /dev/null

It shows
```
  File: /dev/loop996
  Size: 0               Blocks: 0          IO Block: 4096   block special file
Device: 0,6     Inode: 429         Links: 1     Device type: 7,996
Access: (0600/brw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2025-10-31 20:07:54.938474868 +0800
Modify: 2025-10-31 20:07:54.938474868 +0800
Change: 2025-10-31 20:07:54.938474868 +0800
 Birth: 2025-10-31 20:07:54.938474868 +0800
```

but

# trace-cmd report | grep mount | less
           mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                bdev_file_open_by_dev() {
           mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                  bdev_permission() {
           mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                    devcgroup_check_permission() {
           mount-561   [007] ...1.   240.180513: funcgraph_entry:                   |                      __rcu_read_lock() {
           mount-561   [007] ...1.   240.180514: funcgraph_exit:         0.193 us   |                      } (ret=0x1)
           mount-561   [007] ...1.   240.180514: funcgraph_entry:                   |                      match_exception_partial() {
           mount-561   [007] ...1.   240.180514: funcgraph_exit:         0.199 us   |                      } (ret=0x0)
           mount-561   [007] ...1.   240.180514: funcgraph_entry:                   |                      __rcu_read_unlock() {
           mount-561   [007] ...1.   240.180515: funcgraph_exit:         0.202 us   |                      } (ret=0x0)
           mount-561   [007] ...1.   240.180515: funcgraph_exit:         1.602 us   |                    } (ret=0x0)
           mount-561   [007] ...1.   240.180515: funcgraph_exit:         2.100 us   |                  } (ret=0x0)
           mount-561   [007] ...1.   240.180515: funcgraph_entry:                   |                  ilookup() {
           mount-561   [007] ...1.   240.180516: funcgraph_entry:                   |                    __cond_resched() {
           mount-561   [007] ...1.   240.180516: funcgraph_exit:         0.194 us   |                    } (ret=0x0)
           mount-561   [007] ...1.   240.180516: funcgraph_entry:                   |                    find_inode_fast() {
           mount-561   [007] ...1.   240.180516: funcgraph_entry:                   |                      __rcu_read_lock() {
           mount-561   [007] ...1.   240.180516: funcgraph_exit:         0.195 us   |                      } (ret=0x1)
           mount-561   [007] ...1.   240.180517: funcgraph_entry:                   |                      __rcu_read_unlock() {
           mount-561   [007] ...1.   240.180517: funcgraph_exit:         0.193 us   |                      } (ret=0x0)
           mount-561   [007] ...1.   240.180517: funcgraph_exit:         1.060 us   |                    } (ret=0x0)
           mount-561   [007] ...1.   240.180517: funcgraph_exit:         1.970 us   |                  } (ret=0x0)
           mount-561   [007] ...1.   240.180518: funcgraph_exit:         4.818 us   |                } (ret=-6)

here -6 (-ENXIO) is unexpected.

The problematic code path I mentioned is in device_add():

upstream code:

loop_control_ioctl
 loop_add
   add_disk_fwnode
     __add_disk
       devtmpfs_create_node   // here create devtmpfs blkdev file, but racy
     add_disk_final
       bdev_add
         insert_inode_hash    // just seen by bdev_file_open_by_dev()
       disk_uevent(disk, KOBJ_ADD)

minor revision:

loop_control_ioctl
 loop_add
   add_disk_fwnode
     __add_disk
       device_add
         devtmpfs_create_node   // here create devtmpfs blkdev file, but racy
     add_disk_final
       bdev_add
         insert_inode_hash    // just seen by bdev_file_open_by_dev()
       disk_uevent(disk, KOBJ_ADD)
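
To make the window in the call chain above concrete, here is a minimal userspace toy model of the ordering. It is only a sketch: `struct toy_disk`, `toy_open()` and `toy_demo()` are hypothetical names made up for illustration, not kernel APIs, and the two flags merely stand in for "devtmpfs node created" and "bdev inode hashed".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model only: these names are hypothetical, not kernel APIs. */
struct toy_disk {
	bool node_visible;	/* stands in for devtmpfs_create_node() having run */
	bool inode_hashed;	/* stands in for insert_inode_hash() having run */
};

/* What an opener would observe: path lookup first, then the inode lookup. */
static int toy_open(const struct toy_disk *d)
{
	if (!d->node_visible)
		return -ENOENT;	/* no /dev node yet */
	if (!d->inode_hashed)
		return -ENXIO;	/* node exists, but the inode lookup finds nothing */
	return 0;
}

/* Replays the ordering from the call chain and checks each stage. */
static void toy_demo(void)
{
	struct toy_disk d = { false, false };

	assert(toy_open(&d) == -ENOENT);

	d.node_visible = true;			/* __add_disk -> device_add */
	assert(toy_open(&d) == -ENXIO);		/* the race window */

	d.inode_hashed = true;			/* add_disk_final -> bdev_add */
	assert(toy_open(&d) == 0);
}
```

Calling `toy_demo()` walks the three stages in order; the middle stage corresponds to the -ENXIO seen in the mount trace, and the msleep() in the diff simply widens that stage.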


I think this is enough to explain the root cause.

Thanks,
Gao Xiang