[PATCH v8 4/6] kernfs: Don't re-lock kernfs_root::kernfs_rwsem in kernfs_fop_readdir().

From: Zhu, Yi

Date: Mon Feb 23 2026 - 16:56:13 EST


Hi Sebastian,

We hit a potential ABBA deadlock while running selftests vfio_pci_device_test with this patch.

Thread 1 calls syscall getdents(2) on a kernfs directory: lock B->C->A dependency
kernfs_fop_readdir()
-> mutex_lock(&root->kernfs_rwsem): acquired root->kernfs_rwsem (Lock_A)
-> dir_emit()
-> filldir()
-> unsafe_copy_dirent_name(): <- Hit page fault
-> do_user_addr_fault()
-> lock_mm_and_find_vma()
-> mmap_read_lock_killable(): acquired read mm->mmap_lock (Lock_C): Lock C depends on A
...
-> handle_mm_fault()
-> handle_pte_fault()
-> __do_fault()
-> vm_ops->fault() = vfio_pci_mmap_page_fault()
-> vfio_pci_mmap_huge_fault(): will acquire vdev->memory_lock (Lock_B): Lock B depends on C

Thread 2 runs vfio_pci_device_test kernel selftests: lock A->B dependency
vfio_pci_irq_test__enable_trigger_disable()
-> vfio_pci_irq_enable()
-> vfio_pci_core_ioctl(VFIO_DEVICE_SET_IRQS...)
-> vfio_pci_ioctl_set_irqs()
-> vfio_irq_enable()
-> vfio_pci_memory_lock_and_enable(): acquired vdev->memory_lock (Lock_B)
-> pci_alloc_irq_vectors()
-> msi_setup_device_data()
-> msi_sysfs_create_group()
-> kernfs_create_dir_ns()
-> kernfs_add_one(): will acquire root->kernfs_rwsem (Lock_A): Lock A will depend on B

The full lockdep warning on v6.18 kernel can be found below.
Hardware: Intel Skylake
Kbuild: CONFIG_VFIO=y, CONFIG_IOMMUFD=y
Kernel Parameters: intel_iommu=on
--

[ 1055.224342] ======================================================
[ 1055.224346] WARNING: possible circular locking dependency detected
[ 1055.224349] 6.18.0-dbg-DEV #48 Tainted: G S O
[ 1055.224353] ------------------------------------------------------
[ 1055.224356] vfio_pci_device/15895 is trying to acquire lock:
[ 1055.224359] ffff94f340a64188 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_add_one+0x2d/0x330
[ 1055.224373]
but task is already holding lock:
[ 1055.224376] ffff94f3f376c960 (&vdev->memory_lock){++++}-{4:4}, at: vfio_pci_memory_lock_and_enable+0x2a/0x90
[ 1055.224387]
which lock already depends on the new lock.

[ 1055.224391]
the existing dependency chain (in reverse order) is:
[ 1055.224395]
-> #2 (&vdev->memory_lock){++++}-{4:4}:
[ 1055.224400] down_read+0x3f/0x180
[ 1055.224406] vfio_pci_mmap_huge_fault+0xbc/0x1a0
[ 1055.224410] __do_fault+0x46/0x180
[ 1055.224417] do_pte_missing+0x214/0x1300
[ 1055.224421] handle_mm_fault+0x795/0xb00
[ 1055.224425] do_user_addr_fault+0x4b7/0x6d0
[ 1055.224431] exc_page_fault+0x7d/0xe0
[ 1055.224437] asm_exc_page_fault+0x26/0x30
[ 1055.224452]
-> #1 (&mm->mmap_lock){++++}-{4:4}:
[ 1055.224457] down_read_killable+0x47/0x1b0
[ 1055.224460] mmap_read_lock_killable+0x12/0x50
[ 1055.224463] lock_mm_and_find_vma+0x125/0x140
[ 1055.224470] do_user_addr_fault+0x3d8/0x6d0
[ 1055.224474] exc_page_fault+0x7d/0xe0
[ 1055.224478] asm_exc_page_fault+0x26/0x30
[ 1055.224481] filldir+0xe4/0x1a0
[ 1055.224501] kernfs_fop_readdir+0x1af/0x2d0
[ 1055.224505] iterate_dir+0x84/0x160
[ 1055.224510] __se_sys_getdents+0x6c/0x110
[ 1055.224514] do_syscall_64+0xf0/0x930
[ 1055.224518] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 1055.224521]
-> #0 (&root->kernfs_rwsem){++++}-{4:4}:
[ 1055.224526] __lock_acquire+0x15a7/0x2d10
[ 1055.224531] lock_acquire+0xe8/0x2b0
[ 1055.224534] down_write+0x3a/0xb0
[ 1055.224537] kernfs_add_one+0x2d/0x330
[ 1055.224541] kernfs_create_dir_ns+0xab/0xe0
[ 1055.224545] internal_create_group+0x19d/0x4f0
[ 1055.224549] devm_device_add_group+0x49/0x80
[ 1055.224553] msi_setup_device_data+0x5b/0x100
[ 1055.224557] pci_setup_msi_context+0x1a/0xa0
[ 1055.224561] __pci_enable_msix_range+0x19b/0x220
[ 1055.224565] pci_alloc_irq_vectors_affinity+0xa0/0x130
[ 1055.224568] vfio_pci_set_msi_trigger+0x12b/0x300
[ 1055.224571] vfio_pci_core_ioctl+0x84c/0xf20
[ 1055.224575] vfio_device_fops_unl_ioctl+0x18d/0x350
[ 1055.224581] __se_sys_ioctl+0x71/0xc0
[ 1055.224585] do_syscall_64+0xf0/0x930
[ 1055.224588] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 1055.224591]
other info that might help us debug this:

[ 1055.224594] Chain exists of:
&root->kernfs_rwsem --> &mm->mmap_lock --> &vdev->memory_lock

[ 1055.224602] Possible unsafe locking scenario:

[ 1055.224605] CPU0 CPU1
[ 1055.224607] ---- ----
[ 1055.224610] lock(&vdev->memory_lock);
[ 1055.224613] lock(&mm->mmap_lock);
[ 1055.224616] lock(&vdev->memory_lock);
[ 1055.224631] lock(&root->kernfs_rwsem);
[ 1055.224633]
*** DEADLOCK ***

[ 1055.224636] 2 locks held by vfio_pci_device/15895:
[ 1055.224639] #0: ffff94f3f376c728 (&vdev->igate){+.+.}-{4:4}, at: vfio_pci_core_ioctl+0x830/0xf20
[ 1055.224646] #1: ffff94f3f376c960 (&vdev->memory_lock){++++}-{4:4}, at: vfio_pci_memory_lock_and_enable+0x2a/0x90
[ 1055.224654]
stack backtrace:
[ 1055.224657] CPU: 90 UID: 0 PID: 15895 Comm: vfio_pci_device Tainted: G S O 6.18.0-dbg-DEV #48 NONE
[ 1055.224660] Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE
[ 1055.224660] Hardware name: Google LLC Indus/Indus_QC_00, BIOS 30.116.4 08/29/2025
[ 1055.224662] Call Trace:
[ 1055.224663] <TASK>
[ 1055.224664] dump_stack_lvl+0x7d/0xc0
[ 1055.224667] print_circular_bug+0x2e8/0x300
[ 1055.224670] check_noncircular+0xff/0x120
[ 1055.224672] __lock_acquire+0x15a7/0x2d10
[ 1055.224675] ? __kernfs_new_node+0x8c/0x2b0
[ 1055.224679] ? kernfs_root+0x16/0x140
[ 1055.224680] ? kernfs_add_one+0x2d/0x330
[ 1055.224682] lock_acquire+0xe8/0x2b0
[ 1055.224684] ? kernfs_add_one+0x2d/0x330
[ 1055.224687] down_write+0x3a/0xb0
[ 1055.224689] ? kernfs_add_one+0x2d/0x330
[ 1055.224690] kernfs_add_one+0x2d/0x330
[ 1055.224693] kernfs_create_dir_ns+0xab/0xe0
[ 1055.224695] internal_create_group+0x19d/0x4f0
[ 1055.224698] devm_device_add_group+0x49/0x80
[ 1055.224700] msi_setup_device_data+0x5b/0x100
[ 1055.224702] pci_setup_msi_context+0x1a/0xa0
[ 1055.224703] __pci_enable_msix_range+0x19b/0x220
[ 1055.224705] pci_alloc_irq_vectors_affinity+0xa0/0x130
[ 1055.224708] vfio_pci_set_msi_trigger+0x12b/0x300
[ 1055.224710] vfio_pci_core_ioctl+0x84c/0xf20
[ 1055.224713] ? _raw_spin_unlock_irqrestore+0x35/0x50
[ 1055.224717] vfio_device_fops_unl_ioctl+0x18d/0x350
[ 1055.224720] __se_sys_ioctl+0x71/0xc0
[ 1055.224722] do_syscall_64+0xf0/0x930
[ 1055.224724] ? entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 1055.224725] ? exc_page_fault+0x9e/0xe0
[ 1055.224727] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 1055.224729] RIP: 0033:0x7fd526765887
[ 1055.224731] Code: cc cc cc 48 8b 05 99 39 07 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 cc cc cc cc cc cc cc cc cc cc b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 39 07 00 f7 d8 64 89 01 48
[ 1055.224733] RSP: 002b:00007ffff5c63438 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1055.224735] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd526765887
[ 1055.224736] RDX: 00007ffff5c63440 RSI: 0000000000003b6e RDI: 0000000000000007
[ 1055.224737] RBP: 00007ffff5c634e0 R08: 0000000020ff67e8 R09: 0000000000000000
[ 1055.224738] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000021a180
[ 1055.224738] R13: 00007fd526836798 R14: 00007ffff5c63c78 R15: 0000000000000001
[ 1055.224743] </TASK>


> The readdir operation iterates over all entries and invokes dir_emit()
> for every entry passing kernfs_node::name as argument.
> Since the name argument can change, and become invalid, the
> kernfs_root::kernfs_rwsem lock should not be dropped to prevent renames
> during the operation.
>
> The lock drop around dir_emit() has been initially introduced in commit
> 1e5289c97bba2 ("sysfs: Cache the last sysfs_dirent to improve readdir scalability v2")
>
> to avoid holding a global lock during a page fault. The lock drop is
> wrong since the support of renames and not a big burden since the lock
> is no longer global.
>
> Don't re-acquire kernfs_root::kernfs_rwsem while copying the name to the
> userpace buffer.
>
> Acked-by: Tejun Heo <tj@xxxxxxxxxx>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
> ---
> fs/kernfs/dir.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 5f0f8b95f44c0..43fbada678381 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1869,10 +1869,10 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
> file->private_data = pos;
> kernfs_get(pos);
>
> - up_read(&root->kernfs_rwsem);
> - if (!dir_emit(ctx, name, len, ino, type))
> + if (!dir_emit(ctx, name, len, ino, type)) {
> + up_read(&root->kernfs_rwsem);
> return 0;
> - down_read(&root->kernfs_rwsem);
> + }
> }
> up_read(&root->kernfs_rwsem);
> file->private_data = NULL;