Re: [PATCH 0/4] blk-rq-qos: fix possible deadlock
From: Yu Kuai
Date: Tue Oct 14 2025 - 21:36:09 EST
Hi,
On 2025/10/15 1:57, Nilay Shroff wrote:
On 10/14/25 4:44 PM, Yu Kuai wrote:
Hi,
On 2025/10/14 18:58, Nilay Shroff wrote:
On 10/14/25 7:51 AM, Yu Kuai wrote:
Currently rq-qos debugfs entries are created from rq_qos_add(), while
rq_qos_add() requires the queue to be frozen. This can deadlock because
creating new entries can trigger fs-reclaim.
Fix this problem by delaying creation of the rq-qos debugfs entries until
the policy's initialization is complete.
- For wbt, it can be initialized by default or via blk-sysfs; fix it by
delaying until after wbt_init();
- For other policies, they can only be initialized by blkg configuration;
fix it by delaying to blkg_conf_end();
Note this set is cooked on top of my other thread:
https://lore.kernel.org/all/20251010091446.3048529-1-yukuai@xxxxxxxxxx/
And the deadlock can be reproduced with the above thread by running blktests
throtl/001 with wbt enabled by default, though the deadlock is really a
long-standing problem.
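As an illustration only, here is a minimal sketch of the ordering described
above for the wbt case. This is not the actual patches: wbt_init_sketch is a
made-up name, it assumes the caller-side freeze and the freeze API that
returns memalloc flags, and wbt_rqos_ops refers to wbt's existing ops table.

/* Sketch (imagine it living in block/blk-wbt.c): register the policy with
 * the queue frozen, but create the debugfs entries only after unfreezing,
 * so the debugfs inode allocation can no longer end up waiting behind a
 * frozen queue. */
static int wbt_init_sketch(struct gendisk *disk, struct rq_qos *rqos)
{
	struct request_queue *q = disk->queue;
	unsigned int memflags;
	int ret;

	memflags = blk_mq_freeze_queue(q);
	mutex_lock(&q->rq_qos_mutex);
	ret = rq_qos_add(rqos, disk, RQ_QOS_WBT, &wbt_rqos_ops);
	mutex_unlock(&q->rq_qos_mutex);
	blk_mq_unfreeze_queue(q, memflags);
	if (ret)
		return ret;

	/* The queue is not frozen here, so fs-reclaim triggered by the
	 * debugfs inode allocation cannot deadlock against the freeze. */
	mutex_lock(&q->debugfs_mutex);
	blk_mq_debugfs_register_rqos(rqos);
	mutex_unlock(&q->debugfs_mutex);
	return 0;
}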
While freezing the queue we also mark GFP_NOIO scope, so doesn't that
help avoid fs-reclaim? Or maybe you can share the lockdep splat
encountered running throtl/001?
Yes, we can avoid fs-reclaim in the context that freezes the queue.
However, because debugfs is a generic filesystem, we can't avoid
fs-reclaim from all contexts.
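To illustrate (a minimal sketch, assuming the behaviour where freezing the
queue also enters memalloc_noio_save()): the NOIO scope is recorded in
current->flags, so it only constrains allocations made by the task that
froze the queue; a different task allocating a debugfs inode never entered
that scope.

#include <linux/sched/mm.h>
#include <linux/slab.h>

/* Task A, the one freezing the queue: its own allocations are forced to
 * behave as GFP_NOIO while the scope is active. */
static void *alloc_inside_noio_scope(void)
{
	unsigned int noio_flags = memalloc_noio_save(); /* sets PF_MEMALLOC_NOIO on current */
	void *p = kmalloc(4096, GFP_KERNEL);            /* implicitly treated as GFP_NOIO */

	memalloc_noio_restore(noio_flags);
	return p;
}

/* Task B, e.g. a writer of a blkcg file that creates debugfs entries: it
 * never entered the scope above, so its allocation may recurse into
 * fs-reclaim and then wait behind the frozen queue. */
static void *alloc_from_another_task(void)
{
	return kmalloc(4096, GFP_KERNEL);
}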
The following is the log with the above set and wbt enabled by default; the
set acquires locks in the order:
freeze queue -> elevator lock -> rq_qos_mutex -> blkcg_mutex
However, fs-reclaim from another context causes the deadlock report.
[ 45.632372][ T531] ======================================================
[ 45.633734][ T531] WARNING: possible circular locking dependency detected
[ 45.635062][ T531] 6.17.0-gfd4a560a0864-dirty #30 Not tainted
[ 45.636220][ T531] ------------------------------------------------------
[ 45.637587][ T531] check/531 is trying to acquire lock:
[ 45.638626][ T531] ffff9473884382b0 (&q->rq_qos_mutex){+.+.}-{4:4}, at: blkg_conf_start+0x116/0x190
[ 45.640416][ T531]
[ 45.640416][ T531] but task is already holding lock:
[ 45.641828][ T531] ffff9473884385d8 (&q->elevator_lock){+.+.}-{4:4}, at: blkg_conf_start+0x108/0x190
[ 45.643322][ T531]
[ 45.643322][ T531] which lock already depends on the new lock.
[ 45.643322][ T531]
[ 45.644862][ T531]
[ 45.644862][ T531] the existing dependency chain (in reverse order) is:
[ 45.646046][ T531]
[ 45.646046][ T531] -> #5 (&q->elevator_lock){+.+.}-{4:4}:
[ 45.647052][ T531] __mutex_lock+0xd3/0x8d0
[ 45.647716][ T531] blkg_conf_start+0x108/0x190
[ 45.648395][ T531] tg_set_limit+0x74/0x300
[ 45.649046][ T531] kernfs_fop_write_iter+0x14a/0x210
[ 45.649813][ T531] vfs_write+0x29e/0x550
[ 45.650413][ T531] ksys_write+0x74/0xf0
[ 45.651032][ T531] do_syscall_64+0xbb/0x380
[ 45.651730][ T531] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 45.652533][ T531]
[ 45.652533][ T531] -> #4 (&q->q_usage_counter(io)#3){++++}-{0:0}:
[ 45.653297][ T531] blk_alloc_queue+0x30b/0x350
[ 45.653807][ T531] blk_mq_alloc_queue+0x5d/0xd0
[ 45.654300][ T531] __blk_mq_alloc_disk+0x13/0x60
[ 45.654810][ T531] null_add_dev+0x2f8/0x660 [null_blk]
[ 45.655374][ T531] nullb_device_power_store+0x88/0x130 [null_blk]
[ 45.656009][ T531] configfs_write_iter+0xcb/0x130 [configfs]
[ 45.656614][ T531] vfs_write+0x29e/0x550
[ 45.657045][ T531] ksys_write+0x74/0xf0
[ 45.657497][ T531] do_syscall_64+0xbb/0x380
[ 45.657958][ T531] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 45.658595][ T531]
[ 45.658595][ T531] -> #3 (fs_reclaim){+.+.}-{0:0}:
[ 45.659266][ T531] fs_reclaim_acquire+0xa4/0xe0
[ 45.659783][ T531] kmem_cache_alloc_lru_noprof+0x3b/0x2a0
[ 45.660369][ T531] alloc_inode+0x1a/0x60
[ 45.660873][ T531] new_inode+0xd/0x90
[ 45.661291][ T531] debugfs_create_dir+0x38/0x160
[ 45.661806][ T531] component_debug_init+0x12/0x20
[ 45.662316][ T531] do_one_initcall+0x68/0x2b0
[ 45.662807][ T531] kernel_init_freeable+0x238/0x290
[ 45.663241][ T531] kernel_init+0x15/0x130
[ 45.663659][ T531] ret_from_fork+0x182/0x1d0
[ 45.664054][ T531] ret_from_fork_asm+0x1a/0x30
[ 45.664601][ T531]
[ 45.664601][ T531] -> #2 (&sb->s_type->i_mutex_key#2){+.+.}-{4:4}:
[ 45.665274][ T531] down_write+0x3a/0xb0
[ 45.665669][ T531] simple_start_creating+0x51/0x110
[ 45.666097][ T531] debugfs_start_creating+0x68/0xd0
[ 45.666561][ T531] debugfs_create_dir+0x10/0x160
[ 45.666970][ T531] blk_register_queue+0x8b/0x1e0
[ 45.667386][ T531] __add_disk+0x253/0x3f0
[ 45.667804][ T531] add_disk_fwnode+0x71/0x180
[ 45.668205][ T531] virtblk_probe+0x9c2/0xb50
[ 45.668603][ T531] virtio_dev_probe+0x223/0x380
[ 45.669004][ T531] really_probe+0xc2/0x260
[ 45.669369][ T531] __driver_probe_device+0x6e/0x100
[ 45.669856][ T531] driver_probe_device+0x1a/0xd0
[ 45.670263][ T531] __driver_attach+0x93/0x1a0
[ 45.670672][ T531] bus_for_each_dev+0x87/0xe0
[ 45.671063][ T531] bus_add_driver+0xe0/0x1d0
[ 45.671469][ T531] driver_register+0x70/0xe0
[ 45.671856][ T531] virtio_blk_init+0x53/0x80
[ 45.672258][ T531] do_one_initcall+0x68/0x2b0
[ 45.672661][ T531] kernel_init_freeable+0x238/0x290
[ 45.673097][ T531] kernel_init+0x15/0x130
[ 45.673510][ T531] ret_from_fork+0x182/0x1d0
[ 45.673907][ T531] ret_from_fork_asm+0x1a/0x30
[ 45.674319][ T531]
[ 45.674319][ T531] -> #1 (&q->debugfs_mutex){+.+.}-{4:4}:
[ 45.674929][ T531] __mutex_lock+0xd3/0x8d0
[ 45.675315][ T531] rq_qos_add+0xe0/0x130
[ 45.675717][ T531] wbt_init+0x183/0x210
[ 45.676062][ T531] blk_register_queue+0xdf/0x1e0
[ 45.676490][ T531] __add_disk+0x253/0x3f0
[ 45.676844][ T531] add_disk_fwnode+0x71/0x180
[ 45.677247][ T531] virtblk_probe+0x9c2/0xb50
[ 45.677648][ T531] virtio_dev_probe+0x223/0x380
[ 45.678044][ T531] really_probe+0xc2/0x260
[ 45.678411][ T531] __driver_probe_device+0x6e/0x100
[ 45.678875][ T531] driver_probe_device+0x1a/0xd0
[ 45.679282][ T531] __driver_attach+0x93/0x1a0
[ 45.679678][ T531] bus_for_each_dev+0x87/0xe0
[ 45.680053][ T531] bus_add_driver+0xe0/0x1d0
[ 45.680458][ T531] driver_register+0x70/0xe0
[ 45.680823][ T531] virtio_blk_init+0x53/0x80
[ 45.681208][ T531] do_one_initcall+0x68/0x2b0
[ 45.681611][ T531] kernel_init_freeable+0x238/0x290
[ 45.682027][ T531] kernel_init+0x15/0x130
[ 45.682392][ T531] ret_from_fork+0x182/0x1d0
[ 45.682829][ T531] ret_from_fork_asm+0x1a/0x30
[ 45.683240][ T531]
[ 45.683240][ T531] -> #0 (&q->rq_qos_mutex){+.+.}-{4:4}:
[ 45.683867][ T531] __lock_acquire+0x103d/0x1650
[ 45.684411][ T531] lock_acquire+0xbc/0x2c0
[ 45.684798][ T531] __mutex_lock+0xd3/0x8d0
[ 45.685172][ T531] blkg_conf_start+0x116/0x190
[ 45.685623][ T531] tg_set_limit+0x74/0x300
[ 45.685986][ T531] kernfs_fop_write_iter+0x14a/0x210
[ 45.686405][ T531] vfs_write+0x29e/0x550
[ 45.686771][ T531] ksys_write+0x74/0xf0
[ 45.687112][ T531] do_syscall_64+0xbb/0x380
[ 45.687514][ T531] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 45.687983][ T531]
[ 45.687983][ T531] other info that might help us debug this:
[ 45.687983][ T531]
[ 45.688760][ T531] Chain exists of:
[ 45.688760][ T531] &q->rq_qos_mutex --> &q->q_usage_counter(io)#3 --> &q->elevator_lock
[ 45.688760][ T531]
[ 45.689817][ T531] Possible unsafe locking scenario:
[ 45.689817][ T531]
[ 45.690359][ T531]        CPU0                    CPU1
[ 45.690764][ T531]        ----                    ----
[ 45.691172][ T531]   lock(&q->elevator_lock);
[ 45.691544][ T531]                                lock(&q->q_usage_counter(io)#3);
[ 45.692108][ T531]                                lock(&q->elevator_lock);
[ 45.692659][ T531]   lock(&q->rq_qos_mutex);
[ 45.692994][ T531]
[ 45.692994][ T531] *** DEADLOCK ***
From the above trace, I see that thread #1 (virtblk device) is
unable to acquire &q->debugfs_mutex and it's pending on thread #2
(another virtblk device) because thread #2 has already acquired
the same &q->debugfs_mutex. And this sequence of events appears
to be a false positive. The reasoning is as follows:
Yes, cases #1 and #2 are both from __add_disk(), and they can't run
concurrently with each other for the same disk.
Each virtio block device (e.g., virtblk0, virtblk1, etc.) has its own
struct request_queue, and therefore its own instance of q->debugfs_mutex.
These mutexes are distinct objects at different memory addresses. Hence,
thread #1 cannot physically block on thread #2’s q->debugfs_mutex, since
the mutex wait queue is per-lock address.
However, lockdep does not track individual mutex addresses. Instead, it
operates at the level of lock classes, identified by a lock class key.
Because all q->debugfs_mutex instances share the same lock class key,
lockdep treats them as equivalent and reports a potential circular
dependency across devices, even though such a dependency cannot actually
occur at runtime. So IMO, to eliminate this false-positive warning, we
should assign a separate lockdep class/key for each queue’s q->debugfs_mutex.
This will ensure lockdep distinguishes mutex instances belonging to
different queues.
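For illustration, a minimal sketch of that idea (not a proposed patch; the
debugfs_lock_key field and the helper names are made up), using lockdep's
dynamic key registration so each queue's debugfs_mutex gets its own class:

#include <linux/lockdep.h>
#include <linux/mutex.h>

/* Assumes struct request_queue grows a per-instance key, e.g.:
 *	struct lock_class_key debugfs_lock_key;
 */
static void blk_debugfs_mutex_init(struct request_queue *q)
{
	mutex_init(&q->debugfs_mutex);
	/* Dynamic keys must be registered before use and unregistered
	 * before the memory holding them is freed. */
	lockdep_register_key(&q->debugfs_lock_key);
	lockdep_set_class(&q->debugfs_mutex, &q->debugfs_lock_key);
}

static void blk_debugfs_mutex_exit(struct request_queue *q)
{
	mutex_destroy(&q->debugfs_mutex);
	lockdep_unregister_key(&q->debugfs_lock_key);
}

With per-queue keys, lockdep would no longer fold debugfs_mutex dependencies
recorded on different queues into a single class.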
Lockdep just shows a possible circular locking dependency, and we can
absolutely avoid the above warning with a separate lockdep key. However, I
still feel that fixing the circular dependency by simply not creating debugfs
entries while the queue is frozen is better, unless it's too complicated.
There might still be other undiscovered deadlocks lurking somewhere...
BTW, we should probably document the locking order somewhere.
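Something like the following, for example (purely a sketch of what such
documentation could look like, based on the order listed earlier in this
mail):

/*
 * Sketch of a possible locking-order comment (not existing kernel text):
 *
 *   q->q_usage_counter (queue freeze)
 *     -> q->elevator_lock
 *       -> q->rq_qos_mutex
 *         -> blkcg_mutex
 *
 * debugfs entries must not be created while the queue is frozen, since
 * allocating the debugfs inode can enter fs-reclaim, which may in turn
 * wait on the frozen queue.
 */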
Thanks,
Kuai
Thanks,
--Nilay