Re: [RFC PATCH] block: Make request_queue lockdep splats show up earlier
From: Thomas Hellström
Date: Tue Mar 18 2025 - 03:59:33 EST
On Tue, 2025-03-18 at 09:18 +0800, Ming Lei wrote:
> On Mon, Mar 17, 2025 at 06:11:55PM +0100, Thomas Hellström wrote:
> > In recent kernels, there are lockdep splats around the
> > struct request_queue::io_lockdep_map, similar to [1], but they
> > typically don't show up until reclaim with writeback happens.
> >
> > Multiple kernel versions have been released with a known risk of
> > kernel deadlock during reclaim with writeback; IMHO this should be
> > addressed and the fix backported to -stable with the highest
> > priority.
> >
> > In order to have these lockdep splats show up earlier, preferably
> > during system initialization, prime the
> > struct request_queue::io_lockdep_map as GFP_KERNEL reclaim-tainted.
> > This will instead lead to lockdep splats looking similar to [2],
> > but without the need for reclaim + writeback happening.
> >
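> > Concretely, the priming is a one-off, annotation-only
> > acquire/release cycle at queue allocation time (this is the hunk
> > added below):
> >
> > 	/* Prime io_lockdep_map as reclaim tainted */
> > 	fs_reclaim_acquire(GFP_KERNEL);
> > 	rwsem_acquire_read(&q->io_lockdep_map, 0, 0, _RET_IP_);
> > 	rwsem_release(&q->io_lockdep_map, _RET_IP_);
> > 	fs_reclaim_release(GFP_KERNEL);
> >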
> > [1]:
> > [ 189.762244]
> > ======================================================
> > [ 189.762432] WARNING: possible circular locking dependency detected
> > [ 189.762441] 6.14.0-rc6-xe+ #6 Tainted: G U
> > [ 189.762450] ----------------------------------------------------
> > --
> > [ 189.762459] kswapd0/119 is trying to acquire lock:
> > [ 189.762467] ffff888110ceb710 (&q->q_usage_counter(io)#26){++++}-{0:0}, at: __submit_bio+0x76/0x230
> > [ 189.762485]
> > but task is already holding lock:
> > [ 189.762494] ffffffff834c97c0 (fs_reclaim){+.+.}-{0:0}, at:
> > balance_pgdat+0xbe/0xb00
> > [ 189.762507]
> > which lock already depends on the new lock.
> >
> > [ 189.762519]
> > the existing dependency chain (in reverse order) is:
> > [ 189.762529]
> > -> #2 (fs_reclaim){+.+.}-{0:0}:
> > [ 189.762540] fs_reclaim_acquire+0xc5/0x100
> > [ 189.762548] kmem_cache_alloc_lru_noprof+0x4a/0x480
> > [ 189.762558] alloc_inode+0xaa/0xe0
> > [ 189.762566] iget_locked+0x157/0x330
> > [ 189.762573] kernfs_get_inode+0x1b/0x110
> > [ 189.762582] kernfs_get_tree+0x1b0/0x2e0
> > [ 189.762590] sysfs_get_tree+0x1f/0x60
> > [ 189.762597] vfs_get_tree+0x2a/0xf0
> > [ 189.762605] path_mount+0x4cd/0xc00
> > [ 189.762613] __x64_sys_mount+0x119/0x150
> > [ 189.762621] x64_sys_call+0x14f2/0x2310
> > [ 189.762630] do_syscall_64+0x91/0x180
> > [ 189.762637] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 189.762647]
> > -> #1 (&root->kernfs_rwsem){++++}-{3:3}:
> > [ 189.762659] down_write+0x3e/0xf0
> > [ 189.762667] kernfs_remove+0x32/0x60
> > [ 189.762676] sysfs_remove_dir+0x4f/0x60
> > [ 189.762685] __kobject_del+0x33/0xa0
> > [ 189.762709] kobject_del+0x13/0x30
> > [ 189.762716] elv_unregister_queue+0x52/0x80
> > [ 189.762725] elevator_switch+0x68/0x360
> > [ 189.762733] elv_iosched_store+0x14b/0x1b0
> > [ 189.762756] queue_attr_store+0x181/0x1e0
> > [ 189.762765] sysfs_kf_write+0x49/0x80
> > [ 189.762773] kernfs_fop_write_iter+0x17d/0x250
> > [ 189.762781] vfs_write+0x281/0x540
> > [ 189.762790] ksys_write+0x72/0xf0
> > [ 189.762798] __x64_sys_write+0x19/0x30
> > [ 189.762807] x64_sys_call+0x2a3/0x2310
> > [ 189.762815] do_syscall_64+0x91/0x180
> > [ 189.762823] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 189.762833]
> > -> #0 (&q->q_usage_counter(io)#26){++++}-{0:0}:
> > [ 189.762845] __lock_acquire+0x1525/0x2760
> > [ 189.762854] lock_acquire+0xca/0x310
> > [ 189.762861] blk_mq_submit_bio+0x8a2/0xba0
> > [ 189.762870] __submit_bio+0x76/0x230
> > [ 189.762878] submit_bio_noacct_nocheck+0x323/0x430
> > [ 189.762888] submit_bio_noacct+0x2cc/0x620
> > [ 189.762896] submit_bio+0x38/0x110
> > [ 189.762904] __swap_writepage+0xf5/0x380
> > [ 189.762912] swap_writepage+0x3c7/0x600
> > [ 189.762920] shmem_writepage+0x3da/0x4f0
> > [ 189.762929] pageout+0x13f/0x310
> > [ 189.762937] shrink_folio_list+0x61c/0xf60
> > [ 189.763261] evict_folios+0x378/0xcd0
> > [ 189.763584] try_to_shrink_lruvec+0x1b0/0x360
> > [ 189.763946] shrink_one+0x10e/0x200
> > [ 189.764266] shrink_node+0xc02/0x1490
> > [ 189.764586] balance_pgdat+0x563/0xb00
> > [ 189.764934] kswapd+0x1e8/0x430
> > [ 189.765249] kthread+0x10b/0x260
> > [ 189.765559] ret_from_fork+0x44/0x70
> > [ 189.765889] ret_from_fork_asm+0x1a/0x30
> > [ 189.766198]
> > other info that might help us debug this:
> >
> > [ 189.767089] Chain exists of:
> > &q->q_usage_counter(io)#26 --> &root->kernfs_rwsem --> fs_reclaim
> >
> > [ 189.767971] Possible unsafe locking scenario:
> >
> > [ 189.768555] CPU0 CPU1
> > [ 189.768849] ---- ----
> > [ 189.769136] lock(fs_reclaim);
> > [ 189.769421] lock(&root->kernfs_rwsem);
> > [ 189.769714] lock(fs_reclaim);
> > [ 189.770016] rlock(&q->q_usage_counter(io)#26);
> > [ 189.770305]
> > *** DEADLOCK ***
> >
> > [ 189.771167] 1 lock held by kswapd0/119:
> > [ 189.771453] #0: ffffffff834c97c0 (fs_reclaim){+.+.}-{0:0}, at:
> > balance_pgdat+0xbe/0xb00
> > [ 189.771770]
> > stack backtrace:
> > [ 189.772351] CPU: 4 UID: 0 PID: 119 Comm: kswapd0 Tainted: G U 6.14.0-rc6-xe+ #6
> > [ 189.772353] Tainted: [U]=USER
> > [ 189.772354] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023
> > [ 189.772354] Call Trace:
> > [ 189.772355] <TASK>
> > [ 189.772356] dump_stack_lvl+0x6e/0xa0
> > [ 189.772359] dump_stack+0x10/0x18
> > [ 189.772360] print_circular_bug.cold+0x17a/0x1b7
> > [ 189.772363] check_noncircular+0x13a/0x150
> > [ 189.772365] ? __pfx_stack_trace_consume_entry+0x10/0x10
> > [ 189.772368] __lock_acquire+0x1525/0x2760
> > [ 189.772368] ? ret_from_fork_asm+0x1a/0x30
> > [ 189.772371] lock_acquire+0xca/0x310
> > [ 189.772372] ? __submit_bio+0x76/0x230
> > [ 189.772375] ? lock_release+0xd5/0x2c0
> > [ 189.772376] blk_mq_submit_bio+0x8a2/0xba0
> > [ 189.772378] ? __submit_bio+0x76/0x230
> > [ 189.772380] __submit_bio+0x76/0x230
> > [ 189.772382] ? trace_hardirqs_on+0x1e/0xe0
> > [ 189.772384] submit_bio_noacct_nocheck+0x323/0x430
> > [ 189.772386] ? submit_bio_noacct_nocheck+0x323/0x430
> > [ 189.772387] ? __might_sleep+0x58/0xa0
> > [ 189.772390] submit_bio_noacct+0x2cc/0x620
> > [ 189.772391] ? count_memcg_events+0x68/0x90
> > [ 189.772393] submit_bio+0x38/0x110
> > [ 189.772395] __swap_writepage+0xf5/0x380
> > [ 189.772396] swap_writepage+0x3c7/0x600
> > [ 189.772397] shmem_writepage+0x3da/0x4f0
> > [ 189.772401] pageout+0x13f/0x310
> > [ 189.772406] shrink_folio_list+0x61c/0xf60
> > [ 189.772409] ? isolate_folios+0xe80/0x16b0
> > [ 189.772410] ? mark_held_locks+0x46/0x90
> > [ 189.772412] evict_folios+0x378/0xcd0
> > [ 189.772414] ? evict_folios+0x34a/0xcd0
> > [ 189.772415] ? lock_is_held_type+0xa3/0x130
> > [ 189.772417] try_to_shrink_lruvec+0x1b0/0x360
> > [ 189.772420] shrink_one+0x10e/0x200
> > [ 189.772421] shrink_node+0xc02/0x1490
> > [ 189.772423] ? shrink_node+0xa08/0x1490
> > [ 189.772424] ? shrink_node+0xbd8/0x1490
> > [ 189.772425] ? mem_cgroup_iter+0x366/0x480
> > [ 189.772427] balance_pgdat+0x563/0xb00
> > [ 189.772428] ? balance_pgdat+0x563/0xb00
> > [ 189.772430] ? trace_hardirqs_on+0x1e/0xe0
> > [ 189.772431] ? finish_task_switch.isra.0+0xcb/0x330
> > [ 189.772433] ? __switch_to_asm+0x33/0x70
> > [ 189.772437] kswapd+0x1e8/0x430
> > [ 189.772438] ? __pfx_autoremove_wake_function+0x10/0x10
> > [ 189.772440] ? __pfx_kswapd+0x10/0x10
> > [ 189.772441] kthread+0x10b/0x260
> > [ 189.772443] ? __pfx_kthread+0x10/0x10
> > [ 189.772444] ret_from_fork+0x44/0x70
> > [ 189.772446] ? __pfx_kthread+0x10/0x10
> > [ 189.772447] ret_from_fork_asm+0x1a/0x30
> > [ 189.772450] </TASK>
> >
> > [2]:
> > [ 8.760253]
> > ======================================================
> > [ 8.760254] WARNING: possible circular locking dependency detected
> > [ 8.760255] 6.14.0-rc6-xe+ #7 Tainted: G U
> > [ 8.760256] ----------------------------------------------------
> > --
> > [ 8.760257] (udev-worker)/674 is trying to acquire lock:
> > [ 8.760259] ffff888100e39148 (&root->kernfs_rwsem){++++}-{3:3},
> > at: kernfs_remove+0x32/0x60
> > [ 8.760265]
> > but task is already holding lock:
> > [ 8.760266] ffff888110dc7680 (&q->q_usage_counter(io)#27){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x12/0x30
> > [ 8.760272]
> > which lock already depends on the new lock.
> >
> > [ 8.760272]
> > the existing dependency chain (in reverse order) is:
> > [ 8.760273]
> > -> #2 (&q->q_usage_counter(io)#27){++++}-{0:0}:
> > [ 8.760276] blk_alloc_queue+0x30a/0x350
> > [ 8.760279] blk_mq_alloc_queue+0x6b/0xe0
> > [ 8.760281] scsi_alloc_sdev+0x276/0x3c0
> > [ 8.760284] scsi_probe_and_add_lun+0x22a/0x440
> > [ 8.760286] __scsi_scan_target+0x109/0x230
> > [ 8.760288] scsi_scan_channel+0x65/0xc0
> > [ 8.760290] scsi_scan_host_selected+0xff/0x140
> > [ 8.760292] do_scsi_scan_host+0xa7/0xc0
> > [ 8.760293] do_scan_async+0x1c/0x160
> > [ 8.760295] async_run_entry_fn+0x32/0x150
> > [ 8.760299] process_one_work+0x224/0x5f0
> > [ 8.760302] worker_thread+0x1d4/0x3e0
> > [ 8.760304] kthread+0x10b/0x260
> > [ 8.760306] ret_from_fork+0x44/0x70
> > [ 8.760309] ret_from_fork_asm+0x1a/0x30
> > [ 8.760312]
> > -> #1 (fs_reclaim){+.+.}-{0:0}:
> > [ 8.760315] fs_reclaim_acquire+0xc5/0x100
> > [ 8.760317] kmem_cache_alloc_lru_noprof+0x4a/0x480
> > [ 8.760319] alloc_inode+0xaa/0xe0
> > [ 8.760322] iget_locked+0x157/0x330
> > [ 8.760323] kernfs_get_inode+0x1b/0x110
> > [ 8.760325] kernfs_get_tree+0x1b0/0x2e0
> > [ 8.760327] sysfs_get_tree+0x1f/0x60
> > [ 8.760329] vfs_get_tree+0x2a/0xf0
> > [ 8.760332] path_mount+0x4cd/0xc00
> > [ 8.760334] __x64_sys_mount+0x119/0x150
> > [ 8.760336] x64_sys_call+0x14f2/0x2310
> > [ 8.760338] do_syscall_64+0x91/0x180
> > [ 8.760340] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 8.760342]
> > -> #0 (&root->kernfs_rwsem){++++}-{3:3}:
> > [ 8.760345] __lock_acquire+0x1525/0x2760
> > [ 8.760347] lock_acquire+0xca/0x310
> > [ 8.760348] down_write+0x3e/0xf0
> > [ 8.760350] kernfs_remove+0x32/0x60
> > [ 8.760351] sysfs_remove_dir+0x4f/0x60
> > [ 8.760353] __kobject_del+0x33/0xa0
> > [ 8.760355] kobject_del+0x13/0x30
> > [ 8.760356] elv_unregister_queue+0x52/0x80
> > [ 8.760358] elevator_switch+0x68/0x360
> > [ 8.760360] elv_iosched_store+0x14b/0x1b0
> > [ 8.760362] queue_attr_store+0x181/0x1e0
> > [ 8.760364] sysfs_kf_write+0x49/0x80
> > [ 8.760366] kernfs_fop_write_iter+0x17d/0x250
> > [ 8.760367] vfs_write+0x281/0x540
> > [ 8.760370] ksys_write+0x72/0xf0
> > [ 8.760372] __x64_sys_write+0x19/0x30
> > [ 8.760374] x64_sys_call+0x2a3/0x2310
> > [ 8.760376] do_syscall_64+0x91/0x180
> > [ 8.760377] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 8.760380]
> > other info that might help us debug this:
> >
> > [ 8.760380] Chain exists of:
> > &root->kernfs_rwsem --> fs_reclaim --> &q->q_usage_counter(io)#27
> >
> > [ 8.760384] Possible unsafe locking scenario:
> >
> > [ 8.760384] CPU0 CPU1
> > [ 8.760385] ---- ----
> > [ 8.760385] lock(&q->q_usage_counter(io)#27);
> > [ 8.760387] lock(fs_reclaim);
> > [ 8.760388] lock(&q->q_usage_counter(io)#27);
> > [ 8.760390] lock(&root->kernfs_rwsem);
> > [ 8.760391]
> > *** DEADLOCK ***
> >
> > [ 8.760391] 6 locks held by (udev-worker)/674:
> > [ 8.760392] #0: ffff8881209ac420 (sb_writers#4){.+.+}-{0:0},
> > at: ksys_write+0x72/0xf0
> > [ 8.760398] #1: ffff88810c80f488 (&of->mutex#2){+.+.}-{3:3},
> > at: kernfs_fop_write_iter+0x136/0x250
> > [ 8.760402] #2: ffff888125d1d330 (kn->active#101){.+.+}-{0:0},
> > at: kernfs_fop_write_iter+0x13f/0x250
> > [ 8.760406] #3: ffff888110dc7bb0 (&q->sysfs_lock){+.+.}-{3:3},
> > at: queue_attr_store+0x148/0x1e0
> > [ 8.760411] #4: ffff888110dc7680 (&q->q_usage_counter(io)#27){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x12/0x30
> > [ 8.760416] #5: ffff888110dc76b8 (&q->q_usage_counter(queue)#27){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x12/0x30
> > [ 8.760421]
> > stack backtrace:
> > [ 8.760422] CPU: 7 UID: 0 PID: 674 Comm: (udev-worker) Tainted: G U 6.14.0-rc6-xe+ #7
> > [ 8.760424] Tainted: [U]=USER
> > [ 8.760425] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023
> > [ 8.760426] Call Trace:
> > [ 8.760427] <TASK>
> > [ 8.760428] dump_stack_lvl+0x6e/0xa0
> > [ 8.760431] dump_stack+0x10/0x18
> > [ 8.760433] print_circular_bug.cold+0x17a/0x1b7
> > [ 8.760437] check_noncircular+0x13a/0x150
> > [ 8.760441] ? save_trace+0x54/0x360
> > [ 8.760445] __lock_acquire+0x1525/0x2760
> > [ 8.760446] ? irqentry_exit+0x3a/0xb0
> > [ 8.760448] ? sysvec_apic_timer_interrupt+0x57/0xc0
> > [ 8.760452] lock_acquire+0xca/0x310
> > [ 8.760453] ? kernfs_remove+0x32/0x60
> > [ 8.760457] down_write+0x3e/0xf0
> > [ 8.760459] ? kernfs_remove+0x32/0x60
> > [ 8.760460] kernfs_remove+0x32/0x60
> > [ 8.760462] sysfs_remove_dir+0x4f/0x60
> > [ 8.760464] __kobject_del+0x33/0xa0
> > [ 8.760466] kobject_del+0x13/0x30
> > [ 8.760467] elv_unregister_queue+0x52/0x80
> > [ 8.760470] elevator_switch+0x68/0x360
> > [ 8.760472] elv_iosched_store+0x14b/0x1b0
> > [ 8.760475] queue_attr_store+0x181/0x1e0
> > [ 8.760479] ? lock_acquire+0xca/0x310
> > [ 8.760480] ? kernfs_fop_write_iter+0x13f/0x250
> > [ 8.760482] ? lock_is_held_type+0xa3/0x130
> > [ 8.760485] sysfs_kf_write+0x49/0x80
> > [ 8.760487] kernfs_fop_write_iter+0x17d/0x250
> > [ 8.760489] vfs_write+0x281/0x540
> > [ 8.760494] ksys_write+0x72/0xf0
> > [ 8.760497] __x64_sys_write+0x19/0x30
> > [ 8.760499] x64_sys_call+0x2a3/0x2310
> > [ 8.760502] do_syscall_64+0x91/0x180
> > [ 8.760504] ? trace_hardirqs_off+0x5d/0xe0
> > [ 8.760506] ? handle_softirqs+0x479/0x4d0
> > [ 8.760508] ? hrtimer_interrupt+0x13f/0x280
> > [ 8.760511] ? irqentry_exit_to_user_mode+0x8b/0x260
> > [ 8.760513] ? clear_bhb_loop+0x15/0x70
> > [ 8.760515] ? clear_bhb_loop+0x15/0x70
> > [ 8.760516] ? clear_bhb_loop+0x15/0x70
> > [ 8.760518] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 8.760520] RIP: 0033:0x7aa3bf2f5504
> > [ 8.760522] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f
> > 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 8b 10 00 00 74 13 b8 01
> > 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48
> > 83 ec 20 48 89
> > [ 8.760523] RSP: 002b:00007ffc1e3697d8 EFLAGS: 00000202
> > ORIG_RAX: 0000000000000001
> > [ 8.760526] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
> > 00007aa3bf2f5504
> > [ 8.760527] RDX: 0000000000000003 RSI: 00007ffc1e369ae0 RDI:
> > 000000000000001c
> > [ 8.760528] RBP: 00007ffc1e369800 R08: 00007aa3bf3f51c8 R09:
> > 00007ffc1e3698b0
> > [ 8.760528] R10: 0000000000000000 R11: 0000000000000202 R12:
> > 0000000000000003
> > [ 8.760529] R13: 00007ffc1e369ae0 R14: 0000613ccf21f2f0 R15:
> > 00007aa3bf3f4e80
> > [ 8.760533] </TASK>
> >
> > Cc: Jens Axboe <axboe@xxxxxxxxx>
> > Cc: linux-block@xxxxxxxxxxxxxxx
> > Cc: linux-kernel@xxxxxxxxxxxxxxx
> > Cc: Ming Lei <ming.lei@xxxxxxxxxx>
> > Signed-off-by: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
> > ---
> > block/blk-core.c | 6 ++++++
> > 1 file changed, 6 insertions(+)
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index d6c4fa3943b5..4aa439309406 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -455,6 +455,12 @@ struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id)
> > 	lockdep_init_map(&q->q_lockdep_map, "&q->q_usage_counter(queue)",
> > 			 &q->q_lock_cls_key, 0);
> >
> > + /* Prime io_lockdep_map as reclaim tainted */
>
> Maybe it is more readable to change to:
>
> /* Teach lockdep about lock ordering wrt. queue freeze lock */
OK, I'll respin this as a non-RFC patch with the comment updated.
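
Something like this, i.e. the same hunk with your comment swapped in
(untested sketch):

	/* Teach lockdep about lock ordering wrt. queue freeze lock */
	fs_reclaim_acquire(GFP_KERNEL);
	rwsem_acquire_read(&q->io_lockdep_map, 0, 0, _RET_IP_);
	rwsem_release(&q->io_lockdep_map, _RET_IP_);
	fs_reclaim_release(GFP_KERNEL);
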
>
> > + fs_reclaim_acquire(GFP_KERNEL);
> > + rwsem_acquire_read(&q->io_lockdep_map, 0, 0, _RET_IP_);
> > + rwsem_release(&q->io_lockdep_map, _RET_IP_);
> > + fs_reclaim_release(GFP_KERNEL);
>
> Looks fine, especially since this approach has been used widely in-tree:
>
> Reviewed-by: Ming Lei <ming.lei@xxxxxxxxxx>
One caveat, though: if this is merged, people will probably start
seeing the lockdep splat as a regression, and bisects will point to
this commit, demanding that the potential deadlock(s) be fixed.

Is there a line of sight to fixing the real deadlock?
Thanks,
Thomas
>
>
> Thanks,
> Ming
>