Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs

From: Bart Van Assche
Date: Wed Dec 23 2020 - 10:48:12 EST


On 12/23/20 3:40 AM, John Garry wrote:
> Sorry, I got the 2x iter functions mixed up.
>
> So if we use mutex to solve blk_mq_queue_tag_busy_iter() problem, then we
> still have this issue in blk_mq_tagset_busy_iter() which I report previously
> [0]:
>
> [  319.771745] BUG: KASAN: use-after-free in bt_tags_iter+0xe0/0x128
> [  319.777832] Read of size 4 at addr ffff0010b6bd27cc by task more/1866
> [  319.784262]
> [  319.785753] CPU: 61 PID: 1866 Comm: more Tainted: G        W
> 5.10.0-rc4-18118-gaa7b9c30d8ff #1070
> [  319.795312] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon
> D05 IT21 Nemo 2.0 RC0 04/18/2018
> [  319.804437] Call trace:
> [  319.806892]  dump_backtrace+0x0/0x2d0
> [  319.810552]  show_stack+0x18/0x68
> [  319.813865]  dump_stack+0x100/0x16c
> [  319.817348]  print_address_description.constprop.12+0x6c/0x4e8
> [  319.823176]  kasan_report+0x130/0x200
> [  319.826831]  __asan_load4+0x9c/0xd8
> [  319.830315]  bt_tags_iter+0xe0/0x128
> [  319.833884]  __blk_mq_all_tag_iter+0x320/0x3a8
> [  319.838320]  blk_mq_tagset_busy_iter+0x8c/0xd8
> [  319.842760]  scsi_host_busy+0x88/0xb8
> [  319.846418]  show_host_busy+0x1c/0x48
> [  319.850079]  dev_attr_show+0x44/0x90
> [  319.853655]  sysfs_kf_seq_show+0x128/0x1c8
> [  319.857744]  kernfs_seq_show+0xa0/0xb8
> [  319.861489]  seq_read_iter+0x1ec/0x6a0
> [  319.865230]  seq_read+0x1d0/0x250
> [  319.868539]  kernfs_fop_read+0x70/0x330
> [  319.872369]  vfs_read+0xe4/0x250
> [  319.875590]  ksys_read+0xc8/0x178
> [  319.878898]  __arm64_sys_read+0x44/0x58
> [  319.882730]  el0_svc_common.constprop.2+0xc4/0x1e8
> [  319.887515]  do_el0_svc+0x90/0xa0
> [  319.890824]  el0_sync_handler+0x128/0x178
> [  319.894825]  el0_sync+0x158/0x180
> [  319.898131]
> [  319.899614] The buggy address belongs to the page:
> [  319.904403] page:000000004e9e6864 refcount:0 mapcount:0
> mapping:0000000000000000 index:0x0 pfn:0x10b6bd2
> [  319.913876] flags: 0xbfffc0000000000()
> [  319.917626] raw: 0bfffc0000000000 0000000000000000 fffffe0000000000
> 0000000000000000
> [  319.925363] raw: 0000000000000000 0000000000000000 00000000ffffffff
> 0000000000000000
> [  319.933096] page dumped because: kasan: bad access detected
> [  319.938658]
> [  319.940141] Memory state around the buggy address:
> [  319.944925]  ffff0010b6bd2680: ff ff ff ff ff ff ff ff ff ff ff ff ff
> ff ff ff
> [  319.952139]  ffff0010b6bd2700: ff ff ff ff ff ff ff ff ff ff ff ff ff
> ff ff ff
> [  319.959354] >ffff0010b6bd2780: ff ff ff ff ff ff ff ff ff ff ff ff ff
> ff ff ff
> [  319.966566] ^
> [  319.972131]  ffff0010b6bd2800: ff ff ff ff ff ff ff ff ff ff ff ff ff
> ff ff ff
> [  319.979344]  ffff0010b6bd2880: ff ff ff ff ff ff ff ff ff ff ff ff ff
> ff ff ff
> [  319.986557]
> ==================================================================
> [  319.993770] Disabling lock debugging due to kernel taint
>
> So to trigger this, I start fio on a disk, and then have one script
> which constantly enables and disables an IO scheduler for that disk, and
> another script which constantly reads /sys/class/scsi_host/host0/host_busy .
>
> And in this problem, the driver tag we iterate may point to a stale IO sched
> request.

Hi John,

I propose to change the order in which blk_mq_sched_free_requests(q) and
blk_mq_debugfs_unregister(q) are called. Today blk_mq_sched_free_requests(q)
is called by blk_cleanup_queue() before blk_put_queue() is called.
blk_put_queue() calls blk_release_queue() if the last reference is dropped.
blk_release_queue() calls blk_mq_debugfs_unregister(). I prefer removing the
debugfs attributes earlier over modifying the tag iteration functions
because I think removing the debugfs attributes earlier is less risky.
Although this will make it harder to debug lockups that happen while
removing a request queue, kernel developers who are analyzing such an issue
can undo this change in their development kernel tree.

Thanks,

Bart.