Re: [PATCH] blk-mq: use mutex_trylock to avoid lock inversion

From: jianchao.wang
Date: Tue Jun 19 2018 - 22:09:22 EST


Hi Bart

On 06/19/2018 11:20 PM, Bart Van Assche wrote:
> On Tue, 2018-06-19 at 15:00 +0800, Jianchao Wang wrote:
>> Currently, the kobject_del for kobjs of mq, hctx and ctx is invoked
>> under sysfs_lock, lock inversion will come up when other one is
>> acessing the associated sysfs file and trying to acquire the
>> sysfs_lock. To fix it, use mutex_trylock in blk_mq_sysfs_ops and
>> blk_mq_hw_sysfs_ops, if the lock in on contending, return -EAGAIN.
>
> Is this a theoretical issue or something you actually ran into? Which lock
> other than sysfs_lock do you think is involved in the lock inversion?
>

It is very easy to reproduce with following scripts.

script 0
while true
do
modprobe null_blk queue_mode=2 shared_tags=1
sleep 0.1
rmmod null_blk
sleep 0.1
done

script 1
file0="/sys/block/nullb0/mq/0/nr_tags"
file1="/sys/block/nullb0/mq/0/cpu0/rq_list"
while true;
do
if [ -e $file0 ];then
cat $file0
fi
if [ -e $file1 ];then
cat $file1
fi
done

Here is the hung task log:

[ 246.752087] INFO: task rmmod:12789 blocked for more than 30 seconds.
[ 246.752801] Not tainted 4.18.0-rc1 #88
[ 246.753458] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 246.754192] rmmod D 0 12789 3142 0x00000080
[ 246.754951] Call Trace:
[ 246.755715] ? __schedule+0x3f9/0xae0
[ 246.756546] schedule+0x3c/0x90
[ 246.757440] __kernfs_remove+0x1d0/0x2b0
[ 246.757644] ? wait_woken+0xb0/0xb0
[ 246.757850] kernfs_remove+0x1f/0x30
[ 246.758059] kobject_del+0x13/0x40
[ 246.758271] blk_mq_unregister_dev+0x4f/0xb0
[ 246.758488] blk_unregister_queue+0x71/0x100
[ 246.758709] del_gendisk+0x139/0x280
[ 246.758936] null_del_dev+0x40/0xf0 [null_blk]
[ 246.759165] null_exit+0x50/0xbec [null_blk]
[ 246.759397] __x64_sys_delete_module+0x12e/0x1d0
[ 246.759636] do_syscall_64+0x5a/0x1a0
[ 246.759876] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 246.760129] RIP: 0033:0x7fc518522927
[ 246.760481] Code: Bad RIP value.
[ 246.760736] RSP: 002b:00007ffee4c69b68 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 246.761005] RAX: ffffffffffffffda RBX: 00007ffee4c69bc8 RCX: 00007fc518522927
[ 246.761309] RDX: 000000000000000a RSI: 0000000000000800 RDI: 000055783881b248
[ 246.761612] RBP: 000055783881b1e0 R08: 0000000000000000 R09: 1999999999999999
[ 246.761902] R10: 0000000000000000 R11: 0000000000000206 R12: 00007ffee4c69d90
[ 246.762199] R13: 00007ffee4c6b774 R14: 000055783881a010 R15: 000055783881b1e0
[ 246.762503] INFO: task cat:12790 blocked for more than 30 seconds.
[ 246.762812] Not tainted 4.18.0-rc1 #88
[ 246.763124] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 246.763453] cat D 0 12790 3141 0x00000080
[ 246.763789] Call Trace:
[ 246.764130] ? __schedule+0x3f9/0xae0
[ 246.764552] schedule+0x3c/0x90
[ 246.764895] schedule_preempt_disabled+0x14/0x20
[ 246.765244] __mutex_lock+0x41c/0x990
[ 246.765595] ? blk_mq_hw_sysfs_show+0x35/0x80
[ 246.765950] ? preempt_count_sub+0x92/0xd0
[ 246.766311] ? blk_mq_hw_sysfs_show+0x35/0x80
[ 246.766675] blk_mq_hw_sysfs_show+0x35/0x80
[ 246.767043] sysfs_kf_seq_show+0xad/0x100
[ 246.767416] seq_read+0xa5/0x410
[ 246.767790] __vfs_read+0x23/0x160
[ 246.768172] vfs_read+0xa0/0x140
[ 246.768627] ksys_read+0x45/0xa0
[ 246.769008] do_syscall_64+0x5a/0x1a0
[ 246.769391] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 246.769798] RIP: 0033:0x7f743e39e260
[ 246.770216] Code: Bad RIP value.

Thanks
Jianchao
>
>
>