Re: [PATCH] nvme: unquiesce the queue before cleaning it up

From: jianchao.wang
Date: Fri Apr 27 2018 - 04:52:04 EST


Hi Max

On 04/26/2018 06:23 PM, Max Gurtovoy wrote:
> Hi Jianchao,
> I actually tried this scenario with real HW and was able to repro the hang.
> Unfortunately, after applying your patch I got a NULL deref:
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
> WARNING: CPU: 5 PID: 0 at block/blk-mq.c:526 __blk_mq_complete_request+0x154/0x160
>

Could you please share the whole backtrace here?

Thanks
Jianchao

> I guess it's the same issue that IsraelR/Bart/Ming are trying to fix in the block layer regarding completing requests.
> I'll add IsraelR's proposed fix to nvme-rdma that is currently on hold and see what happens.
> Nonetheless, I don't like the situation that the reset and delete flows can run concurrently.
>
> -Max.
>
> On 4/26/2018 11:27 AM, jianchao.wang wrote:
>> Hi Max
>>
>> I did a similar test on nvme-rdma; the underlying fabric is soft-RoCE.
>> I ran an io loop, a reset controller loop and a delete/create controller loop at the same time,
>> and found the io hang below:
>>
>> [  230.884590] WARNING: CPU: 0 PID: 150 at /home/will/u04/source_code/linux-stable/drivers/nvme/host/rdma.c:1755 nvme_rdma_reset_ctrl_work+0x3d/0xb0 [nvme_rdma]
>> [  230.884689] CPU: 0 PID: 150 Comm: kworker/u16:3 Not tainted 4.17.0-rc1+ #16
>> [  230.884690] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
>> [  230.884693] Workqueue: nvme-reset-wq nvme_rdma_reset_ctrl_work [nvme_rdma]
>> [  230.884696] RIP: 0010:nvme_rdma_reset_ctrl_work+0x3d/0xb0 [nvme_rdma]
>> [  230.884697] RSP: 0018:ffffa9e30255be40 EFLAGS: 00010246
>> [  230.884699] RAX: 0000000000000000 RBX: ffff9633b64c2000 RCX: 0000000000000000
>> [  230.884700] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffff9ac05516
>> [  230.884701] RBP: ffff9633b64c23b8 R08: 0000000000000001 R09: 0000000000000000
>> [  230.884702] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9633b64c2970
>> [  230.884703] R13: 0000000000000000 R14: 0ffff96340ffcc80 R15: ffff9634102af9c0
>> [  230.884705] FS:  0000000000000000(0000) GS:ffff963422c00000(0000) knlGS:0000000000000000
>> [  230.884706] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  230.884707] CR2: 0000000001ce9ca4 CR3: 000000038040f005 CR4: 00000000003606f0
>> [  230.884708] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [  230.884709] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [  230.884710] Call Trace:
>> [  230.884717]  process_one_work+0x1c0/0x680
>> [  230.884722]  worker_thread+0x22b/0x430
>> [  230.884728]  kthread+0x100/0x140
>> [  230.884730]  ? rescuer_thread+0x370/0x370
>> [  230.884731]  ? kthread_delayed_work_timer_fn+0x80/0x80
>> [  230.884736]  ret_from_fork+0x24/0x30
>> [  230.884744] Code: ff 48 8d ab b8 03 00 00 48 89 ef e8 1e 25 d7 d9 31 f6 48 89 df e8 44 f9 ff ff be 04 00 00 00 48 89 ef e8 b7 13 d7 d9 84 c0 75 07 <0f> 0b 5b 5d 41 5c c3 31 f6 48 89 df e8 b2 fa ff ff 85 c0 75 39
>> [  230.884799] irq event stamp: 88150
>> [  230.884802] hardirqs last enabled at (88149): [<ffffffff9ac05529>] _raw_spin_unlock_irqrestore+0x59/0x70
>> [  230.884803] hardirqs last disabled at (88150): [<ffffffff9ae0116c>] error_entry+0x6c/0xc0
>> [  230.884805] softirqs last enabled at (87998): [<ffffffff9aa603a0>] sk_common_release+0x60/0xd0
>> [  230.884807] softirqs last disabled at (87996): [<ffffffff9aa60375>] sk_common_release+0x35/0xd0
>> [  230.884808] ---[ end trace c03fac253b80d77d ]---
>> [  277.472094] INFO: task kworker/u16:0:6 blocked for more than 30 seconds.
>> [  277.473482]       Tainted: G        W         4.17.0-rc1+ #16
>> [  277.474825] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [  277.476293] kworker/u16:0   D    0     6      2 0x80000000
>> [  277.476317] Workqueue: nvme-delete-wq nvme_delete_ctrl_work
>> [  277.476335] Call Trace:
>> [  277.476365]  ? __schedule+0x3de/0xac0
>> [  277.476393]  schedule+0x3c/0x90
>> [  277.476405]  blk_mq_freeze_queue_wait+0x44/0x90
>> [  277.476419]  ? wait_woken+0x90/0x90
>> [  277.476437]  blk_cleanup_queue+0xe1/0x280
>> [  277.476453]  nvme_ns_remove+0x1c8/0x260
>> [  277.476475]  nvme_remove_namespaces+0x7f/0xa0
>> [  277.476495]  nvme_delete_ctrl_work+0x4b/0x80
>> [  277.476508]  process_one_work+0x1c0/0x680
>> [  277.476535]  worker_thread+0x22b/0x430
>> [  277.476562]  kthread+0x100/0x140
>> [  277.476572]  ? rescuer_thread+0x370/0x370
>> [  277.476579]  ? kthread_delayed_work_timer_fn+0x80/0x80
>> [  277.476598]  ret_from_fork+0x24/0x30
>>
>> 00000000ea33345b {.op=READ, .cmd_flags=23, .rq_flags=MQ_INFLIGHT|IO_STAT, .state=idle, .tag=19, .internal_tag=-1}
>> 0000000071218c5a {.op=READ, .cmd_flags=23, .rq_flags=MQ_INFLIGHT|IO_STAT, .state=idle, .tag=45, .internal_tag=-1}
>> root@will-ThinkCentre-M910s:/sys/kernel/debug/block/nvme1c1n1# cat state
>> DYING|BYPASS|NOMERGES|SAME_COMP|NONROT|IO_STAT|DISCARD|NOXMERGES|INIT_DONE|NO_SG_MERGE|POLL|WC|FUA|STATS|QUIESCED
>>
>> This is due to the following race:
>>
>> nvme_delete_ctrl                        nvme_rdma_reset_ctrl_work
>>   -> change state to DELETING
>>   -> queue delete_work
>>                                           -> nvme_rdma_shutdown_ctrl
>>                                             -> nvme_stop_queues
>>                                               -> blk_mq_quiesce_queue
>>                                           -> change to LIVE state fails
>>                                           -> return
>> nvme_delete_ctrl_work
>>   -> nvme_remove_namespaces
>>     -> nvme_ns_remove
>>       -> blk_cleanup_queue
>>         -> blk_freeze_queue
>>
>> This patch could also fix this issue.
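>>
>> To make the hang mechanism explicit, here is a minimal sketch of the
>> interaction, assuming the 4.17-era block layer (simplified pseudocode,
>> not the actual kernel source):
>>
>>     /* reset work leaves the queue quiesced and bails out */
>>     blk_mq_quiesce_queue(ns->queue);  /* QUIESCED set, dispatch stopped */
>>     /* ... change to LIVE state fails, reset work returns ... */
>>
>>     /* delete work then tears the namespace down */
>>     blk_cleanup_queue(ns->queue);
>>         /* -> blk_freeze_queue() waits for q->q_usage_counter to drain,
>>          * but the requests parked on the ctx rq_lists can never be
>>          * dispatched and completed while the queue stays quiesced, so
>>          * blk_mq_freeze_queue_wait() sleeps forever.
>>          */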
>>
>> Thanks
>> Jianchao
>>
>> On 04/22/2018 11:00 PM, jianchao.wang wrote:
>>> Hi Max
>>>
>>> That's really appreciated!
>>> Here are my test scripts.
>>>
>>> loop_reset_controller.sh
>>> #!/bin/bash
>>> while true
>>> do
>>>     echo 1 > /sys/block/nvme0n1/device/reset_controller
>>>     sleep 1
>>> done
>>>
>>> loop_unbind_driver.sh
>>> #!/bin/bash
>>> while true
>>> do
>>> ÂÂÂÂecho "0000:02:00.0" > /sys/bus/pci/drivers/nvme/unbind
>>> ÂÂÂÂsleep 2
>>> ÂÂÂÂecho "0000:02:00.0" > /sys/bus/pci/drivers/nvme/bind
>>> ÂÂÂÂsleep 2
>>> done
>>>
>>> loop_io.sh
>>> #!/bin/bash
>>>
>>> file="/dev/nvme0n1"
>>> echo $file
>>> while true;
>>> do
>>>     if [ -e $file ]; then
>>>         fio fio_job_rand_read.ini
>>>     else
>>>         echo "Not found"
>>>         sleep 1
>>>     fi
>>> done
>>>
>>> The fio job file is as below:
>>> size=512m
>>> rw=randread
>>> bs=4k
>>> ioengine=libaio
>>> iodepth=64
>>> direct=1
>>> numjobs=16
>>> filename=/dev/nvme0n1
>>> group_reporting
>>>
>>> I started, in sequence, loop_io.sh, loop_reset_controller.sh and loop_unbind_driver.sh.
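>>>
>>> For convenience, a small runner can start them; a sketch, assuming the
>>> three scripts above are executable in the current directory:
>>>
>>> #!/bin/bash
>>> # start the three loops in the order described above
>>> ./loop_io.sh &
>>> sleep 1
>>> ./loop_reset_controller.sh &
>>> sleep 1
>>> ./loop_unbind_driver.sh &
>>> wait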
>>> And if lucky, I will get an io hang within 3 minutes. ;)
>>> Such as:
>>>
>>> [  142.858074] nvme nvme0: pci function 0000:02:00.0
>>> [  144.972256] nvme nvme0: failed to mark controller state 1
>>> [  144.972289] nvme nvme0: Removing after probe failure status: 0
>>> [  185.312344] INFO: task bash:1673 blocked for more than 30 seconds.
>>> [  185.312889]       Not tainted 4.17.0-rc1+ #6
>>> [  185.312950] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [  185.313049] bash            D    0  1673   1629 0x00000080
>>> [  185.313061] Call Trace:
>>> [  185.313083]  ? __schedule+0x3de/0xac0
>>> [  185.313103]  schedule+0x3c/0x90
>>> [  185.313111]  blk_mq_freeze_queue_wait+0x44/0x90
>>> [  185.313123]  ? wait_woken+0x90/0x90
>>> [  185.313133]  blk_cleanup_queue+0xe1/0x280
>>> [  185.313145]  nvme_ns_remove+0x1c8/0x260
>>> [  185.313159]  nvme_remove_namespaces+0x7f/0xa0
>>> [  185.313170]  nvme_remove+0x6c/0x130
>>> [  185.313181]  pci_device_remove+0x36/0xb0
>>> [  185.313193]  device_release_driver_internal+0x160/0x230
>>> [  185.313205]  unbind_store+0xfe/0x150
>>> [  185.313219]  kernfs_fop_write+0x114/0x190
>>> [  185.313234]  __vfs_write+0x23/0x150
>>> [  185.313246]  ? rcu_read_lock_sched_held+0x3f/0x70
>>> [  185.313252]  ? preempt_count_sub+0x92/0xd0
>>> [  185.313259]  ? __sb_start_write+0xf8/0x200
>>> [  185.313271]  vfs_write+0xc5/0x1c0
>>> [  185.313284]  ksys_write+0x45/0xa0
>>> [  185.313298]  do_syscall_64+0x5a/0x1a0
>>> [  185.313308]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>
>>> And got the following information from block debugfs:
>>> root@will-ThinkCentre-M910s:/sys/kernel/debug/block/nvme0n1# cat hctx6/cpu6/rq_list
>>> 000000001192d19b {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=69, .internal_tag=-1}
>>> 00000000c33c8a5b {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=78, .internal_tag=-1}
>>> root@will-ThinkCentre-M910s:/sys/kernel/debug/block/nvme0n1# cat state
>>> DYING|BYPASS|NOMERGES|SAME_COMP|NONROT|IO_STAT|DISCARD|NOXMERGES|INIT_DONE|NO_SG_MERGE|POLL|WC|FUA|STATS|QUIESCED
>>>
>>> We can see there were reqs left on the ctx rq_list while the request_queue was QUIESCED.
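>>>
>>> A small helper makes this state easy to check across all hctxs; a sketch,
>>> assuming debugfs is mounted at /sys/kernel/debug:
>>>
>>> #!/bin/bash
>>> # dump every software-queue rq_list plus the queue state for a device
>>> dev=${1:-nvme0n1}
>>> for f in /sys/kernel/debug/block/"$dev"/hctx*/cpu*/rq_list; do
>>>     echo "== $f =="
>>>     cat "$f"
>>> done
>>> cat /sys/kernel/debug/block/"$dev"/state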
>>>
>>> Thanks again !!
>>> Jianchao
>>>
>>> On 04/22/2018 10:48 PM, Max Gurtovoy wrote:
>>>>
>>>>
>>>> On 4/22/2018 5:25 PM, jianchao.wang wrote:
>>>>> Hi Max
>>>>>
>>>>> No, I only tested it on a PCIe one.
>>>>> Sorry that I didn't state that.
>>>>
>>>> Please send your exact test steps and we'll run it using the RDMA transport.
>>>> I also want to run a mini regression on this one since it may affect other flows.
>>>>
>>>>>
>>>>> Thanks
>>>>> Jianchao
>>>>>
>>>>> On 04/22/2018 10:18 PM, Max Gurtovoy wrote:
>>>>>> Hi Jianchao,
>>>>>> Since this patch is in the core, have you tested it using some fabrics drivers too? RDMA/FC?
>>>>>>
>>>>>> thanks,
>>>>>> Max.
>>>>>>
>>>>>> On 4/22/2018 4:32 PM, jianchao.wang wrote:
>>>>>>> Hi Keith
>>>>>>>
>>>>>>> Would you please take a look at this patch?
>>>>>>>
>>>>>>> This issue can be reproduced easily by running a driver bind/unbind loop,
>>>>>>> a reset loop and an IO loop at the same time.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Jianchao
>>>>>>>
>>>>>>> On 04/19/2018 04:29 PM, Jianchao Wang wrote:
>>>>>>>> There is a race between nvme_remove and nvme_reset_work that can
>>>>>>>> lead to an io hang.
>>>>>>>>
>>>>>>>> nvme_remove                     nvme_reset_work
>>>>>>>> -> change state to DELETING
>>>>>>>>                                   -> fail to change state to LIVE
>>>>>>>>                                   -> nvme_remove_dead_ctrl
>>>>>>>>                                     -> nvme_dev_disable
>>>>>>>>                                       -> quiesce request_queue
>>>>>>>>                                     -> queue remove_work
>>>>>>>> -> cancel_work_sync reset_work
>>>>>>>> -> nvme_remove_namespaces
>>>>>>>>   -> splice ctrl->namespaces
>>>>>>>>                                 nvme_remove_dead_ctrl_work
>>>>>>>>                                   -> nvme_kill_queues
>>>>>>>>   -> nvme_ns_remove                  do nothing
>>>>>>>>     -> blk_cleanup_queue
>>>>>>>>       -> blk_freeze_queue
>>>>>>>> Finally, the request_queue is still in the quiesced state when we wait for
>>>>>>>> the freeze, so we get an io hang here.
>>>>>>>>
>>>>>>>> To fix it, unquiesce the request_queue directly before nvme_ns_remove.
>>>>>>>> We have spliced the ctrl->namespaces list, so nobody can access the
>>>>>>>> namespaces and quiesce their queues any more.
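>>>>>>>>
>>>>>>>> For reference, the unquiesce is safe because nvme_stop_queues() only
>>>>>>>> quiesces queues it finds on ctrl->namespaces; a simplified sketch of
>>>>>>>> the 4.17-era helper in drivers/nvme/host/core.c:
>>>>>>>>
>>>>>>>> void nvme_stop_queues(struct nvme_ctrl *ctrl)
>>>>>>>> {
>>>>>>>> 	struct nvme_ns *ns;
>>>>>>>>
>>>>>>>> 	down_read(&ctrl->namespaces_rwsem);
>>>>>>>> 	list_for_each_entry(ns, &ctrl->namespaces, list)
>>>>>>>> 		blk_mq_quiesce_queue(ns->queue);
>>>>>>>> 	up_read(&ctrl->namespaces_rwsem);
>>>>>>>> }
>>>>>>>>
>>>>>>>> Once the list has been spliced onto the local ns_list, this loop sees
>>>>>>>> an empty list and cannot re-quiesce the queues unquiesced below.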
>>>>>>>>
>>>>>>>> Signed-off-by: Jianchao Wang <jianchao.w.wang@xxxxxxxxxx>
>>>>>>>> ---
>>>>>>>>  drivers/nvme/host/core.c | 9 ++++++++-
>>>>>>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>>>>>> index 9df4f71..0e95082 100644
>>>>>>>> --- a/drivers/nvme/host/core.c
>>>>>>>> +++ b/drivers/nvme/host/core.c
>>>>>>>> @@ -3249,8 +3249,15 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>>>>>>>>  	list_splice_init(&ctrl->namespaces, &ns_list);
>>>>>>>>  	up_write(&ctrl->namespaces_rwsem);
>>>>>>>>  
>>>>>>>> -	list_for_each_entry_safe(ns, next, &ns_list, list)
>>>>>>>> +	/*
>>>>>>>> +	 * After splicing the namespaces list from ctrl->namespaces,
>>>>>>>> +	 * nobody could get them anymore; let's unquiesce the request_queue
>>>>>>>> +	 * forcibly to avoid io hang.
>>>>>>>> +	 */
>>>>>>>> +	list_for_each_entry_safe(ns, next, &ns_list, list) {
>>>>>>>> +		blk_mq_unquiesce_queue(ns->queue);
>>>>>>>>  		nvme_ns_remove(ns);
>>>>>>>> +	}
>>>>>>>>  }
>>>>>>>>  EXPORT_SYMBOL_GPL(nvme_remove_namespaces);
>>>>>>>>  
>>>>>>>
>