Re: [Bug Report] nvme connect deadlock in allocating tag

From: Sagi Grimberg
Date: Sun Apr 28 2024 - 04:10:59 EST

Next message: Ingo Molnar: "Re: [tip: sched/urgent] sched/isolation: Fix boot crash when maxcpus < first housekeeping CPU"
Previous message: David Rientjes: "Re: [PATCH] mm/slub: mark racy access on slab->freelist"
In reply to: kwb: "[Bug Report] nvme connect deadlock in allocating tag"
Next in thread: Sagi Grimberg: "Re: [Bug Report] nvme connect deadlock in allocating tag"

On 28/04/2024 9:31, kwb wrote:

Hi,
We found that nvme connect deadlocks when it cannot allocate a tag in the admin queue. We reproduced it and found a workaround: use a reserved tag for connecting.
Here is the deadlock environment:
1. The process [kworker/u129:1+nvme-wq], which wants to connect, is waiting to get a tag, but the tags are used up:
[<0>] blk_mq_get_tag+0x11d/0x2d0
[<0>] __blk_mq_alloc_request+0x92/0x180
[<0>] blk_mq_alloc_request+0x7c/0xc0
[<0>] nvme_alloc_request+0x28/0x100 [nvme_core]
[<0>] __nvme_submit_sync_cmd+0x1ea/0x230 [nvme_core]
[<0>] nvmf_reg_read64+0x62/0xa0 [nvme_fabrics]
[<0>] nvme_enable_ctrl+0x25/0xb0 [nvme_core]
[<0>] nvme_tcp_setup_ctrl+0x257/0x340 [nvme_tcp]
[<0>] nvme_tcp_reconnect_ctrl_work+0x24/0x40 [nvme_tcp]
[<0>] process_one_work+0x228/0x3d0
[<0>] worker_thread+0x4d/0x3f0
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x1f/0x30
2. Many processes (here, nvme list) are waiting for the connection:
[<0>] blk_execute_rq+0x8d/0x110
[<0>] nvme_execute_passthru_rq+0x60/0x1f0 [nvme_core]
[<0>] nvme_submit_user_cmd+0x23e/0x400 [nvme_core]
[<0>] nvme_user_cmd+0x163/0x1d0 [nvme_core]
[<0>] nvme_ctrl_ioctl+0x2e/0x40 [nvme_core]
[<0>] __nvme_ioctl+0x78/0xc0 [nvme_core]
[<0>] nvme_ioctl+0x1e/0x20 [nvme_core]
[<0>] blkdev_ioctl+0x126/0x260
[<0>] block_ioctl+0x4a/0x60
[<0>] __x64_sys_ioctl+0x91/0xc0
[<0>] do_syscall_64+0x59/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae

Reproducing it is very easy:
1. Run many nvme list commands
2. Make nvme I/O time out so that the connection is recovered
3. The trick is to set a long reconnect delay, e.g. 30s

The solution is the attached patch. It is tested, and it also takes into account reserving tags for keepalive and reset/shutdown.
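
As a rough illustration of the reserved-tag idea, here is a toy userspace model of a tag set with a small reserved range. It is not the attached patch and not kernel code: the names toy_tagset, toy_get_tag, toy_put_tag and toy_exhaust_normal, and the sizes 8 and 2, are made up for this sketch. In blk-mq the corresponding request would be allocated with BLK_MQ_REQ_RESERVED.

```c
#include <stdbool.h>

/* Toy userspace model, not kernel code. All names and sizes are hypothetical. */
enum { TOY_NR_TAGS = 8, TOY_NR_RESERVED = 2 };

struct toy_tagset {
	/* Tags [0, TOY_NR_RESERVED) form the reserved range. */
	bool busy[TOY_NR_TAGS];
};

/*
 * Take a free tag from the reserved or the normal range.
 * Returns the tag number, or -1 where the kernel would put the
 * caller to sleep waiting for a tag.
 */
static int toy_get_tag(struct toy_tagset *ts, bool reserved)
{
	int lo = reserved ? 0 : TOY_NR_RESERVED;
	int hi = reserved ? TOY_NR_RESERVED : TOY_NR_TAGS;

	for (int i = lo; i < hi; i++) {
		if (!ts->busy[i]) {
			ts->busy[i] = true;
			return i;
		}
	}
	return -1;
}

/* Give a tag back to the set; out-of-range values are ignored. */
static void toy_put_tag(struct toy_tagset *ts, int tag)
{
	if (tag >= 0 && tag < TOY_NR_TAGS)
		ts->busy[tag] = false;
}

/*
 * Hold every normal tag, standing in for many blocked "nvme list"
 * passthru commands. Returns how many tags were taken.
 */
static int toy_exhaust_normal(struct toy_tagset *ts)
{
	int taken = 0;

	while (toy_get_tag(ts, false) >= 0)
		taken++;
	return taken;
}
```

With all six normal tags held by blocked callers, toy_get_tag(&ts, false) returns -1, which is where the kernel would sleep in blk_mq_get_tag, while toy_get_tag(&ts, true) still succeeds, so a command issued from the reconnect path can make progress.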

The error_recovery work should unquiesce the admin_q, which should fail fast all pending admin commands,
so it is unclear to me how the connect process gets stuck.
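
To make that expectation concrete, here is a toy userspace model, assuming that commands dispatched while the controller is not live are completed with an error and give their tags back. The sketch_* names and the depth of 4 are hypothetical, and this is not the actual nvme or blk-mq code.

```c
#include <stdbool.h>

/* Toy userspace model, not kernel code. All names and sizes are hypothetical. */
enum { SKETCH_DEPTH = 4 };

struct sketch_admin_q {
	bool quiesced;              /* dispatch is paused while true */
	bool ctrl_live;             /* false while the controller is reconnecting */
	bool pending[SKETCH_DEPTH]; /* slot i holds a tag and waits for dispatch */
	int failed;                 /* commands completed with an error */
};

/*
 * Try to dispatch every pending command.
 * Returns the number of tags released by this pass.
 */
static int sketch_run_queue(struct sketch_admin_q *q)
{
	int released = 0;

	if (q->quiesced)
		return 0; /* nothing is dispatched, so every tag stays held */

	for (int i = 0; i < SKETCH_DEPTH; i++) {
		if (!q->pending[i])
			continue;
		if (!q->ctrl_live) {
			/* Fail fast: complete with an error and free the tag. */
			q->pending[i] = false;
			q->failed++;
			released++;
		}
		/* A live controller would send the command; not modeled here. */
	}
	return released;
}

/* Resume dispatch and run the queue once. */
static int sketch_unquiesce(struct sketch_admin_q *q)
{
	q->quiesced = false;
	return sketch_run_queue(q);
}
```

In this model, sketch_run_queue() releases nothing while the queue is quiesced, and sketch_unquiesce() on a non-live controller fails all four pending commands and returns their tags, which is the behavior I would expect to unblock the tag waiter.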

What is step (2) - make nvme io timeout to recover the connection?

Does this reproduce with upstream nvme, or does it happen only on some distro kernel?
Do you have the below patch applied?
de105068fead ("nvme: fix reconnection fail due to reserved tag allocation")
