"The error_recovery work should unquiesce the admin_q, which should fail
fast all pending admin commands, so it is unclear to me how the connect
process gets stuck. The error recovery also cancels all pending requests.
See nvme_cancel_admin_tagset."

On 28/04/2024 12:16, Wangbing Kuang wrote:
I think the reason is: the command can be unquiesced, but its tag cannot
be returned until the command completes. nvme_cancel_admin_tagset can
cancel requests issued before the admin queue is stopped, but it cannot
cancel requests that arrive before the next reconnect attempt.
The timeline is:
recovery fails (we can reproduce this by hanging IO for longer)
-> reconnect delay
-> multiple "nvme list" commands are issued (using up the admin tagset)
-> reconnect starts (waits for a tag when calling nvme_enable_ctrl and
   nvme_wait_ready)
"What is step (2) - make nvme io timeout to recover the connection?"

I use an spdk-nvmf target as the backend. It is easy to make the nvmf
target hang and unhang reads/writes, so I hang the IO for over 30
seconds, which makes the Linux nvmf host hit an IO timeout; the IO
timeout then triggers connection recovery.
By the way, I use multipath=0.

"Interesting, does this happen with multipath=Y?"

Not certain, I did not test with multipath=Y. We chose multipath=0
because it is less code and we need only one path.
I didn't expect people to be using multipath=0 for fabrics in the past few
years.
"Is this reproducing with upstream nvme? Or is this some distro kernel
where this happens?"

It is reproduced on a kernel based on v5.15, but I think this is a
common error.

"It would be beneficial to verify this."

OK, testing needs more time, but we can first verify it only on v5.15.