Re: [PATCH v4 3/3] nvme-fc: do not ignore connectivity loss during connecting
From: Daniel Wagner
Date: Thu Feb 13 2025 - 04:18:44 EST
On Thu, Feb 13, 2025 at 07:16:31AM +0000, Shinichiro Kawasaki wrote:
> On Jan 09, 2025 / 14:30, Daniel Wagner wrote:
> > When a connectivity loss occurs while nvme_fc_create_assocation is
> > being executed, it's possible that the ctrl ends up stuck in the LIVE
> > state:
> >
> [...]
> >
> > There is already the ASSOC_FAILED flag to track connectivity loss event
> > but this bit is set too late in the recovery code path. Move this into
> > the connectivity loss event handler and synchronize it with the state
> > change. This ensures that the ASSOC_FAILED flag is seen by
> > nvme_fc_create_io_queues and it does not enter the LIVE state after a
> > connectivity loss event. If the connectivity loss event happens after we
> > entered the LIVE state the normal error recovery path is executed.
>
> I found many test cases in blktests nvme group fail with v6.14-rc2 kernel with
> fc transport. I bisected and found this patch, as the commit ee59e3820ca9, is
> the trigger. When I revert the commit from v6.14-rc2, the failure disappears.
>
> Here I share the kernel message log observed at the test case nvme/003. The
> kernel reported "BUG: sleeping function called from invalid context".
Thanks for digging into it.
The problem is that the patch moves the nvme_change_ctrl_state call
under the ctrl->lock and nvme_change_ctrl_state is also
cancel_delayed_work_sync which doesn't work:
spin_lock_irqsave
nvme_change_ctrl_state
nvme_stop_failfast_work
cancel_delayed_work_sync
I've checked if there is another caller of nvme_change_ctrl_state is
taking a lock, all looks fine. If there would be way to move the
failfast code bits out of the nvme_change_ctrl_state function would be a
nice solution.
That means the idea to synchronize the state change with the
ASSOC_FAILED bit under the lock is not going to work. I am trying to
figure out a solution.