Re: [PATCH 1/1] nvme-pci: fix hang during error recovery when the PCI device is isolated

From: Niklas Schnelle
Date: Tue Jul 12 2022 - 10:37:04 EST


On Tue, 2022-07-12 at 08:17 -0600, Keith Busch wrote:
> On Tue, Jul 12, 2022 at 02:44:53PM +0200, Niklas Schnelle wrote:
> > On s390 and powerpc PCI devices are isolated when an error is detected
> > and driver->err_handler->error_detected is called with an inaccessible
> > PCI device and PCI channel state set to pci_channel_io_frozen
> > (see Step 1 in Documentation/PCI/pci-error-recovery.rst).
> >
> > In the case of NVMe devices nvme_error_detected() then calls
> > nvme_dev_disable(dev, false) and requests a reset. After a successful
> > reset the device is accessible again and nvme_slot_reset() resets the
> > controller and queues nvme_reset_work() which then recovers the
> > controller.
> >
> > Since commit b98235d3a471 ("nvme-pci: harden drive presence detect in
> > nvme_dev_disable()") however nvme_dev_disable() no longer freezes the
> > queues if pci_device_is_present() returns false. This is the case for an
> > isolated PCI device. In principle this makes sense as there are no
> > accessible hardware queues to run. The problem though is that for
> > a previously live reset controller with online queues nvme_reset_work()
> > calls nvme_wait_freeze() which, without the freeze having been
> > initiated, then hangs forever. Fix this by starting the freeze in
> > nvme_slot_reset() which is the earliest point where we know the device
> > should be accessible again.
> >
> > Fixes: b98235d3a471 ("nvme-pci: harden drive presence detect in nvme_dev_disable()")
> > Signed-off-by: Niklas Schnelle <schnelle@xxxxxxxxxxxxx>
>
> Oh, we've messed up the expected sequence. The mistaken assumption is a device
> not present means we're about to unbind from it, but it could appear that way
> just for normal error handling and reset, so we need to preserve the previous
> handling.
>
> The offending commit really just wants to avoid the register access (which we
> shouldn't have to do, but hey, broken hardware...). So let's keep the sequence
> the same as before and just skip the register read. Does this work for you?

Ah thanks for the explanation! I had actually tested a similar patch
but wasn't sure if nvme_start_freeze() also does register access for
starting the HW queues and if it makes sense on a dead/isolated device
at all. On the other hand this code very explicitly handles dead
devices so I guess this was kept in mind.

So yes the below patch works for me.

>
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index fdfee3e590db..c40e82cee735 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> static void nvme_dev_remove_admin(struct nvme_dev *dev)
> @@ -2690,9 +2772,11 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
> struct pci_dev *pdev = to_pci_dev(dev->dev);
>
> mutex_lock(&dev->shutdown_lock);
> - if (pci_device_is_present(pdev) && pci_is_enabled(pdev)) {
> - u32 csts = readl(dev->bar + NVME_REG_CSTS);
> + if (pci_is_enabled(pdev)) {
> + u32 csts = ~0;
>
> + if (pci_device_is_present(pdev))
> + csts = readl(dev->bar + NVME_REG_CSTS);
> if (dev->ctrl.state == NVME_CTRL_LIVE ||
> dev->ctrl.state == NVME_CTRL_RESETTING) {
> freeze = true;
> --