Re: [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling

From: Sinan Kaya
Date: Mon Oct 12 2020 - 10:50:37 EST


On 10/12/2020 1:03 AM, sathyanarayanan.nkuppuswamy@xxxxxxxxx wrote:
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx>
>
> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> merged fatal and non-fatal error recovery paths, and also made
> recovery code depend on hotplug handler for "remove affected
> device + rescan" support. But this change also complicated the
> error recovery path and which in turn led to the following
> issues.
>
> 1. We depend on hotplug handler for removing the affected
> devices/drivers on DLLSC LINK down event (on DPC event
> trigger) and DPC handler for handling the error recovery. Since
> both handlers operate on same set of affected devices, it leads
> to race condition, which in turn leads to  NULL pointer
> exceptions or error recovery failures.You can find more details
> about this issue in following link.
>
> https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.zhao@xxxxxxxxx/T/#t
>
> 2. For non-hotplug capable devices fatal (DPC) error recovery
> is currently broken. Current fatal error recovery implementation
> relies on PCIe hotplug (pciehp) handler for detaching and
> re-enumerating the affected devices/drivers. So when dealing with
> non-hotplug capable devices, recovery code does not restore the state
> of the affected devices correctly. You can find more details about
> this issue in the following links.
>
> https://lore.kernel.org/linux-pci/20200527083130.4137-1-Zhiqiang.Hou@xxxxxxx/
> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx/
>
> In order to fix the above two issues, we should stop relying on hotplug
> handler for cleaning the affected devices/drivers and let error recovery
> handler own this functionality. So this patch reverts Commit bdb5ac85777d
> ("PCI/ERR: Handle fatal error recovery") and re-introduce the  "remove
> affected device + rescan"  functionality in fatal error recovery handler.
>
> Also holding pci_lock_rescan_remove() will prevent the race between hotplug
> and DPC handler.
>
> Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx>
> ---
> Documentation/PCI/pci-error-recovery.rst | 47 ++++++++++------
> drivers/pci/pcie/err.c | 71 +++++++++++++++++++-----
> 2 files changed, 87 insertions(+), 31 deletions(-)

I'm not sure about locks involved but I do like the concept.
Current fatal error handling is best effort.

There is no way to recover if link is down by the time we
reach to fatal error handling routine.

This change will make the solution more reliable.

Reviewed-by: Sinan Kaya <okaya@xxxxxxxxxx>