[Question] pcie_do_recovery and pci_enable_sriov deadlock problem

From: liwei (JK)
Date: Mon Mar 31 2025 - 07:44:08 EST


Hi, Bjorn

I have encountered a PCI-related deadlock triggered by a NONFATAL AER
event during the kdump kernel boot process. I have not yet found a
suitable fix for it and would appreciate your guidance in resolving it.

The deadlock is as follows:
A device is on the deferred_probe_pending_list, and the probe callback
of its struct pci_driver calls pci_enable_sriov(). If the device
triggers an AER NONFATAL event while that probe is running during the
kdump boot sequence, the two paths take device_lock and pci_bus_sem in
opposite order and deadlock.

The deferred_probe_work side is:

deferred_probe_work_func
  ...
  __device_attach
    device_lock                       # holds the device_lock
    ...
    pci_enable_sriov
      sriov_enable
        ...
        pci_device_add
          down_write(&pci_bus_sem)    # waits for the pci_bus_sem

The AER side is:

pcie_do_recovery
  pci_walk_bus
    down_read(&pci_bus_sem)           # holds the pci_bus_sem
    report_normal_detected
      device_lock                     # waits for device_unlock()
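
To make the lock-order inversion in the two traces above explicit, here
is a minimal userspace sketch in C. It is not kernel code: the enum
lock_id values and the helpers takes_before() and has_abba_inversion()
are hypothetical names made up for this illustration. It only models the
order in which each path takes the two locks, and it treats pci_bus_sem
as a plain lock, ignoring the reader/writer distinction (which does not
help here, since a held down_read() still blocks down_write()).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical identifiers for this model; these are not kernel symbols. */
enum lock_id { DEVICE_LOCK, PCI_BUS_SEM, NUM_LOCKS };

/* Acquisition order of each path, taken from the two call traces. */
static const enum lock_id probe_path[] = { DEVICE_LOCK, PCI_BUS_SEM };
static const enum lock_id aer_path[]   = { PCI_BUS_SEM, DEVICE_LOCK };

/* Return true if 'path' acquires lock 'a' at some point before lock 'b'. */
static bool takes_before(const enum lock_id *path, size_t n,
                         enum lock_id a, enum lock_id b)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (path[i] == a && path[j] == b)
                return true;
    return false;
}

/*
 * Return true if there is a pair of locks that one path takes in the
 * order a -> b while the other path takes it in the order b -> a
 * (the classic ABBA pattern).
 */
static bool has_abba_inversion(const enum lock_id *p1, size_t n1,
                               const enum lock_id *p2, size_t n2)
{
    for (int a = 0; a < NUM_LOCKS; a++)
        for (int b = 0; b < NUM_LOCKS; b++)
            if (a != b &&
                takes_before(p1, n1, (enum lock_id)a, (enum lock_id)b) &&
                takes_before(p2, n2, (enum lock_id)b, (enum lock_id)a))
                return true;
    return false;
}
```

Calling has_abba_inversion(probe_path, 2, aer_path, 2) reports the
inversion between the deferred probe path and the AER path, whereas
comparing a path against itself does not. The sketch says nothing about
which side should change its lock order.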


The same issue was previously reported by Jay Fang <f.fangjian@xxxxxxxxxx> in 2019.
Reference link: https://lore.kernel.org/linux-pci/bdfaaa34-3d3d-ad9a-4e24-4be97e85d216@xxxxxxxxxx/T/#mcb7dfafd0f76beaddfc9f56a71aee6d984ed4a7f

Thanks,
Xiangwei Li