On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy SathyanarayananNo, as per PCIe spec, hotplug and DPC has no functional dependency. Hence
<sathyanarayanan.nkuppuswamy@xxxxxxxxx> wrote:
Yes, it doesn't rely on hotplug handler to remove and rescan the device,
Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
merged fatal and non-fatal error recovery paths, and also made
recovery code depend on hotplug handler for "remove affected
device + rescan" support. But this change also complicated the
error recovery path and which in turn led to the following
issues.
1. We depend on hotplug handler for removing the affected
devices/drivers on DLLSC LINK down event (on DPC event
trigger) and DPC handler for handling the error recovery. Since
both handlers operate on same set of affected devices, it leads
to race condition, which in turn leads to NULL pointer
exceptions or error recovery failures.You can find more details
about this issue in following link.
https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.zhao@xxxxxxxxx/T/#t
2. For non-hotplug capable devices fatal (DPC) error recovery
is currently broken. Current fatal error recovery implementation
relies on PCIe hotplug (pciehp) handler for detaching and
re-enumerating the affected devices/drivers. So when dealing with
non-hotplug capable devices, recovery code does not restore the state
of the affected devices correctly. You can find more details about
this issue in the following links.
https://lore.kernel.org/linux-pci/20200527083130.4137-1-Zhiqiang.Hou@xxxxxxx/
https://lore.kernel.org/linux-pci/12115.1588207324@famine/
https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx/
In order to fix the above two issues, we should stop relying on hotplug
but it couldn't prevent hotplug drivers from doing another replicated
removal/rescanning.
it doesn't make sense to leave another useless removal/rescanning there.
Maybe that's why these two paths were merged to one and made it rely on
hotplug.
No, the core operation (remove/add device) is syncronzied and done in+ elseThough here you have lock, but hotplug will do another
+ udev = dev->bus->self;
+
+ parent = udev->subordinate;
+ pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
+
+ pci_lock_rescan_remove();
'pci_stop_and_remove_bus_device()'
without merging it with the hotplug driver, you have no way to
remove the replicated actions in
hotplug handler.
As I have mentioned before, holding the same lock should make them synchronized
Thanks,
Ethan
+ pci_dev_get(dev);And another pci_rescan_bus() like in the hotplug handler.
+ list_for_each_entry_safe_reverse(pdev, temp, &parent->devices,
+ bus_list) {
+ pci_stop_and_remove_bus_device(pdev);
+ }
+
+ result = reset_link(udev);
+
+ if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+ /*
+ * If the error is reported by a bridge, we think this error
+ * is related to the downstream link of the bridge, so we
+ * do error recovery on all subordinates of the bridge instead
+ * of the bridge and clear the error status of the bridge.
+ */
+ pci_aer_clear_fatal_status(dev);
+ if (pcie_aer_is_native(dev))
+ pcie_clear_device_status(dev);
+ }
+
+ if (result == PCI_ERS_RESULT_RECOVERED) {
+ if (pcie_wait_for_link(udev, true))
+ pci_rescan_bus(udev->bus);
+ pci_info(dev, "Device recovery from fatal error successful\n");
+ } else {
+ pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
+ pci_info(dev, "Device recovery from fatal error failed\n");
--
2.17.1