On Thu, Jun 04, 2020 at 02:50:01PM -0700, sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx wrote:
Ok. I will fix it in next version.
From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx>
Fatal (DPC) error recovery is currently broken for non-hotplug
capable devices. With the current implementation, after a successful
fatal error recovery, the state of a non-hotplug capable device is
not restored properly. Related issues are described at the following links.
https://lkml.org/lkml/2020/5/27/290
https://lore.kernel.org/linux-pci/12115.1588207324@famine/
https://lkml.org/lkml/2020/3/28/328
Can you please convert these all to lore.kernel.org links? lkml.org
is not quite as useful or reliable.
In case of platform that supports PCIe native hotplug, once the fatal
The current fatal error recovery implementation relies on the hotplug
handler to detach and re-enumerate the affected devices/drivers on DLLSC
state changes.
Can you remind us exactly how this relies on hotplug? I know it
*does*, but I can't remember how. It would sure be nice if we could
decouple this from pciehp somehow.
The device will not be accessible. AFAIK, doing I/O should fail.
So when dealing with non-hotplug capable devices, the
recovery code does not restore the state of the affected devices
correctly. A correct implementation should call the report_slot_reset()
function after resetting the link to restore the state of the
device/driver.
We don't restore the state correctly. What does this look like to the
user? Does the device not work?
For fatal errors, since the reset is not triggered by OS, we cannot save
So use PCI_ERS_RESULT_NEED_RESET as the error status for a successful
reset_link() operation and PCI_ERS_RESULT_DISCONNECT for the failure
case. The PCI_ERS_RESULT_NEED_RESET status ensures that slot_reset()
is called after the link reset, which also fixes the issue described
above.
I think PCI_ERS_RESULT_NEED_RESET results in calling driver
->slot_reset() callbacks, right? Where does the state restoration
happen?
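To make the proposed status handling concrete, here is a minimal
standalone sketch of the mapping described in the commit message. It is
a model only: the enum and both helper names are hypothetical stand-ins,
not the kernel's pci_ers_result_t or any function in
drivers/pci/pcie/err.c, and it says nothing about where the device state
is actually restored.

```c
#include <stdbool.h>

/*
 * Standalone model. These names mirror the PCI_ERS_RESULT_* values
 * discussed in this thread but are NOT the kernel definitions.
 */
enum ers_result {
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
};

/*
 * Hypothetical helper: map the outcome of a link reset to the status
 * that the remaining recovery steps start from.
 */
enum ers_result status_after_link_reset(enum ers_result reset_status)
{
	if (reset_status == ERS_RECOVERED)
		return ERS_NEED_RESET;	/* link is back: request slot_reset() */

	return ERS_DISCONNECT;		/* link reset failed: give up */
}

/* Hypothetical predicate: would the slot_reset() phase run? */
bool should_run_slot_reset(enum ers_result status)
{
	return status == ERS_NEED_RESET;
}
```

In this model a successful link reset always leads to the slot_reset()
phase, and any other outcome leads to disconnect, which is the behavior
the commit message asks for.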
For hotplug capable devices, driver is removed and reattached (on DLLSC
No, I guess it must be something in the hotplug driver that restores
the state, because you said devices below hotplug-capable ports work
correctly, but others don't.
Yes, but I am trying to explain why we ignore the status.
[original patch is from jay.vosburgh@xxxxxxxxxxxxx]
[original patch link https://lore.kernel.org/linux-pci/12115.1588207324@famine/]
Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
Signed-off-by: Jay Vosburgh <jay.vosburgh@xxxxxxxxxxxxx>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx>
---
drivers/pci/pcie/err.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 14bb8f54723e..5fe8561c7185 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -165,8 +165,28 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
pci_dbg(dev, "broadcast error_detected message\n");
if (state == pci_channel_io_frozen) {
pci_walk_bus(bus, report_frozen_detected, &status);
- status = reset_link(dev);
- if (status != PCI_ERS_RESULT_RECOVERED) {
+ /*
+ * After resetting the link using reset_link() call, the
+ * possible value of error status is either
+ * PCI_ERS_RESULT_DISCONNECT (failure case) or
+ * PCI_ERS_RESULT_NEED_RESET (success case).
+ * So ignore the return value of report_error_detected()
+ * call for fatal errors. Instead use
+ * PCI_ERS_RESULT_NEED_RESET as initial status value.
+ *
+ * Ignoring the status return value of report_error_detected()
+ * call will also help in case of EDR mode based error
+ * recovery. In EDR mode AER and DPC Capabilities are owned by
+ * firmware and hence report_error_detected() call will possibly
+ * return PCI_ERS_RESULT_NO_AER_DRIVER. So if we don't ignore
+ * the return value of report_error_detected() then
+ * pcie_do_recovery() would report incorrect status after
+ * successful recovery. Ignoring PCI_ERS_RESULT_NO_AER_DRIVER
+ * in non EDR case should not have any functional impact.
I can't make sense out of the comment. We already ignore the "status"
from pci_walk_bus(bus, report_frozen_detected, &status).
Following are more details related to second part of comment. Let me
No idea what to make of the second paragraph. If we make the commit
log make sense, maybe some summary of that would be useful here.
Ok. I will change to this logic in next version.
I think this code is equivalent and makes the patch much clearer:
	status = reset_link(dev);
	if (status == PCI_ERS_RESULT_RECOVERED) {
		status = PCI_ERS_RESULT_NEED_RESET;
	} else {
		status = PCI_ERS_RESULT_DISCONNECT;
		goto failed;
	}
+ */
+ status = PCI_ERS_RESULT_NEED_RESET;
+ if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) {
+ status = PCI_ERS_RESULT_DISCONNECT;
pci_warn(dev, "link reset failed\n");
goto failed;
}
--
2.17.1