Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

From: Kuppuswamy, Sathyanarayanan
Date: Sun Sep 27 2020 - 22:43:56 EST


Hi,

On 9/25/20 11:30 AM, Sinan Kaya wrote:
On 9/25/2020 2:16 PM, Kuppuswamy, Sathyanarayanan wrote:

If this is too involved a change, the DPC driver should restore state
when hotplug is not supported.
Yes, we can add a check for hotplug capability.

DPC driver should be self-sufficient by itself.


Sounds good.

Also for non-fatal errors, if reset is requested then we still need
some kind of bus reset call here

DPC should handle both fatal and non-fatal cases
Currently DPC is only triggered for FATAL errors.
 and cause a bus reset

Thanks for the heads up.
This seems to have changed since I looked at the DPC code.

in hardware already before triggering an interrupt.
Error recovery is not triggered only by the DPC driver. AER also uses the
same error recovery code. If DPC is not supported, then we still need
the reset logic.

It sounds like we are mixing up two issues.

1. no state restore on DPC after FATAL error.
Let's fix this.
Agreed. A few more details about the above issue:

There are two cases under FATAL error.

FATAL + hotplug - In this case, the link will be reset, and the hotplug
handler will remove the driver state. This case works well with the
current code.

FATAL + no-hotplug - In this case, the link will still be reset, but
currently the driver state is not properly restored. So I attempted
to restore it using pci_reset_bus().

status = reset_link(dev);
- if (status != PCI_ERS_RESULT_RECOVERED) {
+ if (status == PCI_ERS_RESULT_RECOVERED) {
+ status = PCI_ERS_RESULT_NEED_RESET;

...

if (status == PCI_ERS_RESULT_NEED_RESET) {
/*
- * TODO: Should call platform-specific
- * functions to reset slot before calling
- * drivers' slot_reset callbacks?
+ * TODO: Optimize the call to pci_reset_bus()
+ *
+ * There are two components to pci_reset_bus().
+ *
+ * 1. Do platform specific slot/bus reset.
+ * 2. Save/Restore all devices in the bus.
+ *
+ * For hotplug capable devices and fatal errors,
+ * device is already in reset state due to link
+ * reset. So repeating platform specific slot/bus
+ * reset via pci_reset_bus() call is redundant. So we
+ * can optimize this logic and conditionally call
+ * pci_reset_bus().
*/
+ pci_reset_bus(dev);
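
To make the optimization named in the TODO concrete, here is a rough
userspace sketch of the condition under which pci_reset_bus() could be
skipped. It is not kernel code: the enum and the helper
model_should_reset_bus() are hypothetical names used only for
illustration, and the rule it encodes is just my reading of the two
FATAL cases described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical severity type; not the kernel's definition. */
enum model_severity { MODEL_NONFATAL, MODEL_FATAL };

/*
 * Decide whether a pci_reset_bus()-style call is still worthwhile.
 *
 * Assumption taken from the discussion: for a FATAL error on a
 * hotplug-capable port, the link reset has already put the device in
 * reset and the hotplug handler removes the driver state, so a second
 * reset plus save/restore is redundant. Every other case still needs it.
 */
static bool model_should_reset_bus(enum model_severity sev, bool hotplug_capable)
{
	assert(sev == MODEL_NONFATAL || sev == MODEL_FATAL);

	if (sev == MODEL_FATAL && hotplug_capable)
		return false;

	return true;
}
```

With this model, only the FATAL + hotplug combination skips the call;
FATAL + no-hotplug and both NON_FATAL combinations still reset the bus.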


2. no bus reset on NON_FATAL error through AER driver path.
This already tells me that you need to split your change into
multiple patches.

Let's talk about this too. bus reset should be triggered via
AER driver before informing the recovery.
But as per the error recovery documentation, any call to
->error_detected() or ->mmio_enabled() can request
PCI_ERS_RESULT_NEED_RESET. So we need to add code
to do the actual reset before calling the ->slot_reset()
callback. The call to pci_reset_bus() fixes this
issue.

if (status == PCI_ERS_RESULT_NEED_RESET) {
+ pci_reset_bus(dev);



if (status == PCI_ERS_RESULT_NEED_RESET) {
/*
* TODO: Should call platform-specific
* functions to reset slot before calling
* drivers' slot_reset callbacks?
*/
status = PCI_ERS_RESULT_RECOVERED;
pci_dbg(dev, "broadcast slot_reset message\n");
pci_walk_bus(bus, report_slot_reset, &status);
}
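
The ordering issue for the NON_FATAL path can be illustrated with a small
userspace model. This is only a sketch under stated assumptions: the
types and functions below (model_dev, model_slot_reset(),
model_recover()) are hypothetical stand-ins, not kernel APIs, and the
model assumes that a driver's slot_reset step can only succeed if the
device was actually reset first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes, loosely mirroring PCI_ERS_RESULT_*. */
enum model_ers {
	MODEL_CAN_RECOVER,
	MODEL_NEED_RESET,
	MODEL_RECOVERED,
	MODEL_DISCONNECT,
};

struct model_dev {
	enum model_ers vote;	/* what error_detected()/mmio_enabled() asked for */
	bool was_reset;		/* set once the bus reset has been performed */
};

/* Stand-in for a driver's slot_reset callback (assumed behaviour). */
static enum model_ers model_slot_reset(const struct model_dev *dev)
{
	return dev->was_reset ? MODEL_RECOVERED : MODEL_DISCONNECT;
}

/*
 * Stand-in for the NEED_RESET branch of the recovery path.
 * do_bus_reset selects between the proposed flow (reset first) and
 * the current flow (slot_reset broadcast without a reset).
 */
static enum model_ers model_recover(struct model_dev *dev, bool do_bus_reset)
{
	enum model_ers status;

	assert(dev != NULL);
	status = dev->vote;

	if (status == MODEL_NEED_RESET) {
		if (do_bus_reset)
			dev->was_reset = true;	/* models pci_reset_bus(dev) */
		status = model_slot_reset(dev);
	}

	return status;
}
```

In this model, a device that votes for a reset only reaches the
recovered state when the reset is performed before the slot_reset step;
a device that never asks for a reset is left untouched.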


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer