Re: [PATCH v12 10/15] PCI/ERR: Limit AER resets in pcie_do_recovery()

From: Kelley, Sean V
Date: Mon Nov 30 2020 - 20:14:54 EST




> On Nov 30, 2020, at 4:25 PM, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Mon, Nov 30, 2020 at 07:54:37PM +0000, Kelley, Sean V wrote:
>>> On Nov 24, 2020, at 9:17 AM, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>>> On Mon, Nov 23, 2020 at 11:57:35PM +0000, Kelley, Sean V wrote:
>>>>> On Nov 23, 2020, at 3:28 PM, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>>>>> On Fri, Nov 20, 2020 at 04:10:31PM -0800, Sean V Kelley wrote:
>>>>>> In some cases a bridge may not exist as the hardware controlling may be
>>>>>> handled only by firmware and so is not visible to the OS. This scenario is
>>>>>> also possible in future use cases involving non-native use of RCECs by
>>>>>> firmware.
>>>>>>
>>>>>> Explicitly apply conditional logic around these resets by limiting them to
>>>>>> Root Ports and Downstream Ports.
>>>>>
>>>>> Can you help me understand this? The subject says "Limit AER resets"
>>>>> and here you say "limit them to RPs and DPs", but it's not completely
>>>>> obvious how the resets are being limited, i.e., the patch doesn't add
>>>>> anything like:
>>>>>
>>>>> + if (type == PCI_EXP_TYPE_ROOT_PORT ||
>>>>> + type == PCI_EXP_TYPE_DOWNSTREAM)
>>>>> reset_subordinates(bridge);
>>>>>
>>>>> It *does* add checks around pcie_clear_device_status(), but that also
>>>>> includes RC_EC. And that's not a reset, so I don't think that's
>>>>> explicitly mentioned in the commit log.
>>>>
>>>> The subject should have referred to the clearing of the device status rather than resets.
>>>> It originally came from this simpler patch in which I made use of reset instead of clear:
>>>>
>>>> https://lore.kernel.org/linux-pci/20201002184735.1229220-8-seanvk.dev@xxxxxxxxxxxxxxxx/
>>>>
>>>> So a rephrase of clearing in place of resets would be more appropriate.
>>>>
>>>> Then we added the notion of bridges…below
>>>>
>>>>>
>>>>> Also see the question below.
>>>>>
>>>>>> Link: https://lore.kernel.org/r/20201002184735.1229220-8-seanvk.dev@xxxxxxxxxxxxxxxx
>>>>>> Signed-off-by: Sean V Kelley <sean.v.kelley@xxxxxxxxx>
>>>>>> Signed-off-by: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
>>>>>> Acked-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
>>>>>> ---
>>>>>> drivers/pci/pcie/err.c | 31 +++++++++++++++++++++++++------
>>>>>> 1 file changed, 25 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>>>>>> index 8b53aecdb43d..7883c9791562 100644
>>>>>> --- a/drivers/pci/pcie/err.c
>>>>>> +++ b/drivers/pci/pcie/err.c
>>>>>> @@ -148,13 +148,17 @@ static int report_resume(struct pci_dev *dev, void *data)
>>>>>>
>>>>>> /**
>>>>>> * pci_walk_bridge - walk bridges potentially AER affected
>>>>>> - * @bridge: bridge which may be a Port
>>>>>> + * @bridge: bridge which may be a Port, an RCEC with associated RCiEPs,
>>>>>> + * or an RCiEP associated with an RCEC
>>>>>> * @cb: callback to be called for each device found
>>>>>> * @userdata: arbitrary pointer to be passed to callback
>>>>>> *
>>>>>> * If the device provided is a bridge, walk the subordinate bus, including
>>>>>> * any bridged devices on buses under this bus. Call the provided callback
>>>>>> * on each device found.
>>>>>> + *
>>>>>> + * If the device provided has no subordinate bus, call the callback on the
>>>>>> + * device itself.
>>>>>> */
>>>>>> static void pci_walk_bridge(struct pci_dev *bridge,
>>>>>> int (*cb)(struct pci_dev *, void *),
>>>>>> @@ -162,6 +166,8 @@ static void pci_walk_bridge(struct pci_dev *bridge,
>>>>>> {
>>>>>> if (bridge->subordinate)
>>>>>> pci_walk_bus(bridge->subordinate, cb, userdata);
>>>>>> + else
>>>>>> + cb(bridge, userdata);
>>>>>> }
>>>>>>
>>>>>> pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>>>>> @@ -174,10 +180,13 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>>>>>
>>>>>> /*
>>>>>> * Error recovery runs on all subordinates of the bridge. If the
>>>>>> - * bridge detected the error, it is cleared at the end.
>>>>>> + * bridge detected the error, it is cleared at the end. For RCiEPs
>>>>>> + * we should reset just the RCiEP itself.
>>>>>> */
>>>>>> if (type == PCI_EXP_TYPE_ROOT_PORT ||
>>>>>> - type == PCI_EXP_TYPE_DOWNSTREAM)
>>>>>> + type == PCI_EXP_TYPE_DOWNSTREAM ||
>>>>>> + type == PCI_EXP_TYPE_RC_EC ||
>>>>>> + type == PCI_EXP_TYPE_RC_END)
>>>>>> bridge = dev;
>>>>>> else
>>>>>> bridge = pci_upstream_bridge(dev);
>>>>>> @@ -185,6 +194,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>>>>> pci_dbg(bridge, "broadcast error_detected message\n");
>>>>>> if (state == pci_channel_io_frozen) {
>>>>>> pci_walk_bridge(bridge, report_frozen_detected, &status);
>>>>>> + if (type == PCI_EXP_TYPE_RC_END) {
>>>>>> + pci_warn(dev, "subordinate device reset not possible for RCiEP\n");
>>>>>> + status = PCI_ERS_RESULT_NONE;
>>>>>> + goto failed;
>>>>>> + }
>>>>>> +
>>>>>> status = reset_subordinates(bridge);
>>>>>> if (status != PCI_ERS_RESULT_RECOVERED) {
>>>>>> pci_warn(bridge, "subordinate device reset failed\n");
>>>>>> @@ -217,9 +232,13 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>>>>> pci_dbg(bridge, "broadcast resume message\n");
>>>>>> pci_walk_bridge(bridge, report_resume, &status);
>>>>>>
>>>>>> - if (pcie_aer_is_native(bridge))
>>>>>> - pcie_clear_device_status(bridge);
>>>>>> - pci_aer_clear_nonfatal_status(bridge);
>>>>>> + if (type == PCI_EXP_TYPE_ROOT_PORT ||
>>>>>> + type == PCI_EXP_TYPE_DOWNSTREAM ||
>>>>>> + type == PCI_EXP_TYPE_RC_EC) {
>>>>>> + if (pcie_aer_is_native(bridge))
>>>>>> + pcie_clear_device_status(bridge);
>>>>>> + pci_aer_clear_nonfatal_status(bridge);
>>>>>
>>>>> This is hard to understand because "type" is from "dev", but "bridge"
>>>>> is not necessarily the same device. Should it be this?
>>>>>
>>>>> type = pci_pcie_type(bridge);
>>>>> if (type == PCI_EXP_TYPE_ROOT_PORT ||
>>>>> ...)
>>>>
>>>> Correct, it would be better if the type was based on the ‘bridge’.
>>>
>>> OK. This is similar to
>>> https://lore.kernel.org/linux-pci/20201002184735.1229220-8-seanvk.dev@xxxxxxxxxxxxxxxx/,
>>> which you cited above except for the bridge/dev question and the
>>> addition here of RC_EC.
>>>
>>> I tried to split that back into its own patch and started with the
>>> commit message from that patch. But I got stuck on the commit
>>> message. I got as far as:
>>>
>>> In some cases an error may be reported by a device not visible to
>>> the OS, e.g., if firmware manages the device and passes error
>>> information to the OS via ACPI APEI.
>>>
>>> But I still can't quite connect that to the patch. "bridge" is
>>> clearly a device visible to Linux.
>>>
>>> I guess we're trying to assert that if "bridge" is not a Root Port,
>>> Downstream Port, or RCEC, we shouldn't clear the error status because
>>> the error came from a device Linux doesn't know about. But I think
>>> "bridge" is *always* either a Root Port or a Downstream Port:
>>
>> That’s ultimately what we are trying to do.
>>
>>> if (type == PCI_EXP_TYPE_ROOT_PORT ||
>>> type == PCI_EXP_TYPE_DOWNSTREAM)
>>> bridge = dev;
>>> else
>>> bridge = pci_upstream_bridge(dev);
>>>
>>> pci_upstream_bridge() returns either NULL (in which case previous uses
>>> dereference a NULL pointer), or dev->bus->self, which is always a Root
>>> Port, Switch Downstream Port, or Switch Upstream Port (or NULL for the
>>> special case of VFs).
>>
>> In the past recall we were augmenting it with bridge = dev->rcec for
>> RC_END. But we were able to relocate the handling in
>> aer_root_reset().
>>
>> So in this patch - we add the conditionals because RC_END is being
>> passed in addition to RC_EC.
>>
>> if (type == PCI_EXP_TYPE_ROOT_PORT ||
>>
>> - type == PCI_EXP_TYPE_DOWNSTREAM)
>>
>> + type == PCI_EXP_TYPE_DOWNSTREAM ||
>> + type == PCI_EXP_TYPE_RC_EC ||
>> + type == PCI_EXP_TYPE_RC_END)
>>
>> bridge = dev;
>> else
>> bridge = pci_upstream_bridge(dev);
>>
>> So we need to check for RP, DS, and RC_EC
>>
>> @@ -217,9 +232,13 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>
>> pci_dbg(bridge, "broadcast resume message\n");
>> pci_walk_bridge(bridge, report_resume, &status);
>>
>>
>> - if (pcie_aer_is_native(bridge))
>> - pcie_clear_device_status(bridge);
>> - pci_aer_clear_nonfatal_status(bridge);
>>
>> + if (type == PCI_EXP_TYPE_ROOT_PORT ||
>> + type == PCI_EXP_TYPE_DOWNSTREAM ||
>> + type == PCI_EXP_TYPE_RC_EC) {
>> + if (pcie_aer_is_native(bridge))
>> + pcie_clear_device_status(bridge);
>> + pci_aer_clear_nonfatal_status(bridge);
>> + }
>>
>> Breaking out a separate patch would be unnecessary as you correctly
>> point out that it’s only going to be an RP or DS before this patch.
>
> Still trying to sort this out in my head, so half-baked questions
> before I quit for the day: We call pcie_do_recovery() in four paths:
> AER, APEI, DPC, and EDR, and I'm trying to understand what "dev" we
> pass in all those cases.
>
> For DPC, I think "dev" must be a downstream port that triggered DPC,
> and its secondary link is disabled. The device and any siblings have
> already been reset by the link going down. We want to clear AER
> status in downstream device(s) after they come back up (the AER status
> bits are sticky, so they're not cleared by the reset), and the AER
> status in the downstream port.
>
> I think EDR is the same as DPC?
>
> For AER, I think "dev" will typically (maybe always?) be the device
> that detected the error and logged it in its AER Capability, not the
> Root Port or RCEC that generated the interrupt. We want to reset
> "dev" and any siblings, clear AER status in "dev", and clear AER
> status in the Root Port or RCEC.


It’s also possible for RCEC’s to have errors themselves without a corresponding end point device.


Sean


>
> For APEI, I assume "dev" is typically the device that detected the
> error, and we want to reset it and any siblings. Firmware has already
> read the AER status for "dev", and I assume firmware also clears it.
> I assume firmware is also responsible for clearing AER status in the
> Root Port, RCEC, or non-architected device that generated the
> interrupt.
>
> It seems like there are basically two devices of interest in
> pcie_do_recovery(): the endpoint where we have to call the driver
> error recovery, and the port that generated the interrupt. I wonder
> if this would make more sense if the caller passed both of them in
> explicitly instead of having pcie_do_recovery() check the type of
> "dev" and try to figure things out after the fact.
>
> Bjorn