Re: [PATCH V10 2/3] PCI: handle CRS returned by device after FLR

From: Sinan Kaya
Date: Mon Aug 21 2017 - 09:44:20 EST

Next message: Juri Lelli: "Re: [PATCH v8 1/2] sched/deadline: Add support for SD_PREFER_SIBLING on find_later_rq()"
Previous message: Peter Zijlstra: "Re: [rfc patch] sched/topology: fix domain reconstruction memory leakage"
In reply to: Bjorn Helgaas: "Re: [PATCH V10 2/3] PCI: handle CRS returned by device after FLR"
Next in thread: Bjorn Helgaas: "Re: [PATCH V10 2/3] PCI: handle CRS returned by device after FLR"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Bjorn,

On 8/18/2017 5:01 PM, Bjorn Helgaas wrote:
> On Fri, Aug 11, 2017 at 12:56:35PM -0400, Sinan Kaya wrote:
>> Sporadic reset issues have been observed with Intel 750 NVMe drive while
>> assigning the physical function to the guest machine. The sequence of
>> events observed is as follows:
>>
>> - perform a Function Level Reset (FLR)
>> - sleep up to 1000ms total
>> - read ~0 from PCI_COMMAND
>> - warn that the device didn't return from FLR
>> - touch the device before it's ready
>> - drop register read and writes performing register settings restore
>> - incomplete reset operation and partial register restoration
>> - second time device probe fails in the guest machine as HW is left in
>> limbo.
>
> Pardon me while I think out loud for a while. It's taking me a while
> to untangle my confusion about how CRS works.
>

no problem,

> Prior to this patch, you saw the "Failed to return from FLR" warning.
> That means we did this:
>
> 0ms assert FLR
> 100ms read PCI_COMMAND, get ~0 (i == 0)
> 200ms read PCI_COMMAND, get ~0 (i == 1)
> ...
> 1000ms read PCI_COMMAND, get ~0 (i == 9)
>
> If we did not get completions for those config reads, the root port
> would complete the read as a failed transaction and probably
> synthesize ~0 data.

That's correct. This root port returns ~0 when destination is unreachable.

> But if that were the case, this patch would make
> no difference: it reads PCI_VENDOR_ID instead of PCI_COMMAND the first
> time around, that would also be a failed transaction, and we wouldn't
> interpret it as a CRS status, so the sequence would be exactly the
> above except the first read would be of PCI_VENDOR_ID, not
> PCI_COMMAND.
>
> Since this patch *does* make a difference, CRS Software Visibility
> must be supported and enabled, and we must have gotten completions
> with CRS status for the PCI_COMMAND reads.

That's right, CRS visibility is supported and enabled. Root port returns
0xFFFF0001 when vendor ID is read as per the spec.

> Per the completion
> handling rules in sec 2.3.2, the root complex should transparently
> retry the read (since we're not reading PCI_VENDOR_ID, CRS Software
> Visibility doesn't apply, so this is invisible to software). But the
> root complex *is* allowed to limit the number of retries and if the
> limit is reached, complete the request as a failed transaction.

This version of the root port doesn't support retry mechanism. This is a
TO-DO for the hardware team. This root port returns ~0 for all accesses
other than vendor id. This means we waited 1 seconds and the device
was not accessible. Command register was not reachable at the end of
1 seconds.

>
> That must be what happened before this patch. If that's the case, we
> should see an error logged from the failed transaction. We should be
> able to verify this if we collected "lspci -vv" output for the system
> before and after the FLR.
>

See above.

> If CRS SV is not present (or not enabled), the PCI_VENDOR_ID read will
> still get a CRS completion, but the root port will retry it instead of
> returning the 0x0001 ID. When we hit the retry limit, the root port
> will synthesize ~0 data to complete the read.

This is what is missing from the HW. I'm trying to get this corrected
to be complete.

>
> That means we won't call pci_bus_wait_crs() at all, and we'll fall
> back to the original PCI_COMMAND reads. That path will only wait
> 1000ms, just like the original code before this patch. So I *think*
> that if you disable CRS SV on this system (or if you put the device in
> a system that doesn't support CRS SV), you'll probably see the same
> "failed to return from FLR" warning.
>
> You could easily test this by using setpci to turn off
> PCI_EXP_RTCTL_CRSSVE in the root port leading to the NVMe device
> before the FLR.
>
> I think the implication is that we need to generalize
> pci_bus_wait_crs() to be more of a "wait for device ready" interface
> that can use CRS SV (if supported and enabled) or something like the
> PCI_COMMAND loop (if no CRS SV). It should use the same timeout
> either way, since CRS SV is a property of the root port, not of the
> device we're waiting for.

I saw your comment about timeout. I was trying not to change the behavior
for systems that do not support CRS visibility. we can certainly increase
the timeout if you think it is better.

I saw your V11, I'll review them in a minute.

>
> It's "interesting" that the PCI core error handling code doesn't look
> at the Master Abort bit at all, even though it's pretty fundamental to
> conventional PCI error reporting. I suspect Master Abort gets set
> whenever a root port synthesizes ~0 data, but (with the exception of
> some arch code) we never look at it and never clear it.
>
>> An endpoint is allowed to issue Configuration Request Retry Status (CRS)
>> following a FLR request to indicate that it is not ready to accept new
>> requests. CRS is defined in PCIe r3.1, sec 2.3.1. Request Handling Rules
>> and CRS usage in FLR context is mentioned in PCIe r3.1, sec 6.6.2.
>> Function-Level Reset.
>>
>> A CRS indication will only be given if the address to be read is vendor ID
>> register. pci_bus_read_dev_vendor_id() knows how to deal with CRS returned
>> 0xFFFF0001 value and will continue polling until a value other than
>> 0xFFFF0001 is returned within a given timeout.
>>
>> Try to discover device presence via CRS first. If device is not found, fall
>> through to old behavior.
>>
>> Signed-off-by: Sinan Kaya <okaya@xxxxxxxxxxxxxx>
>> ---
>> drivers/pci/pci.c | 23 +++++++++++++++++++++--
>> 1 file changed, 21 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index af0cc34..c853551 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -3821,17 +3821,36 @@ static void pci_flr_wait(struct pci_dev *dev)
>> {
>> int i = 0;
>> u32 id;
>> + bool ret;
>> +
>> + /*
>> + * Don't touch the HW before waiting 100ms. HW has to finish
>> + * within 100ms according to PCI Express Base Specification
>> + * Revision 3.1 Section 6.6.2: Function-Level Reset (FLR).
>> + */
>> + msleep(100);
>> +
>> + if (pci_bus_read_config_dword(dev->bus, dev->devfn, PCI_VENDOR_ID,
>> + &id))
>> + return;
>> +
>> + /* See if we can find a device via CRS first. */
>> + if ((id & 0xffff) == 0x0001) {
>> + ret = pci_bus_wait_crs(dev->bus, dev->devfn, &id, 60000);
>> + if (ret)
>> + return;
>> + }
>>
>> do {
>> msleep(100);
>> pci_read_config_dword(dev, PCI_COMMAND, &id);
>> - } while (i++ < 10 && id == ~0);
>> + } while (i++ < 9 && id == ~0);
>>
>> if (id == ~0)
>> dev_warn(&dev->dev, "Failed to return from FLR\n");
>> else if (i > 1)
>> dev_info(&dev->dev, "Required additional %dms to return from FLR\n",
>> - (i - 1) * 100);
>> + i * 100);
>> }
>>
>> /**
>> --
>> 1.9.1
>>
>

--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

Next message: Juri Lelli: "Re: [PATCH v8 1/2] sched/deadline: Add support for SD_PREFER_SIBLING on find_later_rq()"
Previous message: Peter Zijlstra: "Re: [rfc patch] sched/topology: fix domain reconstruction memory leakage"
In reply to: Bjorn Helgaas: "Re: [PATCH V10 2/3] PCI: handle CRS returned by device after FLR"
Next in thread: Bjorn Helgaas: "Re: [PATCH V10 2/3] PCI: handle CRS returned by device after FLR"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]