Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system

From: Sinan Kaya
Date: Thu Apr 12 2018 - 10:34:48 EST

Next message: Lukasz Majewski: "[PATCH] doc: Add vendor prefix for Kaohsiung"
Previous message: Tony Lindgren: "Re: [PATCH 03/14] ARM: OMAP2+: omap_hwmod: Introduce HWMOD_NEEDS_REIDLE"
In reply to: Bjorn Helgaas: "Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system"
Next in thread: Keith Busch: "Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 4/12/2018 10:06 AM, Bjorn Helgaas wrote:
> [+cc Alex because of his interest in device reset]
>
> For context, here's the part of the patch we're discussing:
>
>>>> static void dpc_work(struct work_struct *work)
>>>> {
>>>> struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
>>>> struct pci_dev *pdev = dpc->dev->port;
>>>>
>>>> /* From DPC point of view error is always FATAL. */
>>>> - pcie_do_recovery(pdev, DPC_FATAL);
>>>> + if (!pdev->is_hotplug_bridge)
>>>> + pcie_do_recovery(pdev, DPC_FATAL);
>>>> + else
>>>> + dpc_reset_link_remove_dev(pdev);
>>>> }
>
> This is at the point where DPC has been triggered in the hardware and
> the DPC driver is starting recovery, and I'm wondering why we need to
> handle the "!pdev->is_hotplug_bridge" case differently.
>
> On Wed, Apr 11, 2018 at 09:41:56PM -0400, Sinan Kaya wrote:
>> On 4/10/2018 5:03 PM, Bjorn Helgaas wrote:
>>>> DPC and AER should attempt recovery in the same way, except the
>>>> cases where system is with hotplug enabled.
>>> What's the connection with hotplug? I see from the patch that for
>>> hotplug bridges you remove the tree below the bridge, and otherwise
>>> you just reset the secondary link (I think).
>>>
>>> The changelog should explain why we need the difference.
>>>
>>> I'm a little skeptical to begin with, because I'm not sure why we
>>> should handle a DPC event differently just because a bridge has the
>>> *capability* of hotplug. Even if a hotplug bridge reports a DPC
>>> event, that doesn't necessarily mean a hotplug has occurred.
>>
>> Let's do a recap on what we have discussed about this until now.
>>
>> There are two conflicting error recovery mechanisms for PCIe.
>>
>> If a system supports both hotplug and DPC, endpoint can be removed
>> and inserted safely. DPC driver shuts down the driver on link down.
>> When link comes back up, hotplug driver takes over and initiates an
>> enumeration process. Keith mentioned the stop and re-enumerate
>> design was chosen because someone could remove a drive and insert an
>> unrelated drive back to the system. We can't really save and
>> restore state as we do in the AER path.
>>
>> Now, let's assume a system without hotplug capability. Second
>> mechanism is to go through DPC/AER path and do an automatic link
>> down recovery via DPC retrain/secondary bus reset including register
>> save and restore. Second mechanism is more suitable for handling
>> "surprise link down" event. The goal is to retrain the link and
>> continue driver operation.
>>
>> The goal of this patch to separate these two cases from each other
>> as the DPC driver needs to work on both contexts. Current DPC code
>> doesn't handle the second use case.
>
> I think the scenario you are describing is two systems that are
> identical except that in the first, the endpoint is below a hotplug
> bridge, while in the second, it's below a non-hotplug bridge. There's
> no physical hotplug (no drive removed or inserted), and DPC is
> triggered in both systems.
>
> I suggest that DPC should be handled identically in both systems:
>
> - The PCI core should have the same view of the endpoint: it should
> be removed and re-added in both cases (or in neither case).
>
> - The endpoint itself should not be able to tell the difference: it
> should see a link down event, followed by a link retrain, followed
> by the same sequence of config accesses, etc.
>
> - The endpoint driver should not be able to tell the difference,
> i.e., we should be calling the same pci_error_handlers callbacks
> in both cases.
>
> It's true that in the non-hotplug system, pciehp probably won't start
> re-enumeration, so we might need an alternate path to trigger that.
>
> But that's not what we're doing in this patch. In this patch we're
> adding a much bigger difference: for hotplug bridges, we stop and
> remove the hierarchy below the bridge; for non-hotplug bridges, we do
> the AER-style flow of calling pci_error_handlers callbacks.

Our approach on V12 was to go to AER style recovery for all DPC events
regardless of hotplug support or not.

Keith was not comfortable with this approach. That's why, we special cased
hotplug.

If we drop 6/6 on this patch on v13, we achieve this. We still have to
take care of Keith's inputs on individual patches.

we have been struggling with the direction for a while.

Keith, what do you think?

>
> Bjorn
>

--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

Next message: Lukasz Majewski: "[PATCH] doc: Add vendor prefix for Kaohsiung"
Previous message: Tony Lindgren: "Re: [PATCH 03/14] ARM: OMAP2+: omap_hwmod: Introduce HWMOD_NEEDS_REIDLE"
In reply to: Bjorn Helgaas: "Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system"
Next in thread: Keith Busch: "Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]