Re: [PATCH v2 1/1] PCI: pciehp: Skip DLLSC handling if DPC is triggered

From: Lukas Wunner
Date: Wed Mar 17 2021 - 15:22:14 EST


On Wed, Mar 17, 2021 at 10:54:09AM -0700, Sathyanarayanan Kuppuswamy Natarajan wrote:
> Flush of hotplug event after successful recovery, and a simulated
> hotplug link down event after link recovery fails should solve the
> problems raised by Lukas. I assume Lukas' proposal adds this support.
> I will check his patch shortly.

Thank you!

I'd like to get a better understanding of the issues around hotplug/DPC.
Specifically, I'm wondering:

If DPC recovery was successful, what is the desired behavior of pciehp?
Should it ignore the Link Down/Up events, or bring the slot down and
back up after DPC recovery?

If the events are ignored, the driver of the device in the hotplug slot
is not unbound and rebound. So the driver must be able to cope with
loss of TLPs during DPC recovery and it must be able to cope with
whatever state the endpoint device is in after DPC recovery.
Is this really safe? How does the nvme driver deal with it?

Also, if DPC is handled by firmware, your patch does not ignore the
Link Down/Up events, so pciehp brings down the slot when DPC is
triggered, then brings it up after successful recovery. In a code
comment, you write that this behavior is okay because there's "no
race between hotplug and DPC recovery". However, Sinan wrote in
2018 that one of the issues with hotplug versus DPC is that pciehp
may turn off slot power and thereby foil DPC recovery. (Power off =
cold reset, whereas DPC recovery = warm reset.) This can occur
as well if DPC is handled by firmware.

So I guess pciehp should make an attempt to await DPC recovery even
if it's handled by firmware? Or am I missing something? We may be
able to achieve that by polling the DPC Trigger Status bit and
DLLLA bit, but it won't work as reliably as with native DPC support.
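
To make the polling idea concrete, here is a rough user-space sketch,
not a proposed patch. The register offsets and bit masks are the ones
from include/uapi/linux/pci_regs.h; everything else (struct port_ops,
struct fake_port, await_dpc_recovery() and the poll limit) is
hypothetical and only simulates the register reads. In the kernel the
reads would presumably go through pcie_capability_read_word() and
pci_read_config_word(), with a sleep between polls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Offsets and bits as defined in include/uapi/linux/pci_regs.h */
#define PCI_EXP_LNKSTA			0x12	/* Link Status, in PCIe cap */
#define PCI_EXP_LNKSTA_DLLLA		0x2000	/* Data Link Layer Link Active */
#define PCI_EXP_DPC_STATUS		0x08	/* DPC Status, in DPC ext cap */
#define PCI_EXP_DPC_STATUS_TRIGGER	0x0001	/* Trigger Status */

/* Hypothetical accessors standing in for config space reads. */
struct port_ops {
	uint16_t (*read_pcie_cap)(void *ctx, int offset);
	uint16_t (*read_dpc_cap)(void *ctx, int offset);
	void *ctx;
};

/*
 * Poll until DPC Trigger Status is clear and DLLLA is set, or until
 * max_polls attempts are used up.  Returns true if the link came back.
 * A real implementation would sleep between polls.
 */
static bool await_dpc_recovery(const struct port_ops *ops, int max_polls)
{
	assert(ops && ops->read_pcie_cap && ops->read_dpc_cap);

	for (int i = 0; i < max_polls; i++) {
		uint16_t dpc = ops->read_dpc_cap(ops->ctx, PCI_EXP_DPC_STATUS);
		uint16_t lnk = ops->read_pcie_cap(ops->ctx, PCI_EXP_LNKSTA);

		if (!(dpc & PCI_EXP_DPC_STATUS_TRIGGER) &&
		    (lnk & PCI_EXP_LNKSTA_DLLLA))
			return true;
	}
	return false;
}

/* Simulated port: stays in DPC for a number of DPC Status reads. */
struct fake_port {
	int polls_until_recovered;
};

static uint16_t fake_read_dpc(void *ctx, int offset)
{
	struct fake_port *p = ctx;

	(void)offset;
	if (p->polls_until_recovered > 0) {
		p->polls_until_recovered--;
		return PCI_EXP_DPC_STATUS_TRIGGER;
	}
	return 0;
}

static uint16_t fake_read_pcie(void *ctx, int offset)
{
	struct fake_port *p = ctx;

	(void)offset;
	/* Link is reported active only once the simulated DPC has cleared. */
	return p->polls_until_recovered > 0 ? 0 : PCI_EXP_LNKSTA_DLLLA;
}
```

Only if await_dpc_recovery() returns false would pciehp go on to bring
down the slot. The weakness is visible in the sketch: without a
notification from the DPC handler, pciehp can only guess how long to
keep polling.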

Finally, you write in your commit message that there are "a lot of
stability issues" if pciehp and DPC are allowed to recover freely
without proper serialization. What are these issues exactly?
(Beyond the slot power issue mentioned above, and that the endpoint
device's driver should presumably not be unbound if DPC recovery
was successful.)

Thanks!

Lukas