On 3/11/2020 4:27 PM, Kuppuswamy Sathyanarayanan wrote:
> Hi,
>
> On 3/11/20 1:33 PM, Bjorn Helgaas wrote:
> > On Wed, Mar 11, 2020 at 05:27:35PM +0000, Austin.Bolen@xxxxxxxx wrote:
> > > On 3/11/2020 12:12 PM, Bjorn Helgaas wrote:
> > > <SNIP>
> > > >
> > > > I'm probably missing your intent, but that sounds like "the OS can
> > > > read/write AER bits whenever it wants, regardless of ownership."
> > > > That doesn't sound practical to me, and I don't think it's really
> > > > similar to DPC, where it's pretty clear that the OS can touch DPC bits
> > > > it doesn't own but only *during the EDR processing window*.
> > >
> > > Yes, by treating AER bits like DPC bits I meant I'd define the specific
> > > time windows when OS can touch the AER status bits similar to how it's
> > > done for DPC in the current ECN.
> >
> > Makes sense, thanks.
> >
> > > > > For the normative text describing when OS clears the AER bits
> > > > > following the informative flow chart, it could say that OS clears
> > > > > AER as soon as possible after OST returns and before OS processes
> > > > > _HPX and loading drivers. Open to other suggestions as well.
> > > >
> > > > I'm not sure what to do with "as soon as possible" either. That
> > > > doesn't seem like something firmware and the OS can agree on.
> > >
> > > I can just state that it's done after OST returns but before _HPX or
> > > driver is loaded. Any time in that range is fine. I can't get super
> > > specific here because different OSes do different things. Even for
> > > a given OS they change over time. And I need something generic
> > > enough to support a wide variety of OS implementations.
> >
> > Yeah. I don't know how to solve this.
> >
> > > > Linux doesn't actually unload and reload drivers for the child devices
> > > > (Sathy, correct me if I'm wrong here) even though DPC containment
> > > > takes the link down and effectively unplugs and replugs the device. I
> > > > would *like* to handle it like hotplug, but some higher-level software
> > > > doesn't deal well with things like storage devices disappearing and
> > > > reappearing.
> > > >
> > > > Since Linux doesn't actually re-enumerate the child devices, it
> > > > wouldn't evaluate _HPX again. It would probably be cleaner if it did,
> > > > but it's all tied up with the whole unplug/replug problem.
> > >
> > > DPC resets everything below it and so to get it back up and running it
> > > would mean that all buses and resources need to be assigned, _HPX
> > > evaluated, and drivers reloaded. If those things don't happen then the
> > > whole hierarchy below the port that triggered DPC will be inaccessible.
> >
> > Hmm, I think I might be confusing this with another situation. Sathy,
> > can you help me understand this? I don't have a way to actually
> > exercise this EDR path. Is there some way the pciehp hotplug driver
> > gets involved here?
>
> If the port has hot-plug enabled then DPC trigger will cause the link to
> go down (disabled state) and will generate a DLLSC hot-plug interrupt.
> When DPC is released, the link will become active and generate another
> DLLSC hot-plug interrupt.
>
> Now that I have hardware to verify this scenario, I will look into

Yes, device/driver enumeration and removal will be triggered by DLLSC
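The hot-plug behaviour Sathy describes can be modelled as a small state machine. What follows is a toy sketch, not pciehp code. The `HotplugPort` class, its method names, the device name `nvme0`, and the policy of removing every child on link-down and re-enumerating on link-up are all assumptions made for illustration. The sketch only shows that each DPC trigger and each release produces one DLLSC event.

```python
from enum import Enum


class LinkState(Enum):
    ACTIVE = "active"
    DISABLED = "disabled"


class HotplugPort:
    """Toy model of a hot-plug capable port (hypothetical, not pciehp)."""

    def __init__(self, children):
        self.link = LinkState.ACTIVE
        self.attached = list(children)    # devices physically behind the port
        self.enumerated = list(children)  # devices the OS currently knows about
        self.events = []                  # log of DLLSC interrupts

    def _set_link(self, state):
        if state is self.link:
            return  # no Data Link Layer state change, so no DLLSC
        self.link = state
        self.events.append(("DLLSC", state.value))
        if state is LinkState.DISABLED:
            # Assumed hot-plug policy: link down removes the devices.
            self.enumerated = []
        else:
            # Assumed hot-plug policy: link up re-enumerates the devices.
            self.enumerated = list(self.attached)

    def dpc_trigger(self):
        # DPC containment puts the link in the disabled state.
        self._set_link(LinkState.DISABLED)

    def dpc_release(self):
        # Releasing containment lets the link return to active.
        self._set_link(LinkState.ACTIVE)


port = HotplugPort(["nvme0"])
port.dpc_trigger()
print(port.enumerated)  # []
port.dpc_release()
print(port.enumerated)  # ['nvme0']
print(port.events)      # [('DLLSC', 'disabled'), ('DLLSC', 'active')]
```

The model covers only a hot-plug capable port with a single level of children. It does not cover non-hot-plug slots or switch hierarchies behind the port.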
> > Here's how this seems to work as far as I can tell:
> >
> > - Linux does not have DPC or AER control
> > - Linux installs EDR notify handler
> > - Linux evaluates DPC Enable _DSM
> > - DPC containment event occurs
> > - Firmware fields DPC interrupt
> > - DPC event is not a surprise remove
> > - Firmware sends EDR notification
> > - Linux EDR notify handler evaluates Locate _DSM
> > - Linux reads and logs DPC and AER error information for port in
> >   containment mode. [If it was an RP PIO error, Linux clears RP PIO
> >   error status, which is an asymmetry with the non-RP PIO path.]
> > - Linux clears AER error status (pci_aer_raw_clear_status())
> > - Linux calls driver .error_detected() methods for all child devices
> >   of the port in containment mode (pcie_do_recovery()). These
> >   devices are inaccessible because the link is down.
> > - Linux clears DPC Trigger Status (dpc_reset_link() from
> >   pcie_do_recovery()).
> > - Linux calls driver .mmio_enabled() methods for all child devices.
> >
> > This is where I get lost. These child devices are now accessible, but
> > they've been reset, so I don't know how their config space got
> > restored. Did pciehp enumerate them? Did we do something like
> > pci_restore_state()? I don't see where either of these happens.
>
> AFAIK, AER error status registers are sticky (RW1CS) and hence
> will be preserved during reset.

In our testing, the device directly connected to the port that was
contained does get reprogrammed and the driver is reloaded. These are
hot-plug slots and so might be due to DLLSC hot-plug interrupt when
containment is released and link goes back to active state.
However, if a switch is connected to the port where DPC was triggered
then we do not see the whole switch hierarchy being re-enumerated.

> If hotplug is not supported then there is support to enumerate

Also, DPC could be enabled on non-hot-plug slots so can't always rely on
hot-plug to re-init devices in the recovery path.
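Sathy's point depends on the RW1CS attribute. Software clears such a status bit by writing 1 to it, and the bit is "sticky", meaning a hot reset does not clear it. Below is a minimal Python model. The class and method names are hypothetical. The thread says these bits are preserved across the DPC-triggered reset, and the model takes that as an assumption rather than deriving it from the spec text.

```python
class RW1CSRegister:
    """Hypothetical model of a sticky, write-1-to-clear (RW1CS) status register."""

    def __init__(self, width=32):
        self.mask = (1 << width) - 1
        self.value = 0

    def hw_set(self, bits):
        # Hardware latches an error by setting status bits.
        self.value |= bits & self.mask

    def read(self):
        return self.value

    def write(self, data):
        # RW1C: each 1 written clears that bit; 0s leave bits unchanged.
        self.value &= ~data & self.mask

    def hot_reset(self):
        # Sticky ("S"): assumed to survive this reset, so nothing changes.
        pass


status = RW1CSRegister()
status.hw_set(0x41)          # two error bits latched (arbitrary example values)
status.hot_reset()
print(hex(status.read()))    # 0x41
status.write(status.read())  # write back what was read to clear exactly those bits
print(hex(status.read()))    # 0x0
```

Writing back the value just read is a common way to clear RW1C status. It clears only the bits that were observed, so a bit that hardware sets between the read and the write stays latched.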
> > So they want to basically do native AER handling even though firmware
> > owns AER? My head hurts.
>
> No, it's meant only for clearing AER registers. In EDR path, since
> OS owns clearing DPC registers, they want to let OS own clearing AER
> registers as well. Also, it would give OS a chance to decide whether
> we want to keep the device on based on error status and history of the
> device attached.

Right. The way it was pitched to me was that the OSVs wanted to
read/clear the error status bits so they could re-use the code that does
that when OS natively owns AER/DPC.

> > Bjorn
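Austin's proposed normative rule says the OS clears AER status after OST returns but before _HPX is evaluated or a driver is loaded, and any point in that range is acceptable. That rule can be stated as an ordering check over a trace of one recovery episode. The event names below are hypothetical labels, not ACPI or Linux identifiers. The function encodes the rule only as it is worded in the thread.

```python
def aer_clear_in_window(trace):
    """Check a hypothetical event trace against the proposed AER-clear window.

    The clear must come after "ost_return" and before any "hpx_eval" or
    "driver_load" that appears in the trace. A trace with no _HPX or
    driver-load event is accepted (e.g. an OS that does not re-enumerate).
    """
    if "ost_return" not in trace or "aer_clear" not in trace:
        return False
    start = trace.index("ost_return")
    clear = trace.index("aer_clear")
    ends = [trace.index(e) for e in ("hpx_eval", "driver_load") if e in trace]
    return start < clear and all(clear < end for end in ends)


print(aer_clear_in_window(["ost_return", "aer_clear", "hpx_eval", "driver_load"]))  # True
print(aer_clear_in_window(["aer_clear", "ost_return", "driver_load"]))              # False
```

Accepting traces with no _HPX or driver-load event is a deliberate choice. It accommodates an OS that, as Bjorn describes for Linux, does not re-enumerate the child devices after containment.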