Re: [PATCH] PCI: Add missing link delays required by the PCIe spec
From: Mika Westerberg
Date: Tue Aug 27 2019 - 05:25:50 EST
On Mon, Aug 26, 2019 at 05:05:02PM -0500, Bjorn Helgaas wrote:
> On Mon, Aug 26, 2019 at 05:42:42PM +0300, Mika Westerberg wrote:
> > On Mon, Aug 26, 2019 at 09:07:12AM -0500, Bjorn Helgaas wrote:
> > > On Mon, Aug 26, 2019 at 01:17:26PM +0300, Mika Westerberg wrote:
> > > > On Fri, Aug 23, 2019 at 09:12:54PM -0500, Bjorn Helgaas wrote:
>
> > > > > But the "wait downstream" part seems a little too specific to be at
> > > > > the .resume_noirq and .runtime_resume level.
> > > > >
> > > > > Do we descend the hierarchy and call .resume_noirq and .runtime_resume
> > > > > for the children of the bridge, too?
> > > >
> > > > We do but at that time it is too late as we have already resumed pciehp
> > > > of the parent downstream port and it may have already started tearing
> > > > down the device stack below.
> > > >
> > > > I'm open to any better ideas where this delay could be added, though :)
> > >
> > > So we resume pciehp *before* we're finished resuming the Downstream
> > > Port? That sounds wrong.
> >
> > I mean once we resume the downstream port (the bridge) we also resume
> > "PCIe port services" including pciehp and only then descend to whatever
> > device is physically connected to that port.
>
> That order sounds right. I guess I'd have to see more details about
> what's happening with pciehp to understand this. Do you happen to
> have a trace (dmesg, backtrace, etc) of pciehp tearing things down?
Here are the relevant parts from ICL. I'll send you the full dmesg as
a separate email. The topology is such that I have 2 Titan Ridge devices
connected in chain (each include PCIe switch + xHCI). I wait for the
whole hierarchy to enter D3cold:
[ 50.485411] thunderbolt 0000:00:0d.3: power state changed by ACPI to D3cold
Here I plug in normal type-C memory stick to last Titan Ridge port which
wakes up the hierarchy:
[ 63.404860] thunderbolt 0000:00:0d.3: power state changed by ACPI to D0
[ 63.408558] thunderbolt 0000:00:0d.2: power state changed by ACPI to D0
[ 63.512696] pcieport 0000:00:07.3: power state changed by ACPI to D0
[ 63.512921] pcieport 0000:00:07.0: power state changed by ACPI to D0
[ 63.512960] pcieport 0000:00:07.2: power state changed by ACPI to D0
[ 63.519567] pcieport 0000:00:07.1: power state changed by ACPI to D0
[ 63.524365] thunderbolt 0000:00:0d.3: PME# disabled
[ 63.524379] thunderbolt 0000:00:0d.3: control channel starting...
[ 63.524382] thunderbolt 0000:00:0d.3: starting TX ring 0
[ 63.524388] thunderbolt 0000:00:0d.3: enabling interrupt at register 0x38200 bit 0 (0x0 -> 0x1)
[ 63.524390] thunderbolt 0000:00:0d.3: starting RX ring 0
[ 63.524396] thunderbolt 0000:00:0d.3: enabling interrupt at register 0x38200 bit 12 (0x1 -> 0x1001)
[ 63.525777] thunderbolt 0000:00:0d.2: PME# disabled
[ 63.592667] thunderbolt 0000:00:0d.3: ICM rtd3 veto=0x00000001
[ 63.627014] pcieport 0000:00:07.3: restoring config space at offset 0x2c (was 0x60, writing 0x60)
[ 63.627018] pcieport 0000:00:07.0: restoring config space at offset 0x2c (was 0x60, writing 0x60)
[ 63.627026] pcieport 0000:00:07.0: restoring config space at offset 0x28 (was 0x60, writing 0x60)
[ 63.627028] pcieport 0000:00:07.3: restoring config space at offset 0x28 (was 0x60, writing 0x60)
[ 63.627034] pcieport 0000:00:07.3: restoring config space at offset 0x24 (was 0x7bf16001, writing 0x7bf16001)
[ 63.627036] pcieport 0000:00:07.0: restoring config space at offset 0x24 (was 0x1bf10001, writing 0x1bf10001)
[ 63.632439] pcieport 0000:00:07.0: PME# disabled
[ 63.632465] pcieport 0000:00:07.2: restoring config space at offset 0x2c (was 0x60, writing 0x60)
[ 63.632467] pcieport 0000:00:07.2: restoring config space at offset 0x28 (was 0x60, writing 0x60)
[ 63.632468] pcieport 0000:00:07.2: restoring config space at offset 0x24 (was 0x5bf14001, writing 0x5bf14001)
[ 63.634219] pcieport 0000:00:07.3: PME# disabled
[ 63.634324] pcieport 0000:82:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
[ 63.634333] pcieport 0000:82:00.0: restoring config space at offset 0x2c (was 0x0, writing 0x60)
[ 63.634337] pcieport 0000:82:00.0: restoring config space at offset 0x28 (was 0x0, writing 0x60)
[ 63.634341] pcieport 0000:82:00.0: restoring config space at offset 0x24 (was 0x10001, writing 0x7bf16001)
[ 63.634344] pcieport 0000:82:00.0: restoring config space at offset 0x20 (was 0x0, writing 0x5c105000)
[ 63.634348] pcieport 0000:82:00.0: restoring config space at offset 0x1c (was 0x101, writing 0xb171)
[ 63.634352] pcieport 0000:82:00.0: restoring config space at offset 0x18 (was 0x0, writing 0xac8382)
[ 63.634363] pcieport 0000:82:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100007)
[ 63.634852] pcieport 0000:00:07.1: restoring config space at offset 0x2c (was 0x60, writing 0x60)
[ 63.634854] pcieport 0000:00:07.1: restoring config space at offset 0x28 (was 0x60, writing 0x60)
[ 63.634856] pcieport 0000:00:07.1: restoring config space at offset 0x24 (was 0x3bf12001, writing 0x3bf12001)
[ 63.636163] pcieport 0000:00:07.2: PME# disabled
[ 63.636320] pcieport 0000:82:00.0: PME# disabled
[ 63.636451] pcieport 0000:83:02.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
[ 63.636473] pcieport 0000:83:02.0: restoring config space at offset 0x2c (was 0x0, writing 0x60)
[ 63.636480] pcieport 0000:83:02.0: restoring config space at offset 0x28 (was 0x0, writing 0x60)
[ 63.636488] pcieport 0000:83:02.0: restoring config space at offset 0x24 (was 0x10001, writing 0x60116011)
[ 63.636495] pcieport 0000:83:02.0: restoring config space at offset 0x20 (was 0x0, writing 0x50105010)
[ 63.636502] pcieport 0000:83:02.0: restoring config space at offset 0x1c (was 0x101, writing 0x8181)
[ 63.636510] pcieport 0000:83:02.0: restoring config space at offset 0x18 (was 0x0, writing 0x858583)
[ 63.636529] pcieport 0000:83:02.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
[ 63.636575] pcieport 0000:83:04.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
[ 63.636597] pcieport 0000:83:04.0: restoring config space at offset 0x2c (was 0x0, writing 0x60)
[ 63.636603] pcieport 0000:83:04.0: restoring config space at offset 0x28 (was 0x0, writing 0x60)
[ 63.636609] pcieport 0000:83:04.0: restoring config space at offset 0x24 (was 0x10001, writing 0x7bf16021)
[ 63.636615] pcieport 0000:83:04.0: restoring config space at offset 0x20 (was 0x0, writing 0x5c105020)
[ 63.636621] pcieport 0000:83:04.0: restoring config space at offset 0x1c (was 0x101, writing 0xb191)
[ 63.636627] pcieport 0000:83:04.0: restoring config space at offset 0x18 (was 0x0, writing 0xac8683)
[ 63.636641] pcieport 0000:83:04.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
[ 63.639915] pcieport 0000:83:01.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
[ 63.639926] pcieport 0000:83:01.0: restoring config space at offset 0x2c (was 0x0, writing 0x60)
[ 63.639931] pcieport 0000:83:01.0: restoring config space at offset 0x28 (was 0x0, writing 0x60)
[ 63.639936] pcieport 0000:83:01.0: restoring config space at offset 0x24 (was 0x10001, writing 0x60016001)
[ 63.639941] pcieport 0000:83:01.0: restoring config space at offset 0x20 (was 0x0, writing 0x50005000)
[ 63.639946] pcieport 0000:83:01.0: restoring config space at offset 0x1c (was 0x101, writing 0x7171)
[ 63.639951] pcieport 0000:83:01.0: restoring config space at offset 0x18 (was 0x0, writing 0x848483)
[ 63.639963] pcieport 0000:83:01.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
[ 63.640473] pcieport 0000:00:07.1: PME# disabled
[ 63.640515] pcieport 0000:83:04.0: PME# disabled
[ 63.640537] pcieport 0000:83:04.0: pciehp: Slot(4): Card not present
Here pciehp notices the card is not present and starts tearing down the
devices below.
[ 63.640539] pcieport 0000:87:04.0: PME# disabled
[ 63.648104] pcieport 0000:83:04.0: pciehp: pciehp_unconfigure_device: domain:bus:dev = 0000:86:00
[ 63.648105] pcieport 0000:86:00.0: Refused to change power state, currently in D3
[ 63.656858] pcieport 0000:86:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x201ff)
[ 63.656858] pcieport 0000:86:00.0: restoring config space at offset 0x38 (was 0xffffffff, writing 0x0)
[ 63.656859] pcieport 0000:86:00.0: restoring config space at offset 0x34 (was 0xffffffff, writing 0x80)
[ 63.656859] pcieport 0000:86:00.0: restoring config space at offset 0x30 (was 0xffffffff, writing 0x0)
[ 63.656860] pcieport 0000:86:00.0: restoring config space at offset 0x2c (was 0xffffffff, writing 0x60)
[ 63.656860] pcieport 0000:86:00.0: restoring config space at offset 0x28 (was 0xffffffff, writing 0x60)
[ 63.656861] pcieport 0000:86:00.0: restoring config space at offset 0x24 (was 0xffffffff, writing 0x7bf16021)
[ 63.656862] pcieport 0000:86:00.0: restoring config space at offset 0x20 (was 0xffffffff, writing 0x5c105020)
[ 63.656862] pcieport 0000:86:00.0: restoring config space at offset 0x1c (was 0xffffffff, writing 0xb191)
[ 63.656863] pcieport 0000:86:00.0: restoring config space at offset 0x18 (was 0xffffffff, writing 0xac8786)
[ 63.656863] pcieport 0000:86:00.0: restoring config space at offset 0x14 (was 0xffffffff, writing 0x0)
[ 63.656864] pcieport 0000:86:00.0: restoring config space at offset 0x10 (was 0xffffffff, writing 0x0)
[ 63.656864] pcieport 0000:86:00.0: restoring config space at offset 0xc (was 0xffffffff, writing 0x10000)
[ 63.656865] pcieport 0000:86:00.0: restoring config space at offset 0x8 (was 0xffffffff, writing 0x6040006)
[ 63.656865] pcieport 0000:86:00.0: restoring config space at offset 0x4 (was 0xffffffff, writing 0x100007)
[ 63.656866] pcieport 0000:86:00.0: restoring config space at offset 0x0 (was 0xffffffff, writing 0x15ef8086)
[ 63.656883] pcieport 0000:86:00.0: PME# disabled
[ 63.656884] pcieport 0000:87:04.0: Refused to change power state, currently in D3
There is also another case which does not involve pciehp. The xHCI issue
Kai-Heng reported. In that case PCI core tries to access xHCI too soon
and fails to resume it:
[ 74.100492] pcieport 0000:04:02.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
[ 74.100498] pcieport 0000:04:02.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100406)
[ 74.100545] pcieport 0000:04:02.0: PME# disabled
[ 74.100554] xhci_hcd 0000:39:00.0: Refused to change power state, currently in D3
[ 74.102769] xhci_hcd 0000:39:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x1ff)
[ 74.102774] xhci_hcd 0000:39:00.0: restoring config space at offset 0x38 (was 0xffffffff, writing 0x0)
[ 74.102778] xhci_hcd 0000:39:00.0: restoring config space at offset 0x34 (was 0xffffffff, writing 0x80)
[ 74.102782] xhci_hcd 0000:39:00.0: restoring config space at offset 0x30 (was 0xffffffff, writing 0x0)
[ 74.102787] xhci_hcd 0000:39:00.0: restoring config space at offset 0x2c (was 0xffffffff, writing 0x9261028)
the full dmesg is here:
https://bugzilla.kernel.org/attachment.cgi?id=284427
> > > > > > +static int pcie_get_downstream_delay(struct pci_bus *bus)
> > > > > > +{
> > > > > > + struct pci_dev *pdev;
> > > > > > + int min_delay = 100;
> > > > > > + int max_delay = 0;
> > > > > > +
> > > > > > + list_for_each_entry(pdev, &bus->devices, bus_list) {
> > > > > > + if (pdev->imm_ready)
> > > > > > + min_delay = 0;
> > > > > > + else if (pdev->d3cold_delay < min_delay)
> > > > > > + min_delay = pdev->d3cold_delay;
> > > > > > + if (pdev->d3cold_delay > max_delay)
> > > > > > + max_delay = pdev->d3cold_delay;
> > > > > > + }
> > > > > > +
> > > > > > + return max(min_delay, max_delay);
> > > > > > +}
> > >
> > > > > > + */
> > > > > > +void pcie_wait_downstream_accessible(struct pci_dev *pdev)
> > >
> > > > > Do we need to observe the Trhfa requirements for Conventional PCI and
> > > > > PCI-X devices here? If we don't do it here, where is it observed?
> > > >
> > > > We probably should but I intended this to be PCIe specific since there
> > > > we have issues. For conventional PCI/PCI-X things "seem" to work and we
> > > > don't power manage those bridges anyway.
> > > >
> > > > I'm not aware if Trhfa is handled in anywhere in the PCI stack
> > > > currently.
> > >
> > > I think we should make this agnostic of the Conventional/PCIe
> > > difference if possible. I assume we can tell if we're dealing with a
> > > D3->D0 transition and we only add delays in that case. If we don't
> > > power manage Conventional PCI devices, I assume we won't see D3->D0
> > > transitions for runtime resume so there won't be any harm.
> > >
> > > Making it PCIe-specific seems like it adds extra code ("dev-is-PCIe
> > > checks") with no obvious reason for existence and an implicit
> > > dependency on the fact that we only power manage PCIe devices. If we
> > > ever *did* support runtime power-management for Conventional PCI, we'd
> > > trip over that implicit dependency and probably debug this issue
> > > again.
> > >
> > > But I guess it might slow down system resume for Conventional PCI
> > > devices. If we rely on delays in firmware, I wonder if there's
> > > any point during resume where we could grab an early timestamp, then
> > > take another timestamp here and deduce that we've already delayed the
> > > difference?
> >
> > That sounds rather complex, to be honest ;-)
>
> Maybe so, I was just trying to brainstorm possibilities for making
> sure we observe the delay requirements without slowing down resume.
>
> For example, if we have several devices on the same bus, we shouldn't
> have to do the delays serially; we should be able to take advantage of
> the fact that the Trhfa period starts at the same time for all of
> them.
Also we do async suspend these days for PCI devices so I think sibling
devices are already resumed concurrently.
> > > > > > + delay = pcie_get_downstream_delay(bus);
> > > > > > + if (!delay)
> > > > > > + return;
> > > > >
> > > > > I'm not sold on the idea that this delay depends on what's *below* the
> > > > > bridge. We're using sec 6.6.1 to justify the delay, and that section
> > > > > doesn't say anything about downstream devices.
> > > >
> > > > 6.6.1 indeed talks about Downstream Ports and devices immediately below
> > > > them.
> > >
> > > Wait, I don't think we're reading this the same way. 6.6.1 mentions
> > > Downstream Ports: "With a Downstream Port that does not support Link
> > > speeds greater than 5.0 GT/s, software must wait a minimum of 100 ms
> > > before sending a Configuration Request to the device immediately below
> > > that Port."
> > >
> > > This says we have to delay before sending a config request to a device
> > > below a Downstream Port, but it doesn't say anything about the
> > > characteristics of that device. In particular, I don't think it says
> > > the delay can be shortened if that device supports Immediate Readiness
> > > or implements a readiness _DSM.
> >
> > Well it says this:
> >
> > To allow components to perform internal initialization, system software
> > must wait a specified minimum period following the end of a Conventional
> > Reset of one or more devices before it is permitted to issue
> > Configuration Requests to those devices, unless Readiness Notifications
> > mechanisms are used
> >
> > My understanding of the above (might be wrong) is that Readiness
> > Notification can shorten the delay.
>
> Yeeesss, but if we're talking about transitioning device X from
> D3->D0, I think this is talking about config requests to device X,
> not to something downstream from X.
>
> And we have no support for Readiness Notifications, although maybe the
> _DSM stuff qualifies as "a mechanism that supersedes FRS and/or DRS"
> (as mentioned in 6.23).
>
> If device X was in D3cold, don't we have to assume that devices
> downstream from X may have been added/removed while X was in D3cold?
Yes, I think so.