Re: [PATCH v2] PCI: pciehp: Optimize PCIe root resume time

From: Lukas Wunner
Date: Thu Jan 19 2017 - 00:15:08 EST


On Thu, Jan 19, 2017 at 02:57:26AM +0000, Shankar, Vaibhav wrote:
> > From: Lukas Wunner [mailto:lukas@xxxxxxxxx]
> > Sent: Tuesday, January 17, 2017 9:14 PM
> > On Wed, Jan 18, 2017 at 01:32:13AM +0000, Shankar, Vaibhav wrote:
> > > > From: Bjorn Helgaas [mailto:helgaas@xxxxxxxxxx]
> > > > Sent: Wednesday, January 11, 2017 10:37 AM
> > > > On Mon, Dec 12, 2016 at 04:32:25PM -0800, Vaibhav Shankar wrote:
> > > > > On Apollolake platforms, PCIe rootport takes a long time to resume
> > > > > from S3. With 100ms delay before read pci conf, rootport takes
> > > > > ~200ms during resume.
> > > > >
> > > > > commit 2f5d8e4ff947 ("PCI: pciehp: replace unconditional sleep
> > > > > with config space access check") is the one that added the 100ms
> > > > > delay before reading pci conf.
> > > > >
> > > > > This patch includes a condition check for the 100ms delay before
> > > > > reading PCIe conf. This delay is included only when PCIe
> > > > > max_bus_speed > 5.0 GT/s. Root port takes ~16ms during resume.
> > > >
> > > > This patch reduces the delay by 100ms for devices that don't support
> > > > 5.0 GT/s. Please include references to the specs about the
> > > > necessary delays and explain why we don't need this 100ms delay.
> > > >
> > > > Presumably there's something in the spec about needing extra delay
> > > > when supporting 5.0 GT/s.
> > > >
> > > > This is generic code, so we can't make changes based on specific
> > > > devices like Apollolake. We have to make the code follow the spec
> > > > so it works for everybody.
> > > >
> > > > > With 100ms delay:
> > > > > [ 155.102713] calling 0000:00:14.0+ @ 70, parent: pci0000:00, cb: pci_pm_resume_noirq
> > > > > [ 155.119337] call 0000:00:14.0+ returned 0 after 16231 usecs
> > > > > [ 155.119467] calling 0000:01:00.0+ @ 5845, parent: 0000:00:14.0, cb: pci_pm_resume_noirq
> > > > > [ 155.321670] call 0000:00:14.0+ returned 0 after 185327 usecs
> > > > > [ 155.321743] calling 0000:01:00.0+ @ 5849, parent: 0000:00:14.0, cb: pci_pm_resume
> > > > >
> > > > > With Condition check:
> > > > > [ 36.624709] calling 0000:00:14.0+ @ 4434, parent: pci0000:00, cb: pci_pm_resume_noirq
> > > > > [ 36.641367] call 0000:00:14.0+ returned 0 after 16263 usecs
> > > > > [ 36.652458] calling 0000:00:14.0+ @ 4443, parent: pci0000:00, cb: pci_pm_resume
> > > > > [ 36.652673] call 0000:00:14.0+ returned 0 after 208 usecs
> > > > > [ 36.652863] calling 0000:01:00.0+ @ 4442, parent: 0000:00:14.0, cb: pci_pm_resume
> > > > >
> > > > > Signed-off-by: Vaibhav Shankar <vaibhav.shankar@xxxxxxxxx>
> > > > > ---
> > > > > changes in v2:
> > > > > - Modify patch description.
> > > > > - Add condition check for 100ms delay before read pci conf as
> > > > > suggested by Yinghai.
> > > > >
> > > > > drivers/pci/hotplug/pciehp_hpc.c | 11 +++++++++--
> > > > > 1 file changed, 9 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> > > > > index b57fc6d..2b10e5f 100644
> > > > > --- a/drivers/pci/hotplug/pciehp_hpc.c
> > > > > +++ b/drivers/pci/hotplug/pciehp_hpc.c
> > > > > @@ -311,8 +311,15 @@ int pciehp_check_link_status(struct controller *ctrl)
> > > > > else
> > > > > msleep(1000);
> > > > >
> > > > > - /* wait 100ms before read pci conf, and try in 1s */
> > > > > - msleep(100);
> > > > > + /*
> > > > > + * If the port supports Link speeds greater than 5.0 GT/s, we
> > > > > + * must wait for 100 ms after Link training completes before
> > > > > + * sending configuration request.
> > > > > + */
> > > > > + if (ctrl->pcie->port->subordinate->max_bus_speed > PCIE_SPEED_5_0GT)
> > > > > + msleep(100);
> > > > > +
> > > > > + /* try in 1s */
> > > > > found = pci_bus_check_dev(ctrl->pcie->port->subordinate,
> > > > > PCI_DEVFN(0, 0));
> > > > >
> > >
> > > Please find the details regarding delays from PCIe spec 3.0:
> > >
> > > 1) With a Downstream Port that does not support Link speeds greater
> > > than 5.0 GT/s, software must wait a minimum of 100 ms before sending
> > > a Configuration Request to the device immediately below that Port.
> > >
> > > 2) With a Downstream Port that supports Link speeds greater than 5.0
> > > GT/s, software must wait a minimum of 100 ms after Link training
> > > completes before sending a Configuration Request to the device
> > > immediately below that Port. Software can determine when Link
> > > training completes by polling the Data Link Layer Link Active bit
> > > or by setting up an associated interrupt (see Section 6.7.3.3).
> > >
> > > 3) A system must guarantee that all components intended to be software
> > > visible at boot time are ready to receive Configuration Requests
> > > within the applicable minimum period based on the end of Conventional
> > > Reset at the Root Complex - how this is done is beyond the scope
> > > of this specification.
> > >
> > > 4) Note: Software should use 100 ms wait periods only if software
> > > enables CRS Software Visibility. Otherwise, Completion timeouts,
> > > platform timeouts, or lengthy processor instruction stalls may result.
> > > See the Configuration Request Retry Status Implementation Note in
> > > Section 2.3.1.
> > >
> > > The spec says we have to wait for 100ms before sending a
> > > configuration request to the device.
> > > On older platforms like Skylake, PCIe was never suspended during
> > > S3 because PCIe was not on the Vnn rail. Hence this delay never
> > > impacted S3 resume.
> > >
> > > On newer platforms like Apollolake, the PCIe IP is on the Vnn rail.
> > > When PCIe root ports are suspended during S3, the 100ms delay is in
> > > the critical path during PCIe root port resume. This delay impacts
> > > S3 kernel resume time by ~60ms.
> >
> >
> > You did not provide the section number in the spec for the paragraphs you
> > quoted. The section number is 6.6.
> >
> > In the paragraphs you quoted, it says that a minimum of 100 ms is required
> > both for link speeds < 5 GT/s and > 5 GT/s, so why remove it for the < 5 GT/s
> > case?
> >
> > pciehp_check_link_status() is only executed when a new device is
> > hotplugged to a running system, yet you claim that your patch solves an
> > issue during resume. However when coming out of resume, we walk down
> > the hierarchy in:
> >
> > pci_pm_resume_noirq
> > pci_pm_default_resume_early
> > pci_power_up
> > pci_raw_set_power_state
> > pci_update_current_state
> > pci_restore_state
> >
> > AFAICS we're not performing the required delays and link active polling
> > there. In fact I'm often seeing issues on my Light Ridge thunderbolt
> > controller where devices fail to come out of D3 because we apparently don't
> > wait long enough for the link to go up before writing to their PMCSR.
>
> From the Analyze suspend logs I see the following function calls during S3 resume.
>
> During root port "resume"
> pci_pm_resume
> pcie_port_device_resume
> pciehp_resume
> pcie_enable_slot
> pciehp_check_link_status
> msleep(100) --> delay before read_pcie conf
>
> The dmesg logs also show that by removing the msleep delay we are able to improve PCIe root port resume time from ~185ms to ~16ms.
>
> I understand that this delay is part of the spec. When we suspend root ports using these patches (http://www.spinics.net/lists/linux-pci/msg49313.html), we see that it adds considerable delay during PCIe root port resume.
>
> 1) Could you please suggest an alternate way to reduce root port resume time?
>
> 2) If it is not possible to remove this delay, could you please suggest whether we can make this change local to our kernel? Our devices are able to come out of D3 and we see no issues.


As I've explained in my previous e-mail, the devices below hotplug ports
are already accessed during the ->resume_noirq phase. To conform to the
spec we should perform the required delays there. AFAICS we don't, so
we currently do not conform to the spec.

This applies to devices that were already attached before suspend.
The delay in pciehp_check_link_status() is only needed for devices
that were attached while the machine was asleep and that are newly
discovered during ->resume, as well as for devices attached at
runtime. The delay there is unnecessary for devices that were
already attached before suspend because the link should already be up.

This hopefully gives you enough hints to write a patch that lets us
conform to the spec and avoid unnecessary delays.
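
To illustrate the rule of thumb above, here is a minimal userspace
sketch. It is not kernel code and not a proposed patch. The enum, the
macro and config_wait_ms() are hypothetical names invented for this
example. It only models which of the two code paths would have to wait
the 100 ms from section 6.6, assuming that a device attached before
suspend is first accessed in ->resume_noirq and that a newly attached
device is first accessed in pciehp_check_link_status():

```c
#include <assert.h>
#include <stdbool.h>

/* Minimum wait before the first Configuration Request (PCIe 3.0, sec. 6.6). */
#define CONFIG_WAIT_MS 100u

/* Hypothetical labels for the two places a child device is first accessed. */
enum access_phase {
	PHASE_RESUME_NOIRQ,	/* models the pci_pm_resume_noirq() path */
	PHASE_HOTPLUG_CHECK,	/* models pciehp_check_link_status() */
};

/*
 * Return how long to wait (in ms) before touching the child's config
 * space in the given phase. Illustrative model only.
 */
static unsigned int config_wait_ms(enum access_phase phase,
				   bool attached_before_suspend)
{
	assert(phase == PHASE_RESUME_NOIRQ || phase == PHASE_HOTPLUG_CHECK);

	/* Only devices already known before suspend are accessed here. */
	if (phase == PHASE_RESUME_NOIRQ)
		return attached_before_suspend ? CONFIG_WAIT_MS : 0;

	/* Hotplug check: a pre-existing device's link should already be up. */
	return attached_before_suspend ? 0 : CONFIG_WAIT_MS;
}
```

In this model a device that was attached before suspend waits once, in
the noirq phase, and skips the delay in the hotplug check. A device
plugged in while the machine was asleep waits only in the hotplug check.
A real patch would additionally have to poll for link active on ports
supporting more than 5.0 GT/s, which this sketch leaves out.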

BTW it would be great if you could configure Outlook to limit lines to
80 chars. Please also avoid quoting the header of e-mails you reply to.

Thanks,

Lukas