PCI LTR - ASPM handling upon suspend / resume cycle. Regression since 4.18

From: Grumbach, Emmanuel
Date: Tue Jan 29 2019 - 02:05:21 EST


Hi,

Lately we (Intel) have received a few bug reports about suspend / resume.
The complaint is that our device becomes unavailable after a suspend /
resume cycle. The bug on which we have the most data is [1].

The original submitter reported a regression since commit
9ab105deb60fa76d66cae5548819b4e8703d2056:

PCI/ASPM: Disable ASPM L1.2 Substate if we don't have LTR

When in the ASPM L1.0 state (but not the PCI-PM L1.0 state), the most
recent LTR value and the LTR_L1.2_THRESHOLD determines whether the link
enters the L1.2 substate.

If we don't have LTR enabled, prevent the use of ASPM L1.2.

PCI-PM L1.2 may still be used because it doesn't depend on
LTR_L1.2_THRESHOLD (see PCIe r4.0, sec 5.5.1).


After this commit, ASPM L1.2 is disabled upon resume:
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+
T_CommonMode=0us LTR1.2_Threshold=163840ns

Whereas it wasn't before this commit:
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
T_CommonMode=0us LTR1.2_Threshold=163840ns
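
For anyone comparing the two dumps on their own machine, here is a minimal
Python sketch that parses the L1SubCtl1 line printed by lspci -vv and reports
which L1 substate flags differ. The function names are made up for this
example, and it only assumes the line format shown above.

```python
import re


def parse_l1ss_ctl1(line):
    """Return {flag_name: enabled} for an lspci 'L1SubCtl1:' line."""
    flags = {}
    # Each flag is printed as e.g. 'ASPM_L1.2+' (enabled) or 'ASPM_L1.2-' (disabled).
    for name, sign in re.findall(r"((?:PCI-PM|ASPM)_L1\.[12])([+-])", line):
        flags[name] = (sign == "+")
    return flags


def changed_flags(before, after):
    """Return the sorted names of flags whose state differs between two dumps."""
    return sorted(name for name in before if before[name] != after.get(name))


# The two lines quoted in this mail.
BEFORE_COMMIT = "L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+"
AFTER_COMMIT = "L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+"

if __name__ == "__main__":
    diff = changed_flags(parse_l1ss_ctl1(BEFORE_COMMIT),
                         parse_l1ss_ctl1(AFTER_COMMIT))
    print("flags that changed:", diff)
```

On the two lines above, only ASPM_L1.2 differs; the PCI-PM substates and
ASPM L1.1 are unchanged.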

I am copying here an initial analysis by Bjorn (from [2]):

1) Linux has no support for saving/restoring the Max Latency values in
the LTR Capability. This results in the latencies being zero after you
resume, as you see in the lspci output. The device still *works* after
resume, but power consumption should increase because the device is
effectively requesting the best possible service, so we probably don't
use the L1.2 state at all.

2) Linux has no support for programming the Max Latency values for
hot-added devices. When using ACPI hotplug, firmware may do this, but for
native PCIe hotplug (pciehp), the new device should again be requesting
the best possible service, resulting in more power consumption than
necessary. The platform is supposed to supply a _DSM method with
information required to program these values.
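
To illustrate why a zeroed register means "best possible service", here is a
small Python sketch that decodes a 16-bit Max Snoop / Max No-Snoop Latency
register, assuming the usual LTR encoding (bits 9:0 are the value, bits 12:10
the scale, and the latency is value * 2^(5 * scale) ns). The function name and
the non-zero sample value are hypothetical, not taken from the bug report.

```python
def decode_ltr_latency_ns(reg):
    """Decode a 16-bit LTR Max (No-)Snoop Latency register into nanoseconds.

    Assumed layout: bits 9:0 = latency value, bits 12:10 = latency scale.
    Scale n multiplies the value by 2^(5*n) ns: 1, 32, 1024, 32768, ...
    """
    value = reg & 0x3FF
    scale = (reg >> 10) & 0x7
    if scale > 5:
        # Scale encodings 6 and 7 are not permitted.
        raise ValueError("reserved LTR scale encoding: %d" % scale)
    return value * (1 << (5 * scale))


if __name__ == "__main__":
    # A register that was not restored after resume reads back as zero.
    print("after resume:", decode_ltr_latency_ns(0x0000), "ns")
    # Hypothetical firmware-programmed value: value=3, scale=4 (1048576 ns units).
    print("example programmed value:", decode_ltr_latency_ns(0x1003), "ns")
```

A register value of 0 decodes to 0 ns, i.e. the device tolerates no added
latency, which matches the lspci output after resume.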


Another user found a second commit that affects his device after suspend /
resume:
commit 6f9db69ad93cd6ab77d5571cf748ff7cdcfb0285

ACPI / PM: Default to s2idle in all machines supporting LP S0

The Dell Venue Pro 7140 supports the Low Power S0 Idle state, but
does not support any of the _DSM functions that the current heuristic
checks for.

Since suspend-to-mem can not be safely performed on this machine,
and since the bitfield check can't cover this case, it is safer
to enable s2idle by default by checking for the presence of the
_DSM alone and removing the bitfield check.

This user confirmed that using suspend-to-mem instead of
suspend-to-idle works for him.
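
For others who want to check which variant their machine uses, the kernel
exposes the available suspend modes in /sys/power/mem_sleep, with the active
one in square brackets (for example "[s2idle] deep"). The sketch below only
parses that format; the function name is made up for this example, and it
falls back to a sample string where the sysfs file is not readable.

```python
def active_mem_sleep(contents):
    """Return the bracketed (active) mode from /sys/power/mem_sleep contents."""
    for token in contents.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None  # no mode marked as active


if __name__ == "__main__":
    try:
        with open("/sys/power/mem_sleep") as f:
            text = f.read()
    except OSError:
        # Sample contents, used only when the sysfs file is unavailable.
        text = "[s2idle] deep\n"
    print("active suspend mode:", active_mem_sleep(text))
```

Here "s2idle" corresponds to suspend-to-idle and "deep" to suspend-to-mem.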

A user contacted me privately to let me know that he has issues with
devices from other vendors, although I can't tell whether the problem is
the same.

Note that this problem started with kernel 4.18.

Thank you.

[1] - https://bugzilla.kernel.org/show_bug.cgi?id=201469
[2] - https://bugzilla.kernel.org/show_bug.cgi?id=201469#c26