Re: [PATCH v3] PCI/ACPI: PCI/ACPI: Validate devices with power resources support D3

From: Limonciello, Mario
Date: Wed Oct 26 2022 - 15:42:29 EST


On 10/26/2022 14:09, Bjorn Helgaas wrote:
Hi Mario,

Thanks for expanding the commit log. I'm sure this patch is the right
thing to do; I just want to connect a few more dots to make it less
likely that we'll break this in the future.

OK. I'll pick up those two tags from Mika and Rafael and try to reword the commit message a bit.


On Tue, Oct 25, 2022 at 05:10:54PM -0500, Mario Limonciello wrote:
Firmware typically advertises that PCIe devices can support D3
by a combination of the value returned by _S0W as well as the
HotPlugSupportInD3 _DSD.

All PCI devices are required to support both D3hot and D3cold (PCIe
r6.0, sec 5.3.1.4), so I think what's being advertised here is about
what firmware can do (which of course implicitly depends on controls
provided by the platform hardware), not what the *device* supports.

The OS can put a device in D3hot by itself with the PM Control
register, so I assume the important thing here is whether firmware has
interfaces to put a device in D3cold and bring it back to D0.

If you completely ignore _S0W, sure you can put the device into D3hot by changing this register but while in the firmware configuration I describe you won't get wake events to pull you out of it.

So the PCIe device will stay in this state until the OS does something.


I know we only get to acpi_pci_bridge_d3() for PCIe devices, but when
the device properties and ACPI interfaces are not PCIe-specific, I
don't think we should restrict it by saying "PCIe".

Thanks for clarifying. I'm thinking I'll s/PCIe/ACPI/ in the commit
message to convey this.


`acpi_pci_bridge_d3` looks for this combination but also contains
an assumption that if a device contains power resources it can support
D3. This was introduced from commit c6e331312ebf ("PCI/ACPI: Whitelist
hotplug ports for D3 if power managed by ACPI").

On some firmware configurations for "AMD Pink Sardine" D3 is not
supported for wake in _S0W for the PCIe root port for tunneling.
However the device will still be opted into runtime PM since
`acpi_pci_bridge_d3` returns since the ACPI device contains power
resources.

When the thunderbolt driver is loaded a device link between the USB4
router and the PCIe root port for tunneling is created where the PCIe
root port for tunneling is the consumer and the USB4 router is the
supplier. Here is a demonstration of this topology that occurs:

├─ 0000:00:03.1
| | ACPI Path: \_SB_.PCI0.GP11 (Supports "0" in _S0W)
| | Device Links: supplier:pci:0000:c4:00.5
| └─ D0 (Runtime PM enabled)
├─ 0000:00:04.1
| | ACPI Path: \_SB_.PCI0.GP12 (Supports "0" in _S0W)
| | Device Links: supplier:pci:0000:c4:00.6
| └─ D0 (Runtime PM enabled)
├─ 0000:00:08.3
| | ACPI Path: \_SB_.PCI0.GP19
| ├─ D0 (Runtime PM disabled)
| ├─ 0000:c4:00.3
| | | ACPI Path: \_SB_.PCI0.GP19.XHC3
| | | Device Links: supplier:pci:0000:c4:00.5
| | └─ D3cold (Runtime PM enabled)
| ├─ 0000:c4:00.4
| | | ACPI Path: \_SB_.PCI0.GP19.XHC4
| | | Device Links: supplier:pci:0000:c4:00.6
| | └─ D3cold (Runtime PM enabled)
| ├─ 0000:c4:00.5
| | | ACPI Path: \_SB_.PCI0.GP19.NHI0 (Supports "4" in _S0W)
| | | Device Links: consumer:pci:0000:00:03.1 consumer:pci:0000:c4:00.3
| | └─ D3cold (Runtime PM enabled)
| └─ 0000:c4:00.6
| | ACPI Path: \_SB_.PCI0.GP19.NHI1 (Supports "4" in _S0W)
| | Device Links: consumer:pci:0000:c4:00.4 consumer:pci:0000:00:04.1
| └─ D3cold (Runtime PM enabled)

Can you label the devices above to correspond with the preceding
paragraph? I assume one of the XHC devices is the USB4 router, but I
don't know which is the Root Port.

In the above example there are 2 USB4 routers, 2 PCIe root ports used for tunneling and 2 XHCI PCIe devices.


Are all the devices relevant to the problem? If not, prune the ones
that don't matter. It looks like the domain ("0000") could also be
pruned out.

They're all relevant because links are made between them. If the XHCI
PCIe device was not in runtime PM, it could prevent the router from going into runtime PM as well just the same.

I could remove one triplet of devices to keep it simpler (USB4 router, XHCI PCIe device and PCIe root port for tunneling) but all systems I've seen have 2.


If you also include the _PR0 and/or _PS0 methods, we'll be able to see
why the current code doesn't do what we want and why the new code
will.

OK, I'll add a line/assertion which have _PS0/_PR0.


What determines the device links? I assume there's some ACPI
information that connects the USB4 router with the Root Port?

They're created when the firmware node has "usb4-host-interface". I'll add some lines to describe this as well to the commit message.

https://github.com/torvalds/linux/blob/v6.1-rc1/drivers/thunderbolt/acpi.c#L29


What are the "D0" and "D3cold" annotations telling me? What does
"runtime PM enabled" mean? Is that determined based on some ACPI
methods?

D0/D3cold are tell you at rest what power states each ACPI device is in as read from /sys/bus/pci/devices/{bdf}/power_state

The USB4 routers are in D3cold, which shouldn't have occurred because the PCIe root ports for tunneling are in D0.

This only happened because runtime PM was enabled on the PCIe root port for tunneling, and runtime PM was enabled because acpi_pci_bridge_d3 asserted that it supported it.


Allowing the PCIe root port for tunneling to go into runtime PM (even if
it doesn't support D3) allows the USB4 router to also go into runtime PM.
The PCIe root port for tunneling stays in D0 but is in runtime PM. Due to
the device link the USB4 router transitions to D3cold when this happens.

It's probably obvious to PM folks what "going into runtime PM" means,
but it would help me out to describe it in terms of the hardware state
of the device, e.g., D3hot or whatever.

I think a simplified description here is "the device has been put into the deepest sleep state that it can wake itself from at runtime".

Suppliers can't go into runtime PM until all consumers of the device have done so. If a consumer doesn't support runtime PM then it will block the supplier from entering runtime PM.


The expectation is the USB4 router should have also remained in D0 since
the PCIe root port for tunneling remained in D0.

Instead of making this assertion from the power resources check
immediately, move the check to later on, which will have validated
that the device supports wake from D3hot or D3cold.

This fix prevents the USB4 router going into D3 when the firmware says that
the PCIe root port for tunneling can't handle it while still allowing
system that don't have the HotplugSupportInD3 _DSD to also enter D3 if they
have power resources that can wake from D3.

I guess there's a theme here of looking for concrete terms that I can
connect directly to the spec vs abstract things like "going into
runtime PM" or "root port can't handle D3" (which I think is actually
saying something about what *firmware* can do).

The UBS4 CM spec doesn't describe the ACPI relationships.
So I'm afraid the best I can come up with is what Microsoft says:

"For the PCIe and USB 3.x software stacks to establish power relations with the USB4 host router, device-specific data (_DSD) for the tunneled PCIe and USB 3.x ports is required. In the absence of this, the USB4 domain may power down without coordinating with the tunneled PCIe and USB 3.x devices."

https://learn.microsoft.com/en-us/windows-hardware/design/component-guidelines/usb4-acpi-requirements#port-mapping-_dsd-for-usb-3x-and-pcie

"Runtime PM" is the power relation that is used for Linux.

Back when we did dff6139015dc6 earlier this year the Yellow Carp systems didn't have power resources declared in this particularly firmware configuration so the behavior caused by c6e331312ebf didn't negatively affect anything.

If they did, I'd like to think I would have done this right the first time =).


Fixes: dff6139015dc6 ("PCI/ACPI: Allow D3 only if Root Port can signal and wake from D3")
Signed-off-by: Mario Limonciello <mario.limonciello@xxxxxxx>
---
v2->v3:
* Reword commit message
v1->v2:
* Just return value of acpi_pci_power_manageable (Rafael)
* Remove extra word in commit message
---
drivers/pci/pci-acpi.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index a46fec776ad77..8c6aec50dd471 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -984,10 +984,6 @@ bool acpi_pci_bridge_d3(struct pci_dev *dev)
if (acpi_pci_disabled || !dev->is_hotplug_bridge)
return false;
- /* Assume D3 support if the bridge is power-manageable by ACPI. */
- if (acpi_pci_power_manageable(dev))
- return true;
-
rpdev = pcie_find_root_port(dev);
if (!rpdev)
return false;
@@ -1023,7 +1019,8 @@ bool acpi_pci_bridge_d3(struct pci_dev *dev)
obj->integer.value == 1)
return true;
- return false;
+ /* Assume D3 support if the bridge is power-manageable by ACPI. */
+ return acpi_pci_power_manageable(dev);
}
int acpi_pci_set_power_state(struct pci_dev *dev, pci_power_t state)
--
2.34.1