Re: [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

From: Mika Westerberg
Date: Thu Nov 21 2019 - 06:46:17 EST


On Thu, Nov 21, 2019 at 12:34:22PM +0100, Rafael J. Wysocki wrote:
> On Thu, Nov 21, 2019 at 12:28 PM Mika Westerberg
> <mika.westerberg@xxxxxxxxx> wrote:
> >
> > On Wed, Nov 20, 2019 at 11:29:33PM +0100, Rafael J. Wysocki wrote:
> > > > last week or so I found systems where the GPU was under the "PCI
> > > > Express Root Port" (name from lspci) and on those systems all of that
> > > > seems to work. So I am wondering if it's indeed just the 0x1901 one,
> > > > which also explains Mikas case that Thunderbolt stuff works as devices
> > > > never get populated under this particular bridge controller, but under
> > > > those "Root Port"s
> > >
> > > It always is a PCIe port, but its location within the SoC may matter.
> >
> > Exactly. Intel hardware has PCIe ports on CPU side (these are called
> > PEG, PCI Express Graphics, ports), and the PCH side. I think the IP is
> > still the same.
> >
> > > Also some custom AML-based power management is involved and that may
> > > be making specific assumptions on the configuration of the SoC and the
> > > GPU at the time of its invocation which unfortunately are not known to
> > > us.
> > >
> > > However, it looks like the AML invoked to power down the GPU from
> > > acpi_pci_set_power_state() gets confused if it is not in PCI D0 at
> > > that point, so it looks like that AML tries to access device memory on
> > > the GPU (beyond the PCI config space) or similar which is not
> > > accessible in PCI power states below D0.
> >
> > Or the PCI config space of the GPU when the parent root port is in D3hot
> > (as it is the case here). Also then the GPU config space is not
> > accessible.
>
> Why would the parent port be in D3hot at that point? Wouldn't that be
> a suspend ordering violation?

No. We put the GPU into D3hot first, then the root port and then turn
off the power resource (which is attached to the root port) resulting
the topology entering D3cold.

> > I took a look at the HP Omen ACPI tables which has similar problem and
> > there is also check for Windows 7 (but not Linux) so I think one
> > alternative workaround would be to add these devices into
> > acpi_osi_dmi_table[] where .callback is set to dmi_disable_osi_win8 (or
> > pass 'acpi_osi="!Windows 2012"' in the kernel command line).
>
> I'd like to understand the facts that have been established so far
> before deciding what to do about them. :-)

Yes, I agree :)