Re: [PATCH v4 2/2] PCI: Enable NO_BUS_RESET quirk for Nvidia GPUs

From: Pali Rohár
Date: Wed May 05 2021 - 08:16:59 EST


On Friday 30 April 2021 17:11:23 Shanker R Donthineni wrote:
> Thanks, Bjorn, for reviewing the patch.
>
> On 4/30/21 12:01 PM, Bjorn Helgaas wrote:
> >
> > On Wed, Apr 28, 2021 at 07:49:07PM -0500, Shanker Donthineni wrote:
> >> On select platforms, some Nvidia GPU devices do not work with SBR.
> >> Triggering SBR leaves the device inoperable for the rest of the
> >> current system boot; a hard reboot is required to bring the GPU back
> >> to a normal operating condition. For the affected devices, enable
> >> the NO_BUS_RESET quirk to prevent SBR.
> > Since 1/2 adds _RST support, should I infer that _RST works on these
> > Nvidia GPUs even though SBR does not? If so, how does _RST do the
> > reset?
> Yes, the _RST method works, but SBR does not. The _RST method in the
> DSDT AML uses platform-specific initialization steps outside of the GPU
> BARs to reset the GPU device.

Hello! If I understood this "reset" issue correctly, it means that the
affected PCIe GPU device cannot be reset via PCI Secondary Bus Reset
(which triggers a PCIe Hot Reset) and that some special, platform-specific
reset type needs to be issued instead.

And the code for this platform-specific reset is included in the ACPI
DSDT table.

But because the ACPI DSDT table is part of the BIOS/firmware and not part
of the PCIe GPU device itself, this kind of reset is available to the
Linux kernel only if the motherboard vendor (or whoever flashes the
BIOS/firmware into the motherboard EEPROM) includes this specific code in
the firmware. Am I right?

So if this PCIe GPU device is installed in another motherboard or another
system, this special platform reset in the ACPI DSDT is not available.

What does the default ACPI _RST() method do on motherboards without this
special platform reset hook? It probably would not be able to reset these
PCIe GPU devices if a standard SBR cannot reset them.

Would it not be better to include "native" Linux code for resetting these
PCIe devices?

Please correct me if I'm wrong in my assumption or if I understood this
issue incorrectly.

> > Do you have a root cause for why SBR doesn't work?
> It is a hardware-implementation-specific issue. The GPU endpoint device
> is inoperative after receiving an SBR from the Root Port/Switch Port. This
> quirk prevents SBR.
>
> > I'm not super
> > confident that we perform resets correctly in general, and if the
> > problem is an issue in Linux, it'd be nice to fix that.
> We have not seen any issues with the Linux SBR implementation.
> >
> >> This issue will be fixed in the next generation of hardware.