Re: [PATCH 1/2] drm/amdgpu: Reset GPU on S0ix when device supports BOCO

From: Kai-Heng Feng
Date: Wed Mar 29 2023 - 23:36:40 EST


On Wed, Mar 29, 2023 at 9:23 PM Mario Limonciello
<mario.limonciello@xxxxxxx> wrote:
>
>
> On 3/29/23 04:59, Kai-Heng Feng wrote:
> > When the power is lost due to ACPI power resources being turned off, the
> > driver should reset the GPU so it can work anew.
> >
> > First, _PR3 support of the hierarchy needs to be found correctly. Since
> > the GPU on some discrete GFX cards is behind a PCIe switch, checking the
> > _PR3 on downstream port alone is not enough, as the _PR3 can associate
> > to the root port above the PCIe switch.
>
> I think this should be split into two commits:
>
> * One of them to look at _PR3 further up in hierarchy to fix indication
> for BOCO support.

Yes, this part can be split up.

>
> * One to adjust policy for whether to reset

IIUC, the GPU only needs to be reset when the power status isn't certain?

Assuming power resources in _PR3 are really disabled, GPU is already
reset by itself. That means reset shouldn't be necessary for D3cold,
am I understanding it correctly?

However, this is a desktop plugged with GFX card that has external
power, does that assumption still stand? Perform resetting on D3cold
can cover this scenario.

>
>
> > Once the _PR3 is found and BOCO support is correctly marked, use that
> > information to inform the GPU should be reset. This solves an issue that
> > system freeze on a Intel ADL desktop that uses S0ix for sleep and D3cold
> > is supported for the GFX slot.
>
> I'm worried this is still papering over an underlying issue with L0s
> handling on ALD + Navi1x/Navi2x.

Is it possible to get the ASIC's ASPM parameter under Windows? Knowing
the difference can be useful.

>
> Also, what about runtime suspend? If you unplug the monitor from this
> dGPU and interact with it over SSH it should go into runtime suspend.
>
> Is it working properly for that case now?

Thanks for the tip. Runtime resume doesn't work at all:
[ 1087.601631] pcieport 0000:00:01.0: power state changed by ACPI to D0
[ 1087.613820] pcieport 0000:00:01.0: restoring config space at offset
0x2c (was 0x43, writing 0x43)
[ 1087.613835] pcieport 0000:00:01.0: restoring config space at offset
0x28 (was 0x41, writing 0x41)
[ 1087.613841] pcieport 0000:00:01.0: restoring config space at offset
0x24 (was 0xfff10001, writing 0xfff10001)
[ 1087.613978] pcieport 0000:00:01.0: PME# disabled
[ 1087.613984] pcieport 0000:00:01.0: waiting 100 ms for downstream
link, after activation
[ 1089.330956] pcieport 0000:01:00.0: not ready 1023ms after resume; giving up
[ 1089.373036] pcieport 0000:01:00.0: Unable to change power state
from D3cold to D0, device inaccessible

After a short while the whole system froze.

So the upstream port of GFX's PCIe switch cannot be powered on again.

Kai-Heng

>
> >
> > Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
> > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1885
> > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2458
> > Signed-off-by: Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c | 3 +++
> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 ++++++-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 +++++-------
> > 3 files changed, 14 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> > index 60b1857f469e..407456ac0e84 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> > @@ -987,6 +987,9 @@ bool amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev)
> > if (amdgpu_sriov_vf(adev))
> > return false;
> >
> > + if (amdgpu_device_supports_boco(adev_to_drm(adev)))
> > + return true;
> > +
> > #if IS_ENABLED(CONFIG_SUSPEND)
> > return pm_suspend_target_state != PM_SUSPEND_TO_IDLE;
> > #else
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index f5658359ff5c..d56b7a2bafa6 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -2181,7 +2181,12 @@ static int amdgpu_device_ip_early_init(struct amdgpu_device *adev)
> >
> > if (!(adev->flags & AMD_IS_APU)) {
> > parent = pci_upstream_bridge(adev->pdev);
> > - adev->has_pr3 = parent ? pci_pr3_present(parent) : false;
> > + do {
> > + if (pci_pr3_present(parent)) {
> > + adev->has_pr3 = true;
> > + break;
> > + }
> > + } while ((parent = pci_upstream_bridge(parent)));
> > }
> >
> > amdgpu_amdkfd_device_probe(adev);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index ba5def374368..5d81fcac4b0a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -2415,10 +2415,11 @@ static int amdgpu_pmops_suspend(struct device *dev)
> > struct drm_device *drm_dev = dev_get_drvdata(dev);
> > struct amdgpu_device *adev = drm_to_adev(drm_dev);
> >
> > - if (amdgpu_acpi_is_s0ix_active(adev))
> > - adev->in_s0ix = true;
> > - else if (amdgpu_acpi_is_s3_active(adev))
> > + if (amdgpu_acpi_is_s3_active(adev) ||
> > + amdgpu_device_supports_boco(drm_dev))
> > adev->in_s3 = true;
> > + else if (amdgpu_acpi_is_s0ix_active(adev))
> > + adev->in_s0ix = true;
> > if (!adev->in_s0ix && !adev->in_s3)
> > return 0;
> > return amdgpu_device_suspend(drm_dev, true);
> > @@ -2449,10 +2450,7 @@ static int amdgpu_pmops_resume(struct device *dev)
> > adev->no_hw_access = true;
> >
> > r = amdgpu_device_resume(drm_dev, true);
> > - if (amdgpu_acpi_is_s0ix_active(adev))
> > - adev->in_s0ix = false;
> > - else
> > - adev->in_s3 = false;
> > + adev->in_s0ix = adev->in_s3 = false;
> > return r;
> > }
> >