RE: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

From: Lazar, Lijo
Date: Mon Jan 24 2022 - 09:21:19 EST


[Public]

Not able to relate to how it affects gfx/mem DPM alone. Unless Alex has other ideas, would you be able to enable drm debug messages and share the log?

Enabling verbose debug messages is done through the drm.debug parameter, each category being enabled by a bit:

drm.debug=0x1 will enable CORE messages
drm.debug=0x2 will enable DRIVER messages
drm.debug=0x3 will enable CORE and DRIVER messages
...
drm.debug=0x1ff will enable all messages

An interesting feature is that it's possible to enable verbose logging at run-time by echoing the debug value in its sysfs node:

# echo 0xf > /sys/module/drm/parameters/debug
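
Since each category is a single bit, the value to pass can be built by OR-ing the bits together. A minimal sketch follows. The helper name drm_debug_mask is hypothetical, and the category names beyond CORE and DRIVER are the ones I recall from include/drm/drm_print.h, so they should be checked against the kernel actually running:

```shell
#!/bin/sh
# Hypothetical helper: print the drm.debug value for a list of category names.
# Bit assignments beyond CORE/DRIVER are an assumption taken from drm_print.h.
drm_debug_mask() {
    mask=0
    for name in "$@"; do
        case "$name" in
            CORE)   bit=1 ;;
            DRIVER) bit=2 ;;
            KMS)    bit=4 ;;
            PRIME)  bit=8 ;;
            ATOMIC) bit=16 ;;
            VBL)    bit=32 ;;
            STATE)  bit=64 ;;
            LEASE)  bit=128 ;;
            DP)     bit=256 ;;
            *) echo "unknown category: $name" >&2; return 1 ;;
        esac
        # OR the category bit into the running mask.
        mask=$((mask | bit))
    done
    printf '0x%x\n' "$mask"
}

drm_debug_mask CORE DRIVER   # prints 0x3
```

The printed value can then be written to /sys/module/drm/parameters/debug as root, or passed as drm.debug=... on the kernel command line.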

Thanks,
Lijo

-----Original Message-----
From: James Turner <linuxkernel.foss@xxxxxxxxxxxxxxxxxxxxxx>
Sent: Sunday, January 23, 2022 2:41 AM
To: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
Cc: Alex Deucher <alexdeucher@xxxxxxxxx>; Thorsten Leemhuis <regressions@xxxxxxxxxxxxx>; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; regressions@xxxxxxxxxxxxxxx; kvm@xxxxxxxxxxxxxxx; Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx>; Pan, Xinhui <Xinhui.Pan@xxxxxxx>; LKML <linux-kernel@xxxxxxxxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Alex Williamson <alex.williamson@xxxxxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

Hi Lijo,

> Could you provide the pp_dpm_* values in sysfs with and without the
> patch? Also, could you try forcing PCIE to gen3 (through pp_dpm_pcie)
> if it's not in gen3 when the issue happens?

AFAICT, I can't access those values while the AMD GPU PCI devices are bound to `vfio-pci`. However, I can at least access the link speed and width elsewhere in sysfs. So, I gathered what information I could for two different cases:

- With the PCI devices bound to `vfio-pci`. With this configuration, I
  can start the VM, but the `pp_dpm_*` values are not available since
  the devices are bound to `vfio-pci` instead of `amdgpu`.

- Without the PCI devices bound to `vfio-pci` (i.e. after removing the
  `vfio-pci.ids=...` kernel command line argument). With this
  configuration, I can access the `pp_dpm_*` values, since the PCI
  devices are bound to `amdgpu`. However, I cannot use the VM. If I try
  to start the VM, the display (both the external monitors attached to
  the AMD GPU and the built-in laptop display attached to the Intel
  iGPU) completely freezes.
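
For anyone reproducing the two configurations, which driver a device is currently bound to can be read from the `driver` symlink in its sysfs directory. A minimal sketch, assuming the same device address as in the output below. The function name pci_bound_driver and its optional second argument are hypothetical; the second argument exists only so the function can be pointed at a different directory tree:

```shell
#!/bin/sh
# Hypothetical helper: print the name of the driver bound to a PCI device,
# or "(none)" if the device has no driver symlink.
pci_bound_driver() {
    dev=$1
    root=${2:-/sys/bus/pci/devices}
    link="$root/$dev/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/vfio-pci
        basename "$(readlink "$link")"
    else
        echo "(none)"
    fi
}

# Example: pci_bound_driver 0000:01:00.0
```

In the first configuration this would be expected to report `vfio-pci`, and in the second `amdgpu`.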

The output shown below was identical for both the good commit:
f1688bd69ec4 ("drm/amd/amdgpu:save psp ring wptr to avoid attack")
and the commit which introduced the issue:
f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")

Note that the PCI link speed increased to 8.0 GT/s when the GPU was under heavy load for both versions, but the clock speeds of the GPU were different under load. (For the good commit, it was 1295 MHz; for the bad commit, it was 501 MHz.)


# With the PCI devices bound to `vfio-pci`

## Before starting the VM

% ls /sys/module/amdgpu/drivers/pci:amdgpu
module bind new_id remove_id uevent unbind

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
8.0 GT/s PCIe

## While running the VM, before placing the AMD GPU under heavy load

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
2.5 GT/s PCIe

## While running the VM, with the AMD GPU under heavy load

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
8.0 GT/s PCIe

## While running the VM, after stopping the heavy load on the AMD GPU

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
2.5 GT/s PCIe

## After stopping the VM

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
2.5 GT/s PCIe
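
The link state above was sampled by hand at each stage. A small helper along these lines could print speed and width on one line, which makes it easier to log the transitions while the VM is running. This is only a sketch: the function name pci_link_status and its optional second argument are hypothetical, and the second argument exists only so the function can be pointed at a different directory tree.

```shell
#!/bin/sh
# Hypothetical helper: print "<device>: <speed> x<width>" for a PCI device,
# read from the current_link_speed and current_link_width sysfs files.
pci_link_status() {
    dev=$1
    root=${2:-/sys/bus/pci/devices}
    speed=$(cat "$root/$dev/current_link_speed") || return 1
    width=$(cat "$root/$dev/current_link_width") || return 1
    printf '%s: %s x%s\n' "$dev" "$speed" "$width"
}

# Example: sample once per second while loading the GPU in the guest.
# while sleep 1; do pci_link_status 0000:01:00.0; done
```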


# Without the PCI devices bound to `vfio-pci`

% ls /sys/module/amdgpu/drivers/pci:amdgpu
0000:01:00.0 module bind new_id remove_id uevent unbind

% for f in /sys/module/amdgpu/drivers/pci:amdgpu/*/pp_dpm_*; do echo "$f"; cat "$f"; echo; done
/sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_mclk
0: 300Mhz
1: 625Mhz
2: 1500Mhz *

/sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_pcie
0: 2.5GT/s, x8
1: 8.0GT/s, x16 *

/sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_sclk
0: 214Mhz
1: 501Mhz
2: 850Mhz
3: 1034Mhz
4: 1144Mhz
5: 1228Mhz
6: 1275Mhz
7: 1295Mhz *
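
In the `pp_dpm_*` listings above, the currently selected level is the line ending in `*`. A minimal sketch for pulling out just that level, assuming the line format shown in this output (an index, a colon, the value, then ` *`). The function name active_dpm_level is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read a pp_dpm_* listing on stdin and print the value
# of the level marked with a trailing "*".
active_dpm_level() {
    # Strip the leading "<index>: " and the trailing " *"; print only
    # lines where that substitution succeeded.
    sed -n 's/^[0-9][0-9]*: *\(.*\) \*$/\1/p'
}

# Example:
# active_dpm_level < /sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_sclk
```

For the `pp_dpm_sclk` listing above this would print `1295Mhz`.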

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
8.0 GT/s PCIe


James