Re: [PATCH v3] PCI: vmd: Honor ACPI _OSC on PCIe features

From: Jonathan Derrick
Date: Thu Feb 10 2022 - 12:59:52 EST

On 2/9/2022 2:36 PM, Bjorn Helgaas wrote:
On Tue, Dec 07, 2021 at 02:15:04PM +0100, Rafael J. Wysocki wrote:
On Tue, Dec 7, 2021 at 12:12 AM Keith Busch <kbusch@xxxxxxxxxx> wrote:
On Fri, Dec 03, 2021 at 11:15:41AM +0800, Kai-Heng Feng wrote:
When Samsung PCIe Gen4 NVMe is connected to Intel ADL VMD, the
combination causes AER message flood and drags the system performance
down.

The issue doesn't happen when VMD mode is disabled in BIOS, since AER
isn't enabled by acpi_pci_root_create(). When VMD mode is enabled, AER
is enabled regardless of _OSC:
[ 0.410076] acpi PNP0A08:00: _OSC: platform does not support [AER]
...
[ 1.486704] pcieport 10000:e0:06.0: AER: enabled with IRQ 146

Since VMD is an aperture to regular PCIe root ports, honor ACPI _OSC to
disable PCIe features accordingly to resolve the issue.

At least for some versions of this hardware, I recall ACPI is unaware of
any devices in the VMD domain; the platform cannot see past the VMD
endpoint, so I thought the driver was supposed to always let the VMD
domain use OS native support regardless of the parent's ACPI _OSC.

This is orthogonal to whether or not ACPI is aware of the VMD domain
or the devices in it.

If the platform firmware does not allow the OS to control specific
PCIe features at the physical host bridge level, that extends to the
VMD "bus", because it is just a way to expose a hidden part of the
PCIe hierarchy.

I don't understand what's going on here. Do we understand the AER
message flood? Are we just papering over it by disabling AER?

If an error occurs below a VMD, who notices and reports it? If we
disable native AER below VMD because of _OSC, as this patch does, I
guess we're assuming the platform will handle AER events below VMD.
Is that really true? Does the platform know how to find AER log
registers of devices below VMD?

ACPI (and the specific UEFI implementation) might remain unaware of
VMD domains. It's possible that the system management mode (SMM)
controller, which typically handles firmware-first errors, would be
capable of handling VMD errors in a vendor-specific manner.
However, if _OSC hasn't taken VMD ports into account, SMM wouldn't
be capable of handling those errors, and silently disabling AER on
VMD domains is a bad idea.

The bugzilla made it sound like a specific platform/drive combination.
What about a DMI match to mask the Corrected Physical Layer bits?


The platform firmware does that through ACPI _OSC under the host
bridge device (not under the VMD device) which it is very well aware
of.
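
For readers following the thread, here is a minimal userspace sketch of
what "honoring _OSC" on the VMD domain could amount to: the feature
ownership negotiated for the physical host bridge is copied onto the
bridge that VMD creates. The struct and function names below are
hypothetical stand-ins, not the driver's actual code; the field names
are modeled on the native_* flags that I believe exist in the kernel's
struct pci_host_bridge, and that is an assumption of this sketch.

```c
#include <stdbool.h>

/*
 * Hypothetical userspace stand-in for the per-bridge feature ownership
 * flags. "true" means the OS handles the feature natively; "false"
 * means _OSC left control of it with the platform firmware.
 */
struct host_bridge_flags {
	bool native_aer;
	bool native_pcie_hotplug;
	bool native_pme;
	bool native_ltr;
	bool native_dpc;
};

/*
 * Propagate the physical host bridge's _OSC result to the VMD bridge,
 * so that a feature the OS was not granted at the host bridge level is
 * not enabled natively below VMD either.
 */
static void copy_host_bridge_flags(const struct host_bridge_flags *root,
				   struct host_bridge_flags *vmd)
{
	vmd->native_aer = root->native_aer;
	vmd->native_pcie_hotplug = root->native_pcie_hotplug;
	vmd->native_pme = root->native_pme;
	vmd->native_ltr = root->native_ltr;
	vmd->native_dpc = root->native_dpc;
}
```

With the platform in the report above ("_OSC: platform does not support
[AER]"), native_aer would be false on the root bridge, and after the
copy the port driver would no longer enable AER on the ports below VMD.
Whether anything else then handles errors in that domain is exactly the
open question in this thread.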