Re: [PATCH] x86/pci: Stop requiring MMCONFIG to be declared in E820, ACPI or EFI for newer systems

From: Mario Limonciello
Date: Thu Dec 14 2023 - 16:47:34 EST


On 12/14/2023 14:43, Bjorn Helgaas wrote:
[+cc Len, Robert in case I'm missing something about static tables
reserving address space]

On Tue, Dec 05, 2023 at 12:28:44PM -0600, Mario Limonciello wrote:
On 12/5/2023 11:31, Bjorn Helgaas wrote:
On Tue, Dec 05, 2023 at 11:00:31AM -0600, Mario Limonciello wrote:
On 12/5/2023 10:17, Bjorn Helgaas wrote:
On Tue, Dec 05, 2023 at 09:48:45AM -0600, Mario Limonciello wrote:
commit 7752d5cfe3d1 ("x86: validate against acpi motherboard
resources") introduced checks for ensuring that MCFG table
also has memory region reservations to ensure no conflicts
were introduced from a buggy BIOS.

This has proceeded over time to add other types of
reservation checks for ACPI PNP resources and EFI MMIO
memory type. The PCI firmware spec however says that these
checks are only required when the operating system doesn't
comprehend the firmware region:

``` If the operating system does not natively comprehend
reserving the MMCFG region, the MMCFG region must be
reserved by firmware. The address range reported in the MCFG
table or by _CBA method (see Section 4.1.3) must be reserved
by declaring a motherboard resource. For most systems, the
motherboard resource would appear at the root of the ACPI
namespace (under \_SB) in a node with a _HID of EISAID
(PNP0C02), and the resources in this case should not be
claimed in the root PCI bus’s _CRS. The resources can
optionally be returned in Int15 E820h or EFIGetMemoryMap as
reserved memory but must always be reported through ACPI as
a motherboard resource. ```

My understanding is that native comprehension would mean Linux
knows how to discover and/or configure the MMCFG base address
and size in the hardware and that Linux would then reserve
that region so it's not used for anything else.

Linux doesn't have that, at least for x86. It relies on the
MCFG table to discover the MMCFG region, and it relies on
PNP0C02 _CRS to reserve it.

MCFG to discover it matches the PCI firmware spec, but as I
point out above the decision to reserve this region doesn't
require PNP0C01/PNP0C02 _CRS.

Can you explain this reasoning a little more? I claim Linux does
not natively comprehend reserving the MMCFG region, but it sounds
like you don't agree? I think "native" comprehension would mean
Linux would not need the MCFG table.

After our thread and the spec again I think you're right Linux
doesn't natively comprehend (reserve this region;) particularly
because of the stance you have on "static table" vs _CRS.

["My stance" refers to this:

Static tables like MCFG, HPET, ECDT, etc., are *not* mechanisms for
reserving address space. The static tables are for things the OS
needs to know early in boot, before it can parse the ACPI namespace.
If a new table is defined, an old OS needs to operate correctly even
though it ignores the table. _CRS allows that because it is generic
and understood by the old OS; a static table does not.

from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/PCI/acpi-info.rst?id=v6.6#n32]

I don't think this is just my stance. The ACPI spec could be clearer
in terms of requiring PNP0C02 devices, not static tables, to reserve
address space, but I think that requirement is a logical consequence
of the ACPI design.

It's a goal of ACPI that an OS we release today should run on a
platform released tomorrow. If the new platform uses a static table
to reserve address space used by some new hardware, today's OS doesn't
know about it and could place another device on top of it.

Using _CRS of an ACPI device to reserve the new hardware address space
is different because it works even with today's OS. Today's OS can't
*operate* tomorrow's hardware, but at least it won't create address
conflicts with it.

I just don't want to throw the vendor under the bus as it could have
been caught "sooner" and fixed by BIOS adding _CRS.

The MCFG requirement for PNP0C02 _CRS has been in the PCI Firmware
spec since r3.0 in 2005. I'm surprised that vendors still get this
wrong.

Probably worth mentioning with the clairvoyance of the root cause of the issue that prompted this conversation I've now discovered another system with the exact same problem. It's a different OEM, different generation of hardware and a different IBV that they use for their firmware.

I've also looked through the BIOS for a variety of reference designs and I don't see a _CRS entry in any of them.

I'm fairly certain we're just getting lucky in Linux on a lot of devices that the region is often overlapping with a region for EFI runtime services.

Vendors definitely have an interest in making shipping OSes
boot unchanged on new hardware.

At least the OEMs that I talk to use FWTS. FWTS catches this issue, but it's marked as LOW. Everyone fixates on the HIGH or CRITICAL.

Given the severity of what I've seen it can do to a system I'm proposing FWTS to move it to HIGH:

https://lists.ubuntu.com/archives/fwts-devel/2023-December/013772.html


Knowing Windows works without it I feel this is still something that we
should be looking at fixing from an upstream perspective though which is
what prompted my patch and discussion.

The fact that Windows works doesn't mean the firmware is correct.

Of course. But it also means it's worth looking at the semantics of why Windows works.

Linux assigns PCI BARs from the bottom up, and ECAM is often at the
bottom of a host bridge aperture.

Windows assigns PCI BARs from the top down, so even without a _CRS
reservation for the ECAM space, Windows is much less likely to put
something on top of it.

Bjorn


I guess I'm naïve in that I don't know how exactly to check what Windows
*really* does (gets lucky from top down or actually reserves from MCFG).

But I do feel the OS doing it from top down vs bottom up is tangential to the decision of whether or not make a reservation in Linux for something you know about without a _CRS entry.

I can push AMD's reference designs to improve, I can push OEMs I talk to improve and I can try to influence FWTS but I have to ask:

What is the actual *harm* in just using this MCFG table to make a reservation when there isn't a PNP0C02 _CRS region declared?

At worst (a buggy BIOS) you would end up with hole in the memory map that isn't usable for devices. At best you end up with more working devices without changing the firmware.