Re: [PATCH] x86/pci: Stop requiring MMCONFIG to be declared in E820, ACPI or EFI for newer systems
From: Bjorn Helgaas
Date: Thu Dec 14 2023 - 18:31:16 EST
On Thu, Dec 14, 2023 at 03:45:43PM -0600, Mario Limonciello wrote:
> On 12/14/2023 14:43, Bjorn Helgaas wrote:
> > On Tue, Dec 05, 2023 at 12:28:44PM -0600, Mario Limonciello wrote:
> > > On 12/5/2023 11:31, Bjorn Helgaas wrote:
> > > > On Tue, Dec 05, 2023 at 11:00:31AM -0600, Mario Limonciello wrote:
> > > > > On 12/5/2023 10:17, Bjorn Helgaas wrote:
> > > > > > On Tue, Dec 05, 2023 at 09:48:45AM -0600, Mario Limonciello wrote:
> > > > > > > commit 7752d5cfe3d1 ("x86: validate against acpi motherboard
> > > > > > > resources") introduced checks for ensuring that MCFG table
> > > > > > > also has memory region reservations to ensure no conflicts
> > > > > > > were introduced from a buggy BIOS.
> > > > > > >
> > > > > > > This has proceeded over time to add other types of
> > > > > > > reservation checks for ACPI PNP resources and EFI MMIO
> > > > > > > memory type. The PCI firmware spec however says that these
> > > > > > > checks are only required when the operating system doesn't
> > > > > > > comprehend the firmware region:
> > > > > > >
> > > > > > > ``` If the operating system does not natively comprehend
> > > > > > > reserving the MMCFG region, the MMCFG region must be
> > > > > > > reserved by firmware. The address range reported in the MCFG
> > > > > > > table or by _CBA method (see Section 4.1.3) must be reserved
> > > > > > > by declaring a motherboard resource. For most systems, the
> > > > > > > motherboard resource would appear at the root of the ACPI
> > > > > > > namespace (under \_SB) in a node with a _HID of EISAID
> > > > > > > (PNP0C02), and the resources in this case should not be
> > > > > > > claimed in the root PCI bus’s _CRS. The resources can
> > > > > > > optionally be returned in Int15 E820h or EFIGetMemoryMap as
> > > > > > > reserved memory but must always be reported through ACPI as
> > > > > > > a motherboard resource. ```
> > > > > >
> > > > > > My understanding is that native comprehension would mean Linux
> > > > > > knows how to discover and/or configure the MMCFG base address
> > > > > > and size in the hardware and that Linux would then reserve
> > > > > > that region so it's not used for anything else.
> > > > > >
> > > > > > Linux doesn't have that, at least for x86. It relies on the
> > > > > > MCFG table to discover the MMCFG region, and it relies on
> > > > > > PNP0C02 _CRS to reserve it.
> > > > >
> > > > > MCFG to discover it matches the PCI firmware spec, but as I
> > > > > point out above the decision to reserve this region doesn't
> > > > > require PNP0C01/PNP0C02 _CRS.
> > > >
> > > > Can you explain this reasoning a little more? I claim Linux does
> > > > not natively comprehend reserving the MMCFG region, but it sounds
> > > > like you don't agree? I think "native" comprehension would mean
> > > > Linux would not need the MCFG table.
> > >
> > > After our thread and the spec again I think you're right Linux
> > > doesn't natively comprehend (reserve this region;) particularly
> > > because of the stance you have on "static table" vs _CRS.
> >
> > ["My stance" refers to this:
> >
> > Static tables like MCFG, HPET, ECDT, etc., are *not* mechanisms for
> > reserving address space. The static tables are for things the OS
> > needs to know early in boot, before it can parse the ACPI namespace.
> > If a new table is defined, an old OS needs to operate correctly even
> > though it ignores the table. _CRS allows that because it is generic
> > and understood by the old OS; a static table does not.
> >
> > from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/PCI/acpi-info.rst?id=v6.6#n32]
> >
> > I don't think this is just my stance. The ACPI spec could be clearer
> > in terms of requiring PNP0C02 devices, not static tables, to reserve
> > address space, but I think that requirement is a logical consequence
> > of the ACPI design.
> >
> > It's a goal of ACPI that an OS we release today should run on a
> > platform released tomorrow. If the new platform uses a static table
> > to reserve address space used by some new hardware, today's OS doesn't
> > know about it and could place another device on top of it.
> >
> > Using _CRS of an ACPI device to reserve the new hardware address space
> > is different because it works even with today's OS. Today's OS can't
> > *operate* tomorrow's hardware, but at least it won't create address
> > conflicts with it.
> >
> > > I just don't want to throw the vendor under the bus as it could have
> > > been caught "sooner" and fixed by BIOS adding _CRS.
> >
> > The MCFG requirement for PNP0C02 _CRS has been in the PCI Firmware
> > spec since r3.0 in 2005. I'm surprised that vendors still get this
> > wrong.
>
> Probably worth mentioning with the clairvoyance of the root cause of
> the issue that prompted this conversation I've now discovered
> another system with the exact same problem. It's a different OEM,
> different generation of hardware and a different IBV that they use
> for their firmware.
>
> I've also looked through the BIOS for a variety of reference designs
> and I don't see a _CRS entry in any of them.
>
> I'm fairly certain we're just getting lucky in Linux on a lot of
> devices that the region is often overlapping with a region for EFI
> runtime services.
Ugh. Yes, I'm sure it's not an isolated problem.
> > Vendors definitely have an interest in making shipping OSes boot
> > unchanged on new hardware.
>
> At least the OEMs that I talk to use FWTS. FWTS catches this issue,
> but it's marked as LOW. Everyone fixates on the HIGH or CRITICAL.
>
> Given the severity of what I've seen it can do to a system I'm
> proposing FWTS to move it to HIGH:
>
> https://lists.ubuntu.com/archives/fwts-devel/2023-December/013772.html
Thanks. I don't know anything about FWTS, but I'm a little skeptical
that it actually catches this issue. It *looks* like FWTS builds its
idea of the memory map from a dmesg log or /sys/firmware/memmap, which
I think both come from the E820 map, which is x86-specific, of course.
I don't see anything that builds a memory map based on _CRS methods,
which I think is what we really want since the spec says:
The resources can optionally be returned in Int15 E820h or
EFIGetMemoryMap as reserved memory but must always be reported
through ACPI as a motherboard resource.
(PCI Firmware spec r3.3, sec 4.1.2)
> What is the actual *harm* in just using this MCFG table to make a
> reservation when there isn't a PNP0C02 _CRS region declared?
>
> At worst (a buggy BIOS) you would end up with hole in the memory map
> that isn't usable for devices. At best you end up with more working
> devices without changing the firmware.
We definitely need to work around this in Linux, and your patch might
well be the right thing.
I'm a *little* hesitant because all the code in mmconfig-shared.c that
attempts to validate MCFG entries suggests that relying on them
uncritically was a problem in some cases, so I want to try to convince
myself that we really won't break something.
Bjorn