Re: [PATCH 1/1] cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure
From: PJ Waskiewicz
Date: Mon Apr 08 2024 - 15:25:10 EST
On 24/04/08 09:54AM, Dan Williams wrote:
> ppwaskie@ wrote:
> > From: PJ Waskiewicz <ppwaskie@xxxxxxxxxx>
> >
> > Currently, Type 3 CXL devices (CXL.mem) can train using host CXL
> > drivers on Emerald Rapids systems. However, on some production
> > systems from some vendors, a buggy BIOS exists that improperly
> > populates the ACPI => PCI mappings. This leads to the cxl_acpi
> > driver to fail probe when it cannot find the root port's _UID, in
> > order to look up the device's CXL attributes in the CEDT.
> >
> > Add a bit more of a descriptive message that the lookup failure
> > could be a bad BIOS, rather than just "failed."
>
> Makes sense, but is the goal here to name and shame the BIOS, or find a
> potential quirk workaround? Presumably we could fall back to parsing
> _UID instead of a string and then get some guidance from said BIOS about
> how to lookup the corresponding ACPI0016 device from that identifier.

In this particular case, I tried to reconcile the _UID value with what
was actually in the CEDT, and could not find any mapping between the
two.

For this device, it was ACPI0016:02 with a _UID of CX02. For this
particular vendor BIOS, the _UIDs of all ACPI0016:* devices count up
sequentially (CX01 => CX*). But what was actually in the CEDT for
ACPI0016:02 was 49. I tried interpreting the _UID as hex, as octal,
with atoi(), as a literal per-character string, etc. Nothing matched;
the CEDT value was just plain wrong.
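
For illustration only, a minimal userspace sketch of that kind of
cross-check. The helper names are hypothetical, and the specific
interpretations tried here (numeric suffix in base 10/16/8, atoi() on
the whole string, sum of ASCII codes) are assumptions about what such
attempts could look like, not a record of the exact tests that were run.

```c
#include <stdlib.h>
#include <string.h>

/* Values taken from the description above; helper names are hypothetical. */
#define UID_STR   "CX02"
#define CEDT_UID  49UL

/* Parse the digits after the two-letter "CX" prefix in the given base. */
static unsigned long uid_suffix(const char *uid, int base)
{
	return strtoul(uid + 2, NULL, base);
}

/* Sum the ASCII codes of every character in the string. */
static unsigned long uid_char_sum(const char *uid)
{
	unsigned long sum = 0;

	while (*uid)
		sum += (unsigned char)*uid++;
	return sum;
}

/* Return 1 if any of the tried interpretations equals the CEDT value. */
static int uid_matches_cedt(const char *uid, unsigned long cedt_uid)
{
	return uid_suffix(uid, 10) == cedt_uid ||
	       uid_suffix(uid, 16) == cedt_uid ||
	       uid_suffix(uid, 8) == cedt_uid ||
	       (unsigned long)atoi(uid) == cedt_uid ||
	       uid_char_sum(uid) == cedt_uid;
}
```

With UID_STR and CEDT_UID as defined, uid_matches_cedt(UID_STR, CEDT_UID)
returns 0: none of these readings of "CX02" yields 49.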

> In other words, I see this patch as a warning shot of, "hey,
> $platform_vendor if you don't want folks to RMA these platforms please
> tell us how to do the association Linux expects per the spec".
> Otherwise, this can escalate to a loud
> WARN_TAINT(TAINT_FIRMWARE_WORKAROUND...), but I first want more
> details from this platform like an acpidump and the exact error code
> acpi_evaluate_integer() is returning.

Sharing an acpidump will be tricky, since this particular host is
walled off from the world, and moving data in and out of that
environment is quite challenging for regulatory reasons.

acpi_evaluate_integer() in this case was returning AE_BUFFER_OVERFLOW.
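
One plausible reading of that status, sketched in userspace C with mock
types: if the BIOS returns _UID as a string such as "CX02", an
integer-only helper that offers a buffer sized for a single object has
no room for the string payload, so the evaluation reports an overflow
before the object type is ever checked. All mock_* names are
hypothetical stand-ins, not the ACPICA API, and the real sizing rules
may differ.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for ACPICA types; not the real definitions. */
enum mock_status { MOCK_OK, MOCK_BUFFER_OVERFLOW, MOCK_BAD_DATA };
enum mock_type { MOCK_TYPE_INTEGER, MOCK_TYPE_STRING };

struct mock_object {
	enum mock_type type;
	unsigned long long integer;	/* valid for MOCK_TYPE_INTEGER */
	const char *string;		/* valid for MOCK_TYPE_STRING */
};

struct mock_buffer {
	size_t length;
	void *pointer;
};

/* Space needed to return obj: the object itself plus any string payload. */
static size_t mock_required_size(const struct mock_object *obj)
{
	size_t need = sizeof(*obj);

	if (obj->type == MOCK_TYPE_STRING)
		need += strlen(obj->string) + 1;
	return need;
}

/* Copy obj into the caller's buffer, or report that it does not fit. */
static enum mock_status mock_evaluate(const struct mock_object *obj,
				      struct mock_buffer *buf)
{
	size_t need = mock_required_size(obj);

	if (buf->length < need) {
		buf->length = need;	/* tell the caller how much is needed */
		return MOCK_BUFFER_OVERFLOW;
	}
	memcpy(buf->pointer, obj, sizeof(*obj));
	if (obj->type == MOCK_TYPE_STRING) {
		char *payload = (char *)buf->pointer + sizeof(*obj);

		strcpy(payload, obj->string);
		((struct mock_object *)buf->pointer)->string = payload;
	}
	return MOCK_OK;
}

/* Integer-only helper: offers room for exactly one object, no payload. */
static enum mock_status mock_evaluate_integer(const struct mock_object *obj,
					      unsigned long long *out)
{
	struct mock_object element;
	struct mock_buffer buf = { sizeof(element), &element };
	enum mock_status st = mock_evaluate(obj, &buf);

	if (st != MOCK_OK)
		return st;
	if (element.type != MOCK_TYPE_INTEGER)
		return MOCK_BAD_DATA;
	*out = element.integer;
	return MOCK_OK;
}
```

In this mock, an integer-typed object passes through
mock_evaluate_integer() unchanged, while a string-typed "CX02" fails
with MOCK_BUFFER_OVERFLOW rather than MOCK_BAD_DATA, which mirrors the
status reported above.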

In the meantime, I'm fine with either fixing up the commit message per
Jonathan's review or shelving it in favor of a broader effort to fix
the underlying BIOSes with the vendors. I don't have a strong
preference. I've been in the weeds with this for a while, so I know why
it's breaking. But someone new to CXL with shiny new hardware may be
left scratching their heads.

-PJ