Re: [PATCH 1/1] cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure

From: PJ Waskiewicz
Date: Mon Apr 08 2024 - 15:30:07 EST


On 24/04/08 09:34AM, Jonathan Cameron wrote:
> On Sun, 7 Apr 2024 19:03:23 -0700
> PJ Waskiewicz <ppwaskie@xxxxxxxxxx> wrote:
>
> > On 24/04/07 11:28PM, Lukas Wunner wrote:
> >
> > Hi Lukas,
> >
> > > On Sun, Apr 07, 2024 at 02:05:26PM -0700, ppwaskie@xxxxxxxxxx wrote:
> > > > --- a/drivers/cxl/acpi.c
> > > > +++ b/drivers/cxl/acpi.c
> > > > @@ -504,7 +504,7 @@ static int cxl_get_chbs(struct device *dev, struct acpi_device *hb,
> > > >
> > > >  	rc = acpi_evaluate_integer(hb->handle, METHOD_NAME__UID, NULL, &uid);
> > > >  	if (rc != AE_OK) {
> > > > -		dev_err(dev, "unable to retrieve _UID\n");
> > > > +		dev_err(dev, "unable to retrieve _UID. Potentially buggy BIOS\n");
> > > >  		return -ENOENT;
> > > >  	}
> > >
> > > 		dev_err(dev, FW_BUG "unable to retrieve _UID\n");
> > > 		             ^^^^^^
> > >
> > > There's a macro for that.
> >
> > Doh...it's been awhile since I've crossed buggy BIOS's. Thanks for the
> > review and comment.
> >
> > Updated patch:
> >
> > cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure
> >
> > From: PJ Waskiewicz <ppwaskie@xxxxxxxxxx>
> >
> > Currently, Type 3 CXL devices (CXL.mem) can train using host CXL
> > drivers on Emerald Rapids systems. However, on some production
> > systems from some vendors, a buggy BIOS exists that improperly
> > populates the ACPI => PCI mappings. This leads to the cxl_acpi
> > driver to fail probe when it cannot find the root port's _UID, in
> > order to look up the device's CXL attributes in the CEDT.
> >
> > Add a bit more of a descriptive message that the lookup failure
> > could be a bad BIOS, rather than just "failed."
> >
> > v2: Updated message to use existing FW_BUG macro
> Move the change log "v2..." etc below the ---
> as we don't want it in the long term git log + better to send a fresh
> patch in a separate thread.

Thanks, it's been a while, and my usual (i.e. old) workflow isn't
available to me yet. I can fix this and send a new patch, but I'll
hold off until I see what Dan thinks of my reply to his reply.

> Other than that seems reasonable to hint it is probably a bios
> bug - however I wonder how many other cases we should do this for and
> whether it is worth the effort of marking them all?

I can confirm this was definitely a BIOS bug in this particular case.
The vendor spun a quick test BIOS for us to try on an EMR and an SPR
host, and the _UIDs were finally correct. I could successfully walk the
CEDT and get to the CAPS structs I was after (link speed, bus width, etc.).
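
To illustrate why a bad _UID stops the lookup cold, here is a minimal
user-space sketch of matching a host bridge's _UID against CEDT
entries. The struct and function names are hypothetical and the entry
layout is heavily simplified; this is not the driver's actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a CEDT host bridge entry. */
struct chbs_entry {
	uint32_t uid;	/* should match the host bridge's ACPI _UID */
	uint64_t base;	/* base address of the register block */
};

/*
 * Return the entry whose uid matches, or NULL if there is none.
 * With a missing or wrong _UID, no entry matches and the caller
 * has nothing to work with.
 */
static const struct chbs_entry *find_chbs(const struct chbs_entry *tbl,
					  size_t n, uint32_t uid)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].uid == uid)
			return &tbl[i];
	}
	return NULL;
}
```

With a table holding uids 7 and 9, find_chbs(tbl, 2, 9) returns the
second entry, and any uid the firmware never published returns NULL.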

I'd also be fine marking the others, but I don't have an easy way to
validate that I'd hit those cases. My BIOS for this platform is only
slightly broken. I suppose it could be mocked in QEMU to cause those to
fail...
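
For what it's worth, marking another failure path is mechanically just
a prefix on the format string. A minimal user-space sketch, where
fmt_uid_error() is a hypothetical stand-in for dev_err() and only the
FW_BUG definition mirrors the kernel's include/linux/printk.h:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same string the kernel uses for FW_BUG. */
#define FW_BUG "[Firmware Bug]: "

/*
 * Hypothetical stand-in for dev_err(): formats the message into buf
 * instead of logging it. Returns what snprintf() returns.
 */
static int fmt_uid_error(char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	/* Adjacent string literals are concatenated at compile time. */
	return snprintf(buf, len, FW_BUG "unable to retrieve _UID\n");
}
```

The resulting message starts with the "[Firmware Bug]: " tag, so it can
be grepped for in dmesg alongside other firmware complaints.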

-PJ