Re: [PATCH 1/1] cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup failure

From: PJ Waskiewicz
Date: Mon Apr 08 2024 - 17:33:14 EST


On 24/04/08 01:45PM, Dan Williams wrote:
> PJ Waskiewicz wrote:
> [..]
> > > Other than that seems reasonable to hint it is probably a bios
> > > bug - however I wonder how many other cases we should do this for and
> > > whether it is worth the effort of marking them all?
> >
> > I can confirm this was definitely a BIOS bug in this particular case.
> > The vendor spun a quick test BIOS for us to test on an EMR and SPR host,
> > and the _UID's were finally correct. I could successfully walk the CEDT
> > and get to the CAPS structs I was after (link speed, bus width, etc.).
>
> Oh, in that case I think there's no need to worry about a Linux quirk.

Copy that.

> I do think cxl_acpi has multiple overlapping failure cases when what you
> really want to know is whether:
>
> "CXL host bridge can be found in CEDT.CHBS"

Yes.

> Turns out that straightforward message is aleady a driver message, but
> it gets skipped in this case. So, I am thinking of cleanup /
> clarification along the following lines:
>
> 1/ Lean on the existing cxl_get_chbs() validation paths to report on
> errors
>
> 2/ Include the device-name rather than the UID since if UID is
> unreliable it does not help you communicate with your BIOS vendor. I.e.
> give a breadcrumb for the BIOS engineer to match the AML device name
> with the CEDT content.
>
> 3/ Do not fail driver load on a single host-bridge parsing failure
>
> 4/ These are all cxl_acpi driver init events, so consistently use the
> ACPI0017 device, and the cxl_acpi driver, as the originator of the error
> message.
>
> Would this clarification have saved you time with the debug?

Inline below, but yes, this would have helped sped up the debug quite a
bit. Since the initial debug was happening on SPR, and I couldn't use
the host drivers, I was pulling the logic from the CXL host drivers into
my drivers to skip the need for ACPI0017.

> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> index 32091379a97b..5a70d7312c64 100644
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -511,29 +511,26 @@ static int cxl_get_chbs_iter(union acpi_subtable_headers *header, void *arg,
> return 0;
> }
>
> -static int cxl_get_chbs(struct device *dev, struct acpi_device *hb,
> - struct cxl_chbs_context *ctx)
> +static void cxl_get_chbs(struct device *dev, struct acpi_device *hb,
> + struct cxl_chbs_context *ctx)
> {
> - unsigned long long uid;
> int rc;
>
> - rc = acpi_evaluate_integer(hb->handle, METHOD_NAME__UID, NULL, &uid);
> - if (rc != AE_OK) {
> - dev_err(dev, "unable to retrieve _UID\n");
> - return -ENOENT;
> - }
> -
> - dev_dbg(dev, "UID found: %lld\n", uid);
> - *ctx = (struct cxl_chbs_context) {
> + *ctx = (struct cxl_chbs_context){
> .dev = dev,
> - .uid = uid,
> .base = CXL_RESOURCE_NONE,
> .cxl_version = UINT_MAX,
> };
>
> - acpi_table_parse_cedt(ACPI_CEDT_TYPE_CHBS, cxl_get_chbs_iter, ctx);
> + rc = acpi_evaluate_integer(hb->handle, METHOD_NAME__UID, NULL,
> + &ctx->uid);
> + if (rc != AE_OK) {
> + dev_dbg(dev, "unable to retrieve _UID\n");
> + return;
> + }

Before I read the changes below, I didn't see why this was still
useful...but I do see it now in full context.

>
> - return 0;
> + dev_dbg(dev, "UID found: %lld\n", ctx->uid);
> + acpi_table_parse_cedt(ACPI_CEDT_TYPE_CHBS, cxl_get_chbs_iter, ctx);
> }
>
> static int get_genport_coordinates(struct device *dev, struct cxl_dport *dport)
> @@ -561,7 +558,6 @@ static int get_genport_coordinates(struct device *dev, struct cxl_dport *dport)
> static int add_host_bridge_dport(struct device *match, void *arg)
> {
> int ret;
> - acpi_status rc;
> struct device *bridge;
> struct cxl_dport *dport;
> struct cxl_chbs_context ctx;
> @@ -573,19 +569,16 @@ static int add_host_bridge_dport(struct device *match, void *arg)
> if (!hb)
> return 0;
>
> - rc = cxl_get_chbs(match, hb, &ctx);
> - if (rc)
> - return rc;
> -
> + cxl_get_chbs(match, hb, &ctx);
> if (ctx.cxl_version == UINT_MAX) {
> - dev_warn(match, "No CHBS found for Host Bridge (UID %lld)\n",
> - ctx.uid);
> + dev_err(host, FW_BUG "No CHBS found for Host Bridge (%s)\n",
> + dev_name(match));
> return 0;
> }

This would have been more obvious that the lookup failed for "reasons"
instead of AE_BUFFER_OVERFLOW (which led to the fun Alice-style LXR
spelunking).

>
> if (ctx.base == CXL_RESOURCE_NONE) {
> - dev_warn(match, "CHBS invalid for Host Bridge (UID %lld)\n",
> - ctx.uid);
> + dev_err(host, FW_BUG "CHBS invalid for Host Bridge (%s)\n",
> + dev_name(match));

I think this is a good catch-all too in case the lookup is good, but the
vendor borked filling it in properly.

> return 0;
> }
>
> @@ -650,13 +643,11 @@ static int add_host_bridge_uport(struct device *match, void *arg)
> return 0;
> }
>
> - rc = cxl_get_chbs(match, hb, &ctx);
> - if (rc)
> - return rc;
> -
> + cxl_get_chbs(match, hb, &ctx);
> if (ctx.cxl_version == ACPI_CEDT_CHBS_VERSION_CXL11) {
> - dev_warn(bridge,
> - "CXL CHBS version mismatch, skip port registration\n");
> + dev_err(host,
> + FW_BUG "CXL CHBS version mismatch, skip port registration for %s\n",
> + dev_name(match));
> return 0;
> }

I like your version here much better than mine. If you want to merge
that, then:

Reviewed-by: PJ Waskiewicz <ppwaskie@xxxxxxxxxx>