Re: [PATCH v3 1/3] cxl/pci: Fix appropriate checking for _OSC while handling CXL RAS registers

From: Dave Jiang
Date: Wed Aug 16 2023 - 17:40:35 EST




On 8/16/23 14:33, Smita Koralahalli wrote:
On 8/16/2023 11:06 AM, Dave Jiang wrote:


On 8/4/23 05:09, Jonathan Cameron wrote:
On Thu, 3 Aug 2023 23:01:27 +0000
Smita Koralahalli <Smita.KoralahalliChannabasappa@xxxxxxx> wrote:

According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
of AER should also own CXL Protocol Error Management as there is no
explicit control of CXL Protocol error. And the CXL RAS Cap registers
reported on Protocol errors should check for AER _OSC rather than CXL
Memory Error Reporting Control _OSC.

The CXL Memory Error Reporting Control _OSC specifically highlights
handling Memory Error Logging and Signaling Enhancements. These kinds of
errors are reported through a device's mailbox and can be managed
independently from CXL Protocol Errors.

This change fixes handling and reporting CXL Protocol Errors and RAS
registers natively with native AER and FW-First CXL Memory Error Reporting
Control.

[1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.

Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@xxxxxxx>
Reviewed-by: Robert Richter <rrichter@xxxxxxx>

I'd be tempted to add a comment on why this returns 0 rather than an
error.  I think that makes sense but it isn't immediately obvious from
the local context.

Otherwise LGTM

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>

Echo Jonathan's comment.

Reviewed-by: Dave Jiang <dave.jiang@xxxxxxxxx>

Yes, and Dan is probably against returning error code.
https://lore.kernel.org/all/64d1b3e78629f_5ea6e2944@xxxxxxxxxxxxxxxxxxxxxxxxx.notmuch/

But I think returning zero is required as we don't want to interfere with cxl device access when operating in native cxl memory error reporting. Returning error code will basically fail cxl_pci_probe() and thus fail to create a cxl device node.

I was thinking a single line comment as: "Return zero to not block the communication with the cxl device when in native memory error reporting mode".

Looks reasonable to me. Unless Dan feels it needs to fail the probe.


Agree? Or anything more that needs to be added?

Thanks,
Smita



---
v2:
    Added fixes tag.
    Included what the patch fixes in commit message.
v3:
    Added "Reviewed-by" tag.
---
  drivers/cxl/pci.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 1cb1494c28fe..2323169b6e5f 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev)
          return 0;
      }
-    /* BIOS has CXL error control */
-    if (!host_bridge->native_cxl_error)
-        return -ENXIO;
+    /* BIOS has PCIe AER error control */
+    if (!host_bridge->native_aer)
+        return 0;
      rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
      if (rc)