Re: [PATCH 0/7] cxl: Consolidate cxlmd->endpoint accessing
From: Alison Schofield
Date: Tue Mar 10 2026 - 15:22:02 EST
On Tue, Mar 10, 2026 at 11:57:52PM +0800, Li Ming wrote:
> Currently, CXL subsystem implementation has some functions that may
> access CXL memdev's endpoint before the endpoint initialization
> completed or without checking the CXL memdev endpoint validity.
> This patchset fixes three scenarios as above description.
Hi Ming,
I notice you didn't give each patch a fixes tag, I'll comment
in that per patch.
Below - a couple of questions about reproducing these failures:
>
> 1. cxl_dpa_to_region() is possible to access an invalid CXL memdev
> endpoint.
> there are two scenarios that can trigger this issue:
> a. memdev poison injection/clearing debugfs interfaces:
> devm_cxl_add_endpoint() is used to register CXL memdev endpoint
> and update cxlmd->endpoint from -ENXIO to the endpoint structure.
> memdev poison injection/clearing debugfs interfaces are registered
> before devm_cxl_add_endpoint() is invoked in cxl_mem_probe().
> There is a small window where user can use the debugfs interfaces
> to access an invalid endpoint.
Do you have a reproducer for this 1.a above?
Meson unit test does support parallel execution, so we could create
a race btw poison inject and region reconfig. It's an interesting task,
yet low priority considering the debugfs feature is for our expert users.
If you do have something, please share.
> b. cxl_event_config() in the end of cxl_pci_probe():
> cxl_event_config() invokes cxl_mem_get_event_record() to get
> remain event logs from CXL device during cxl_pci_probe(). If CXL
> memdev probing failed before that, it is also possible to access
> an invalid endpoint.
b. is reproducible.
Since your original find of this, I've been thinking about adding a
new group of tests, like the 'stressors', so that we can do things
like this on demand, but not as default in the existing cxl suite.
> To fix these two cases, cxl_dpa_to_region() requires callers holding
> CXL memdev lock to access it and check if CXL memdev driver bingding
> status. Holding CXL memdev lock ensures that CXL memdev probing has
> completed, and if CXL memdev driver is bound, it will mean
> cxlmd->endpoint is valid. (PATCH #1-#5)
>
> 2. cxl_reset_done() callback in cxl_pci module.
> cxl_reset_done() callback also accesses cxlmd->endpoint without any
> checking. If CXL memdev probing fails, then cxl_reset_done() is
> called by PCI subsystem, it will access an invalid endpoint. The
> solution is adding a CXL memdev driver binding status inside
> cxl_reset_done(). (PATCH #6)
Is 2. above reproducible, by inspection, or something else?
>
> Besides, the patchset also includes a fix for cxlmd->endpoint reset,
> cxlmd->endpoint is set to -ENXIO by default during cxlmd allocation. It
> will be updated when endpoint is allocated and added to the bus.
> However, the CXL driver does not reset it to -ENXIO when the endpoint is
> released. (PATCH #7)
>
> ---
> Li Ming (7):
> driver core: Add conditional guard support for device_lock()
> cxl/memdev: Hold memdev lock during memdev poison injection/clear
> cxl/region: Hold memdev lock during region poison injection/clear
> cxl/pci: Hold memdev lock in cxl_event_trace_record()
> cxl/region: Ensure endpoint is valid in cxl_dpa_to_region()
> cxl/pci: Check memdev driver binding status in cxl_reset_done()
> cxl/port: Reset cxlmd->endpoint to -ENXIO by default
>
> drivers/cxl/core/core.h | 4 +-
> drivers/cxl/core/mbox.c | 5 ++-
> drivers/cxl/core/memdev.c | 10 +++++
> drivers/cxl/core/port.c | 14 ++++--
> drivers/cxl/core/region.c | 112 ++++++++++++++++++++++++++++++++++------------
> drivers/cxl/cxlmem.h | 2 +-
> drivers/cxl/pci.c | 3 ++
> include/linux/device.h | 1 +
> 8 files changed, 114 insertions(+), 37 deletions(-)
> ---
> base-commit: 11439c4635edd669ae435eec308f4ab8a0804808
> change-id: 20260308-fix_access_endpoint_without_drv_check-f2e6ff4bdc48
>
> Best regards,
> --
> Li Ming <ming.li@xxxxxxxxxxxx>
>