Re: [PATCH 0/7] cxl: Consolidate cxlmd->endpoint accessing

From: Li Ming

Date: Wed Mar 11 2026 - 06:53:46 EST

在 2026/3/11 03:20, Alison Schofield 写道:

On Tue, Mar 10, 2026 at 11:57:52PM +0800, Li Ming wrote:

Currently, CXL subsystem implementation has some functions that may
access CXL memdev's endpoint before the endpoint initialization
completed or without checking the CXL memdev endpoint validity.
This patchset fixes three scenarios as above description.

Hi Ming,

I notice you didn't give each patch a fixes tag, I'll comment
in that per patch.

Below - a couple of questions about reproducing these failures:

Thank you Alison.

1. cxl_dpa_to_region() is possible to access an invalid CXL memdev
endpoint.
there are two scenarios that can trigger this issue:
a. memdev poison injection/clearing debugfs interfaces:
devm_cxl_add_endpoint() is used to register CXL memdev endpoint
and update cxlmd->endpoint from -ENXIO to the endpoint structure.
memdev poison injection/clearing debugfs interfaces are registered
before devm_cxl_add_endpoint() is invoked in cxl_mem_probe().
There is a small window where user can use the debugfs interfaces
to access an invalid endpoint.

Do you have a reproducer for this 1.a above?
Meson unit test does support parallel execution, so we could create
a race btw poison inject and region reconfig. It's an interesting task,
yet low priority considering the debugfs feature is for our expert users.
If you do have something, please share.

1.a is a theoretical issue, I added a delay between memdev debugfs interfaces creating and devm_cxl_add_endpoint() so that I have a chance to access the debugfs interfaces before endpoint initialization manually.

Then I can trigger below trace. Maybe use script to write the intetface when the interfaces showing up is a choice?

Oops: general protection fault, probably for non-canonical address 0xdffffc000000008b: 0000 [#1] SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000458-0x000000000000045f]
CPU: 1 UID: 0 PID: 811 Comm: bash Tainted: G O 7.0.0-rc2-debug-dirty #19 PREEMPT(full)
Tainted: [O]=OOT_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2024.02-2ubuntu0.7 11/27/2025
RIP: 0010:cxl_dpa_to_region+0xa2/0x130 [cxl_core]
Call Trace:
<TASK>
cxl_inject_poison_locked+0x1d7/0x620 [cxl_core]
cxl_inject_poison+0x8a/0xe0 [cxl_core]

....
debugfs_attr_write+0x62/0x90

b. cxl_event_config() in the end of cxl_pci_probe():
cxl_event_config() invokes cxl_mem_get_event_record() to get
remain event logs from CXL device during cxl_pci_probe(). If CXL
memdev probing failed before that, it is also possible to access
an invalid endpoint.

b. is reproducible.
Since your original find of this, I've been thinking about adding a
new group of tests, like the 'stressors', so that we can do things
like this on demand, but not as default in the existing cxl suite.

It sounds good!

To fix these two cases, cxl_dpa_to_region() requires callers holding
CXL memdev lock to access it and check if CXL memdev driver bingding
status. Holding CXL memdev lock ensures that CXL memdev probing has
completed, and if CXL memdev driver is bound, it will mean
cxlmd->endpoint is valid. (PATCH #1-#5)

2. cxl_reset_done() callback in cxl_pci module.
cxl_reset_done() callback also accesses cxlmd->endpoint without any
checking. If CXL memdev probing fails, then cxl_reset_done() is
called by PCI subsystem, it will access an invalid endpoint. The
solution is adding a CXL memdev driver binding status inside
cxl_reset_done(). (PATCH #6)

Is 2. above reproducible, by inspection, or something else?

I reproduced it by forcedly letting cxl_mem_probe() fail before devm_cxl_add_endpoint(), and manully write the sysfs reset interface of the device to trigger it.

Ming

[snip]