Re: [PATCH] cxl/region: Fix a race bug in delete_region_store

From: Sungwoo Kim

Date: Wed Mar 11 2026 - 02:55:58 EST

On Tue, Mar 10, 2026 at 6:54 PM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> Sungwoo Kim wrote:
> > Fix this by replacing devm_release_action() with devm_remove_action_nowarn()
> > followed by a manual call to unregister_region(). devm_remove_action_nowarn()
> > removes the devres tracking entry and returns an error code.
>
> No, that is not an acceptable or comprehensive fix. A subsystem should
> never try to double unregister a device. Keep in mind the decoder sysfs
> stays alive while the devres_release_all() actions are running for
> port->uport_dev. As a result, yes, Davidlohr is right. You can
> theoretically observe the unregister_region() release action firing
> while delete_region_store() is running.

I appreciate your review and suggestion of a new patch. It seems more
complicated than I expected!

I'm new to CXL and device drivers, so I would like to confirm that my
understanding is correct.
- CXL tolpology could be: [upstream] host -> switch(s) -> endpoints(s)
[downstream].
- The host can configure a unique address range (= region) to access
each device.
- The bug happened because releasing the same regions can race.
- Although the prior patch handled this, it's insufficient because it
can also race with releasing the device (what Davidlohr had reported).

>
> The only comprehensive fix I currently see to that problem is to indeed
> get it back synchronized under the device lock. This prevents multiple
> requesters from colliding, and from devres_release_all() deleting an
> action that delete_region_store() already committed to handling.
>
> This looks something like schedule the devres action to be released in
> workqueue under guard(device)(&port->udev).

I assume you meant guard(device)(port->uport_dev).

>
> It is ugly, but it may be the case that this wants to synchronously
> delete the region, and asynchronously cleanup the release action.
>
> I.e. unregister_region() grows a new:
>
> if (!test_and_set_bit(CXL_REGION_F_DELETE, ...)
>
> ...flag to enforce only one path successfully deletes.
> delete_region_store() calls unregister_region() directly. Then the

To fix this, you suggested:
- Add a flag to make unregister_region() idempotent.
- Make delete_region_store() call unregister_region() directly.

Now unregister_region() is race-free. But we need to release an action.

> workqueue's job is to trigger the release action under the lock only if
> port->uport_dev is still driver-attached by the time the workqueue runs.
>

What I'm missing is this part.
How could asynchronous cleanup solve the race? In the worst case, the
workqueue could start immediately after we schedule it. In this case,
it's the same as the synchronous execution.
Or do we have other reasons to make it asynchronous?

> In the meantime the workqueue needs to hold a reference on the region
> device until it can observe the state of CXL_REGION_F_DELETE under the
> device lock.

If it's the same for synchronous execution, we might execute this
directly in unregister_region(), holding
guard(device)(port->uport_dev).

>
> Otherwise I suspect if you put a device_lock() in delete_region_store()
> it will deadlock.
>
> > ------------[ cut here ]------------
> > WARNING: drivers/base/devres.c:824 at devm_release_action drivers/base/devres.c:824 [inline], CPU#0: syz.1.12224/47589
> > WARNING: drivers/base/devres.c:824 at devm_release_action+0x2b2/0x360 drivers/base/devres.c:817, CPU#0: syz.1.12224/47589
>
> For the next posting, these 2 lines are sufficient, no need for all the
> extra detail of a full backtrace.
>

Copy that.

Thanks again for your detailed reviews.
Sungwoo.