Re: [PATCH] cxl/region: Fix a race bug in delete_region_store
From: Dan Williams
Date: Tue Mar 10 2026 - 18:54:01 EST
Sungwoo Kim wrote:
> A race exists when two concurrent sysfs writes to delete_region specify
> the same region name. Both calls succeed in cxl_find_region_by_name()
> (which only does device_find_child_by_name and takes a reference), and
> both then proceed to call devm_release_action(). The first call atomically
> removes and releases the devres entry successfully. The second call finds
> no matching entry, causing devres_release() to return -ENOENT, which trips
> the WARN_ON.
Ugh, yes, good find.
I think I understand the pathology of how this happened. Touching devres
actions needs to happen under device lock or with the knowledge that the
device hosting the action is currently bound to a driver. The
port->uport_dev is known to be bound in this path because decoder
devices are only registered in the bound lifetime of ->uport_dev.
The device_lock() is not needed to make sure the release action can be
called, however, *some* synchronization is needed for racing requesters
as you found.
> Fix this by replacing devm_release_action() with devm_remove_action_nowarn()
> followed by a manual call to unregister_region(). devm_remove_action_nowarn()
> removes the devres tracking entry and returns an error code.
No, that is not an acceptable or comprehensive fix. A subsystem should
never try to double unregister a device. Keep in mind the decoder sysfs
stays alive while the devres_release_all() actions are running for
port->uport_dev. As a result, yes, Davidlohr is right. You can
theoretically observe the unregister_region() release action firing
while delete_region_store() is running.
The only comprehensive fix I currently see to that problem is to indeed
get it back synchronized under the device lock. This prevents multiple
requesters from colliding, and from devres_release_all() deleting an
action that delete_region_store() already committed to handling.
This looks something like schedule the devres action to be released in
workqueue under guard(device)(&port->udev).
It is ugly, but it may be the case that this wants to synchronously
delete the region, and asynchronously cleanup the release action.
I.e. unregister_region() grows a new:
if (!test_and_set_bit(CXL_REGION_F_DELETE, ...)
...flag to enforce only one path successfully deletes.
delete_region_store() calls unregister_region() directly. Then the
workqueue's job is to trigger the release action under the lock only if
port->uport_dev is still driver-attached by the time the workqueue runs.
In the meantime the workqueue needs to hold a reference on the region
device until it can observe the state of CXL_REGION_F_DELETE under the
device lock.
Otherwise I suspect if you put a device_lock() in delete_region_store()
it will deadlock.
> ------------[ cut here ]------------
> WARNING: drivers/base/devres.c:824 at devm_release_action drivers/base/devres.c:824 [inline], CPU#0: syz.1.12224/47589
> WARNING: drivers/base/devres.c:824 at devm_release_action+0x2b2/0x360 drivers/base/devres.c:817, CPU#0: syz.1.12224/47589
For the next posting, these 2 lines are sufficient, no need for all the
extra detail of a full backtrace.