Re: [syzbot] [rdma?] WARNING in ib_dealloc_device
From: Jason Gunthorpe
Date: Tue Apr 14 2026 - 08:19:26 EST
On Tue, Apr 14, 2026 at 01:47:01PM +0300, Leon Romanovsky wrote:
> On Mon, Apr 13, 2026 at 02:42:28PM -0300, Jason Gunthorpe wrote:
> > On Mon, Apr 13, 2026 at 04:12:09PM +0000, Jiri Pirko wrote:
> > > Will check it tmrw
> >
> > I fed it to Claude and after 40 mins it is stumped too.. It should not
> > be possible for this to happen.
>
> Interesting, I used Chris's prompts for this debug and got the following
> suggestions (CONFIG_PREEMPT_RT=y in this .config):
>
> ------------------------------------------------------------------------
> REMAINING HYPOTHESES
> ------------------------------------------------------------------------
>
> 1. PREEMPT_RT rwsem behavior (most likely for syzkaller SOFTLOCKUP trigger):
> Under PREEMPT_RT, down_write/down_read use rt_mutex internally. Priority
> inheritance and preemption semantics differ from non-RT. There may be a
> window in the rwsem downgrade path inside enable_device_and_get (which
> downgrades from WRITE to READ after setting DEVICE_REGISTERED) that allows
> a concurrent disable_device to observe an inconsistent state.
Is this actually true? What is the point of implementing
downgrade_write like this?
> Specifically: enable_device_and_get does:
> down_write(devices_rwsem)
> xa_set_mark(DEVICE_REGISTERED)
> downgrade_write(devices_rwsem) [WRITE -> READ]
> add_compat_devs()
> up_read(devices_rwsem)
>
> Under PREEMPT_RT, could disable_device acquire WRITE between the xa_set_mark
> and downgrade_write? If so, it would clear DEVICE_REGISTERED while
> add_compat_devs is about to run (but hasn't yet seen the mark cleared).
This is half a thought, okay, so even if they race, the entry to
remove_compat_devs() is still gated by:

	/* Pairs with refcount_set in enable_device */
	ib_device_put(device);
	wait_for_completion(&device->unreg_completion);

And we still have the refcount guarding it:

	refcount_set(&device->refcount, 2);
	down_write(&devices_rwsem);
	xa_set_mark(&devices, device->index, DEVICE_REGISTERED);

So we can't race add_compat_devs() and remove_compat_devs() like this
unless there is some way for the refcount to have been dropped to zero
also. I don't think there is.
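
To make the gating argument concrete, here is a minimal userspace sketch,
not the kernel code. It assumes ib_device_put() completes unreg_completion
only when the refcount reaches zero, and that the second of the two
references is held by the enable path until add_compat_devs() is done.
struct model_dev and all the model_* helpers are hypothetical names made
up for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct ib_device; not kernel code. */
struct model_dev {
	atomic_int refcount;   /* models device->refcount */
	bool unreg_completed;  /* models device->unreg_completion being signalled */
	bool registered;       /* models the DEVICE_REGISTERED mark */
};

/* Models refcount_set(&device->refcount, 2) followed by setting the mark. */
static void model_enable(struct model_dev *dev)
{
	atomic_store(&dev->refcount, 2);
	dev->registered = true;
}

/* Models ib_device_put(): assumed to complete only on the final put. */
static void model_put(struct model_dev *dev)
{
	/* atomic_fetch_sub() returns the value held before the subtraction. */
	if (atomic_fetch_sub(&dev->refcount, 1) == 1)
		dev->unreg_completed = true;
}

/*
 * Models the disable side up to the wait: clear the mark, drop the
 * reference that pairs with model_enable(), then report whether
 * wait_for_completion() would return, i.e. whether the modelled
 * remove_compat_devs() would be allowed to start.
 */
static bool model_disable_may_proceed(struct model_dev *dev)
{
	dev->registered = false;
	model_put(dev);
	return dev->unreg_completed;
}
```

In this model, even if the disable side clears the mark early, it cannot
get past the wait while the enable side still holds its reference. Only
the enable side's own model_put(), after the modelled add_compat_devs(),
signals the completion.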
> 2. xa_for_each skipping entries during concurrent xa_erase restructuring:
> If rdma_dev_exit_net's remove_one_compat_dev erases an entry concurrently
> with remove_compat_devs iterating, xas_shrink (called inside xa_erase) could
> restructure the xarray tree. If xa_find_after then traverses a restructured
> tree and skips a subsequent entry, that entry remains in compat_devs.
This race is also impossible due to the mark and the refcount.
> This is subtle: xa_erase takes the xarray spinlock (or rt_mutex), but
> xa_for_each calls xa_find_after under RCU. The RCU read side might see a
> partially-restructured tree that looks different from the spinlock-visible
> view. Under PREEMPT_RT, RCU critical sections can be longer.
>
> 3. rdma_compatdev_set (ib_devices_shared_netns sysctl) race:
> add_all_compat_devs() is guarded by DEVICE_REGISTERED + devices_rwsem, so
> the same analysis as T3a applies and the race is eliminated. However, if
> there is a remove_all_compat_devs() implementation, its interaction with
> the unregistration flow deserves verification.
Huh? Your Claude has lost its mind :)
Jason