Re: [syzbot] [rdma?] kernel BUG in ib_device_get_by_index
From: Tetsuo Handa
Date: Tue Mar 03 2026 - 08:48:10 EST
On 2026/03/03 4:17, Leon Romanovsky wrote:
> On Sat, Feb 28, 2026 at 02:07:46PM +0900, Tetsuo Handa wrote:
>> Hmm, this assertion was wrong because ib_device_get_by_index()
>> might be called before enable_device_and_get() is called.
>>
>> #syz invalid
>
> I think this is a valid syzkaller report. As you correctly noted, the device
> was inserted into the xarray database in assign_name(), but its refcount was
> only set later in enable_device_and_get().
I was wondering why enable_device_and_get() uses refcount_set() rather than
refcount_add(), so I tried
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit?id=510cd4b7d46753b4bf0f57004aa7b53b91b2b25a
in case commit 9af0feae8016 ("RDMA/core: Fix stale RoCE GIDs during netdev
events at registration") unexpectedly triggered modification of ->refcount
before refcount_set(&device->refcount, 2) is called.
From this syzbot report, however, I concluded that enable_device_and_get()
uses refcount_set() because refcount_add() cannot be used while ->refcount == 0.
Therefore, it is safe to call ib_device_try_get() before enable_device_and_get()
calls refcount_set(): refcount_inc_not_zero() simply fails while the count is 0.
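
A minimal user-space sketch of this reasoning, assuming C11 atomics as a stand-in for the kernel's refcount_t. The names model_device, model_try_get(), model_enable() and model_self_test() are hypothetical and exist only for illustration; this is not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct ib_device; only the refcount is modeled. */
struct model_device {
	atomic_uint refcount;
};

/* Models refcount_inc_not_zero(): take a reference only if the count is nonzero. */
bool model_try_get(struct model_device *dev)
{
	unsigned int old = atomic_load(&dev->refcount);

	while (old != 0) {
		/* On failure, old is refreshed with the current value and we retry. */
		if (atomic_compare_exchange_weak(&dev->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Models the refcount_set(&device->refcount, 2) step in enable_device_and_get(). */
void model_enable(struct model_device *dev)
{
	atomic_store(&dev->refcount, 2);
}

/* Walks through the two phases: lookup before and after the refcount is set. */
void model_self_test(void)
{
	struct model_device dev;

	atomic_init(&dev.refcount, 0);
	assert(!model_try_get(&dev));            /* before enable: try_get fails */
	assert(atomic_load(&dev.refcount) == 0); /* and leaves the count untouched */

	model_enable(&dev);
	assert(model_try_get(&dev));             /* after enable: try_get succeeds */
	assert(atomic_load(&dev.refcount) == 3);
}
```

In this model a lookup that races ahead of the enable step never touches a zero count, which is why an unconditional add would be the wrong primitive there while a set is fine.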
>
> The proper fix can be something like that:
>
>  	down_read(&devices_rwsem);
>  	device = xa_load(&devices, index);
> -	if (device) {
> +	if (device && xa_get_mark(&devices, index, DEVICE_REGISTERED)) {
>  		if (!rdma_dev_access_netns(device, net)) {
>  			device = NULL;
>  			goto out;
>  		}
>  		if (!ib_device_try_get(device))
>  			device = NULL;
>  	}
Why do you want to make this change? Unless it is unsafe to call
rdma_dev_access_netns() when DEVICE_REGISTERED is not set,
refcount_inc_not_zero() in ib_device_try_get() already makes the final
result the same (i.e. device == NULL).

Since enable_device_and_get() sets ->refcount immediately before
xa_set_mark() is called, adding an xa_get_mark() check does not change
the effective behavior.
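
The comparison can be sketched in a single-threaded model. Everything here is hypothetical (model_slot, lookup_current(), lookup_with_mark(), lookups_agree()): the rdma_dev_access_netns() check is left out, and the sketch assumes the intent of the proposed check is to return NULL when the mark is not set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one xarray slot: its refcount and REGISTERED mark. */
struct model_slot {
	unsigned int refcount; /* stands in for device->refcount */
	bool registered;       /* stands in for the DEVICE_REGISTERED mark */
};

/* Models refcount_inc_not_zero() without concurrency. */
bool model_try_get(struct model_slot *s)
{
	if (s->refcount == 0)
		return false;
	s->refcount++;
	return true;
}

/* Current lookup: rely on the refcount check alone. */
struct model_slot *lookup_current(struct model_slot *s)
{
	if (s && !model_try_get(s))
		s = NULL;
	return s;
}

/* Proposed lookup (assumed intent): additionally require the mark. */
struct model_slot *lookup_with_mark(struct model_slot *s)
{
	if (!s || !s->registered)
		return NULL;
	return model_try_get(s) ? s : NULL;
}

/* True when both variants agree on whether a device is returned. */
bool lookups_agree(unsigned int refcount, bool registered)
{
	struct model_slot a = { refcount, registered };
	struct model_slot b = a;

	return (lookup_current(&a) != NULL) == (lookup_with_mark(&b) != NULL);
}

/* Checks the two stable states described above. */
void model_self_check(void)
{
	assert(lookups_agree(0, false)); /* inserted, not yet enabled */
	assert(lookups_agree(2, true));  /* enabled and marked registered */
}
```

The only state in which the two variants differ in this model is a nonzero refcount with the mark still unset, i.e. the short interval between refcount_set() and xa_set_mark().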
What I worry about more is that refcount_set() is called too early if
there is an ib_device_try_get() user who expects that
device->ops.enable_driver()/add_client_context()/add_compat_devs()
have already completed when ib_device_try_get() succeeded.