Re: [PATCH] RDMA/nldev: add mutual exclusion in nldev_dellink()

From: Jason Gunthorpe

Date: Thu May 14 2026 - 10:30:34 EST


On Thu, May 14, 2026 at 07:58:18AM -0600, David Ahern wrote:
> On 5/14/26 5:50 AM, Jason Gunthorpe wrote:
> > On Thu, May 14, 2026 at 03:31:22PM +0800, Edward Adam Davis wrote:
> >> On Wed, 13 May 2026 20:46:55 -0300, Jason Gunthorpe wrote:
> >>> On Wed, May 13, 2026 at 02:17:28PM -0400, Leon Romanovsky wrote:
> >>>>
> >>>> On Thu, 07 May 2026 20:50:10 +0800, Edward Adam Davis wrote:
> >>>>> We must serialize calls to nldev_dellink() or risk a crash as syzbot
> >>>>> reported:
> >>>>>
> >>>>> Call Trace:
> >>>>> udp_tunnel_sock_release+0x6d/0x80 net/ipv4/udp_tunnel_core.c:197
> >>>>> rxe_release_udp_tunnel drivers/infiniband/sw/rxe/rxe_net.c:294 [inline]
> >>>>> rxe_sock_put drivers/infiniband/sw/rxe/rxe_net.c:639 [inline]
> >>>>> rxe_net_del+0xfb/0x290 drivers/infiniband/sw/rxe/rxe_net.c:660
> >>>>> rxe_dellink+0x15/0x20 drivers/infiniband/sw/rxe/rxe.c:254
> >>>>>
> >>>>> [...]
> >>>>
> >>>> Applied, thanks!
> >>>>
> >>>> [1/1] RDMA/nldev: add mutual exclusion in nldev_dellink()
> >>>> https://git.kernel.org/rdma/rdma/c/0b28000b64f40d
> >>>
> >>> This seems like a rxe bug, I would have expected the lock to be inside
> >>> rxe to protect its racy implementation of rxe_net_del(), which looks
> >>> like it is possibly also triggered by NETDEV_UNREGISTER...
> >> No, it was triggered by RDMA_NLDEV_CMD_DELLINK, you can see the "call trace".
>
> Not that Jason's point. Code wise
>
> rxe_dellink -> rxe_net_del
>
> netdev NETDEV_UNREGISTER:
> rxe_notify -> rxe_net_del
>
> both can lead to the same problem
>
> >>>
> >>> ie it should not change nldev_dellink().
> >> While this could be fixed within RXE, the same issue affects all other
> >> RXE-like submodules when they subsequently support the "dellink" interface,
> >> therefore, handling this within nldev_dellink() is relatively more appropriate.
> >
> > Why would other modules have an issue? The problem is rxe's racey
> > refcounting scheme for its lazy socket creation. There is nothing
> > wrong with nldev, and now you've created some nasty BKL in the nldev
> > code to fix rxe while ignoring its other races.
>
> +1

Edward, please come with a fixup on top of this since it was already
applied

Jason