Re: [PATCH] RDMA/nldev: add mutual exclusion in nldev_dellink()

From: Edward Adam Davis

Date: Sat May 16 2026 - 08:41:04 EST


On Thu, 14 May 2026 11:14:09 -0300, Jason Gunthorpe wrote:
> On Thu, May 14, 2026 at 07:58:18AM -0600, David Ahern wrote:
> > On 5/14/26 5:50 AM, Jason Gunthorpe wrote:
> > > On Thu, May 14, 2026 at 03:31:22PM +0800, Edward Adam Davis wrote:
> > >> On Wed, 13 May 2026 20:46:55 -0300, Jason Gunthorpe wrote:
> > >>> On Wed, May 13, 2026 at 02:17:28PM -0400, Leon Romanovsky wrote:
> > >>>>
> > >>>> On Thu, 07 May 2026 20:50:10 +0800, Edward Adam Davis wrote:
> > >>>>> We must serialize calls to nldev_dellink() or risk a crash as syzbot
> > >>>>> reported:
> > >>>>>
> > >>>>> Call Trace:
> > >>>>> udp_tunnel_sock_release+0x6d/0x80 net/ipv4/udp_tunnel_core.c:197
> > >>>>> rxe_release_udp_tunnel drivers/infiniband/sw/rxe/rxe_net.c:294 [inline]
> > >>>>> rxe_sock_put drivers/infiniband/sw/rxe/rxe_net.c:639 [inline]
> > >>>>> rxe_net_del+0xfb/0x290 drivers/infiniband/sw/rxe/rxe_net.c:660
> > >>>>> rxe_dellink+0x15/0x20 drivers/infiniband/sw/rxe/rxe.c:254
> > >>>>>
> > >>>>> [...]
> > >>>>
> > >>>> Applied, thanks!
> > >>>>
> > >>>> [1/1] RDMA/nldev: add mutual exclusion in nldev_dellink()
> > >>>> https://git.kernel.org/rdma/rdma/c/0b28000b64f40d
> > >>>
> > >>> This seems like a rxe bug, I would have expected the lock to be inside
> > >>> rxe to protect its racy implementation of rxe_net_del(), which looks
> > >>> like it is possibly also triggered by NETDEV_UNREGISTER...
> > >> No, it was triggered by RDMA_NLDEV_CMD_DELLINK; you can see that in the call trace.
> >
> > That's not Jason's point. Code-wise:
> >
> > rxe_dellink -> rxe_net_del
> >
> > netdev NETDEV_UNREGISTER:
> > rxe_notify -> rxe_net_del
> >
> > both can lead to the same problem
> >
> > >>>
> > >>> i.e. it should not change nldev_dellink().
> > >> While this could be fixed within RXE, the same issue would affect any other
> > >> RXE-like submodule that later adds support for the "dellink" interface.
> > >> Handling this within nldev_dellink() is therefore more appropriate.
> > >
> > > Why would other modules have an issue? The problem is rxe's racy
> > > refcounting scheme for its lazy socket creation. There is nothing
> > > wrong with nldev, and now you've created some nasty BKL in the nldev
> > > code to fix rxe while ignoring its other races.
> >
> > +1
>
> Edward, please come up with a fixup on top of this, since it was already
> applied.
OK.

Edward
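
For illustration only, here is a minimal userspace sketch of the direction
suggested in the thread: keep the lock inside the driver, so that every
caller of the teardown function is covered, whether it arrives through
dellink or through a netdev notifier. This is not the rxe code and not the
eventual fixup. fake_dev, fake_net_add(), fake_net_del() and
run_teardown_race() are hypothetical names, and a pthread mutex stands in
for whatever lock the driver would actually use.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical model, not the rxe code: one shared "socket", created lazily. */
struct fake_sock { int open; };
struct fake_dev { int holds_sock; };	/* 1 while this device owns a reference */

static pthread_mutex_t sock_lock = PTHREAD_MUTEX_INITIALIZER;
static struct fake_sock *shared_sock;	/* NULL until the first user appears */
static int sock_users;			/* number of devices holding a reference */
static int release_count;		/* how many times the socket was released */

/* Take a reference for @dev, creating the shared socket on first use. */
static int fake_net_add(struct fake_dev *dev)
{
	int ret = 0;

	pthread_mutex_lock(&sock_lock);
	if (sock_users == 0) {
		shared_sock = calloc(1, sizeof(*shared_sock));
		if (!shared_sock) {
			ret = -1;
			goto out;
		}
		shared_sock->open = 1;
	}
	sock_users++;
	dev->holds_sock = 1;
out:
	pthread_mutex_unlock(&sock_lock);
	return ret;
}

/*
 * Drop @dev's reference. The check of holds_sock and the refcount update
 * happen under the same lock, so a second caller for the same device
 * finds holds_sock already cleared and does nothing.
 */
static void fake_net_del(struct fake_dev *dev)
{
	pthread_mutex_lock(&sock_lock);
	if (dev->holds_sock) {
		dev->holds_sock = 0;
		if (--sock_users == 0) {
			free(shared_sock);
			shared_sock = NULL;
			release_count++;
		}
	}
	pthread_mutex_unlock(&sock_lock);
}

/* Both simulated entry points end up in the same teardown function. */
static void *teardown_path(void *arg)
{
	fake_net_del(arg);
	return NULL;
}

/* Race two teardown paths on one device; returns the number of releases. */
int run_teardown_race(void)
{
	struct fake_dev dev = { 0 };
	pthread_t dellink, notifier;

	release_count = 0;
	if (fake_net_add(&dev))
		return -1;
	pthread_create(&dellink, NULL, teardown_path, &dev);
	pthread_create(&notifier, NULL, teardown_path, &dev);
	pthread_join(dellink, NULL);
	pthread_join(notifier, NULL);
	assert(sock_users == 0 && shared_sock == NULL);
	return release_count;
}
```

Because the serialization lives in the teardown function itself, the
callers need no lock of their own. In this model a caller-side lock in only
one of the two paths would still leave the other path unprotected.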