RE: [PATCH] rhashtable: Fix potential deadlock by moving schedule_work outside lock

From: Michael Kelley
Date: Wed Jan 08 2025 - 22:16:19 EST


From: Breno Leitao <leitao@xxxxxxxxxx> Sent: Thursday, January 2, 2025 2:16 AM
>
> On Sat, Dec 21, 2024 at 05:06:55PM +0800, Herbert Xu wrote:
> > On Thu, Dec 12, 2024 at 08:33:31PM +0800, Herbert Xu wrote:
> > >
> > > The growth check should stay with the atomic_inc. Something like
> > > this should work:
> >
> > OK I've applied your patch with the atomic_inc move.
>
> Sorry, I was on vacation, and I am back now. Let me know if you need
> anything further.
>
> Thanks for fixing it,
> --breno

Breno and Herbert --

This patch seems to break things in linux-next. I'm testing with
linux-next-20250108 in a VM in the Azure public cloud. The Mellanox mlx5
Ethernet NIC in the VM is failing to get set up.

I bisected to commit e1d3422c95f0 ("rhashtable: Fix potential deadlock
by moving schedule_work outside lock"), then debugged why opening
the mlx5 NIC device is failing. The failure is in the XDP code in function
__xdp_reg_mem_model() where the call to rhashtable_insert_slow()
is returning -E2BIG. The problem does not occur when the commit
is reverted.

The function call stack is this:

dev_open()
__dev_open()
mlx5e_open()
mlx5e_open_locked()
mlx5e_open_channels()
mlx5e_open_channel()
mlx5e_open_queues()
mlx5e_open_rxq_rq()
mlx5e_open_rq()
mlx5e_alloc_rq()
xdp_rxq_info_reg_mem_model()
__xdp_reg_mem_model()
rhashtable_insert_slow()

I have not debugged further as I don't know anything about the
rhashtable code or the XDP code. The only repro I have is a VM
in Azure. I thought I'd ask you (Breno and Herbert) to review
the patch again and see if there's a path that could cause the
hash table to be incorrectly detected as full.
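
To make "incorrectly detected as full" concrete, here is a toy user-space
sketch in C. It is not the kernel's rhashtable code and not a diagnosis of
commit e1d3422c95f0. Every name in it (toy_table, toy_insert, TOY_MAX_ELEMS,
the count_early flag) is invented for illustration, and duplicate keys are
just one made-up way for an element counter to drift away from what is
actually stored. It only shows the general failure shape: if the counter is
bumped on a path that stores nothing, a later insert can see -E2BIG while
the table is nearly empty.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical size limit, chosen small so the effect is easy to see. */
#define TOY_MAX_ELEMS 4u

/* Toy table: NOT struct rhashtable, just enough state for the demo. */
struct toy_table {
	unsigned int nelems;		/* bookkeeping counter used by the "full" check */
	unsigned int stored;		/* number of keys actually held in keys[] */
	int keys[TOY_MAX_ELEMS];
};

/* Return 1 if key is already in the table, else 0. */
static int toy_contains(const struct toy_table *t, int key)
{
	for (size_t i = 0; i < t->stored; i++)
		if (t->keys[i] == key)
			return 1;
	return 0;
}

/*
 * Insert key. With count_early != 0 the counter is incremented before it
 * is known whether anything will be stored (the hypothetical bad ordering).
 * With count_early == 0 the counter is incremented only after a successful
 * store, so nelems always equals stored.
 */
static int toy_insert(struct toy_table *t, int key, int count_early)
{
	if (t->nelems >= TOY_MAX_ELEMS)
		return -E2BIG;		/* table believed to be full */

	if (count_early)
		t->nelems++;		/* counted even if the insert fails below */

	if (toy_contains(t, key))
		return -EEXIST;		/* nothing stored on this path */

	t->keys[t->stored++] = key;

	if (!count_early)
		t->nelems++;		/* counted only after a real store */

	return 0;
}
```

With count_early set, inserting key 1 once and then retrying the same key
three more times leaves stored at 1 but nelems at 4, so the next insert of a
new key returns -E2BIG. With count_early clear, the same sequence leaves
nelems at 1 and the new key is accepted.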

I've included the linux-hyperv mailing list and the mlx5 driver
maintainers on this email. Someone involved with Azure/Hyper-V
or the mlx5 driver may have seen the problem, and I want to try
to avoid duplicative debugging.

Let me know if there's something I can do to help debug further.

Thanks,

Michael Kelley