Re: [PATCH] virtio-net: disable delayed refill when pausing rx

From: Bui Quang Minh
Date: Thu Apr 10 2025 - 03:04:33 EST


On 4/9/25 13:44, Bui Quang Minh wrote:
On 4/8/25 14:34, Jason Wang wrote:
On Mon, Apr 7, 2025 at 10:27 AM Bui Quang Minh <minhquangbui99@xxxxxxxxx> wrote:
On 4/7/25 08:03, Xuan Zhuo wrote:
On Fri,  4 Apr 2025 16:39:03 +0700, Bui Quang Minh <minhquangbui99@xxxxxxxxx> wrote:
When pausing rx (e.g. set up xdp, xsk pool, rx resize), we call
napi_disable() on the receive queue's napi. In delayed refill_work, it
also calls napi_disable() on the receive queue's napi. This can leads to
deadlock when napi_disable() is called on an already disabled napi. This
scenario can be reproducible by binding a XDP socket to virtio-net
interface without setting up the fill ring. As a result, try_fill_recv
will fail until the fill ring is set up and refill_work is scheduled.
So, what is the problem? The refill_work is waiting? As I know, that thread
will sleep some time, so the cpu can do other work.
When napi_disable is called on an already disabled napi, it will sleep
in napi_disable_locked while still holding the netdev_lock. As a result,
later napi_enable gets stuck too as it cannot acquire the netdev_lock.
This leads to refill_work and the pause-then-resume tx are stuck altogether.
This needs to be added to the chagelog. And it looks like this is a fix for

commit 413f0271f3966e0c73d4937963f19335af19e628
Author: Jakub Kicinski <kuba@xxxxxxxxxx>
Date:   Tue Jan 14 19:53:14 2025 -0800

     net: protect NAPI enablement with netdev_lock()

?

I'm not aware of this, will update the fix tags in the next patch.

I wonder if it's simpler to just hold the netdev lock in resize or xsk
binding instead of this.

That looks cleaner, let me try that approach.

We need to hold the netdev_lock in the refill work too. Otherwise, we get this race

refill_work                                xdp bind
napi_disable()

                                            netdev_lock(dev)
                                            napi_disable_locked() <- hang
napi_enable() <- cannot acquire the lock


I don't like the idea to hold netdev_lock in refill_work. I will reply with the patch for you to review.

Thanks,
Quang Minh.