Re: [PATCH net-next v5 0/2] net: Split ndo_set_rx_mode into snapshot

In reply to: Jakub Kicinski: "Re: [PATCH net-next v5 0/2] net: Split ndo_set_rx_mode into snapshot"

From: I Viswanath

Date: Fri Nov 21 2025 - 12:49:05 EST

On Thu, 20 Nov 2025 at 20:47, Jakub Kicinski <kuba@xxxxxxxxxx> wrote:

> Running
>
> make -C tools/testing/selftests TARGETS="drivers/net/virtio_net" run_tests

This bug seems to be caused by probe() being followed by remove()
without dev_open() ever being called in between: dev->rx_mode_ctx is
allocated in dev_open(), so it has not been allocated yet when remove()
runs. Modifying netif_rx_mode_flush_work() to call flush_work() only
when netif_running() is true seems to fix this specific bug.
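To make the guard concrete, here is a minimal userspace sketch, not the
actual patch. The guard_* names are hypothetical stand-ins, and the model
assumes the context pointer stays NULL until the device is opened:

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins; these are NOT the kernel definitions. */
struct guard_ctx {
    int flushes;                    /* counts modelled flush_work() calls */
};

struct guard_netdev {
    bool running;                   /* models netif_running(dev) */
    struct guard_ctx *rx_mode_ctx;  /* assumed NULL until the device is opened */
};

/* Models flush_work(): it dereferences the context, so the context must exist. */
static void guard_flush_work(struct guard_ctx *ctx)
{
    ctx->flushes++;
}

/*
 * Models the guarded netif_rx_mode_flush_work(): skip the flush when the
 * device was never opened. Returns true if a flush was performed.
 */
static bool guard_rx_mode_flush_work(struct guard_netdev *dev)
{
    if (!dev->running)
        return false;

    guard_flush_work(dev->rx_mode_ctx);
    return true;
}
```

In this model, the probe() then remove() sequence corresponds to calling
guard_rx_mode_flush_work() with running == false, which returns without
touching the unallocated context.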

However, I found the following deadlock while trying to reproduce that:

dev_close():
    rtnl_lock();
    cancel_work_sync(); // wait for netif_rx_mode_write_active to complete

netif_rx_mode_write_active(): // from the work item
    rtnl_lock(); // wait for the rtnl lock to be released
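The cycle can be illustrated with a single-threaded toy model. This is
only a sketch: the lock is a plain flag rather than a real mutex, nothing
actually blocks, and every toy_* name is hypothetical. It just answers
whether the wait in the close path could ever finish:

```c
#include <stdbool.h>

/* Toy stand-in for the rtnl lock: a held/not-held flag, no real blocking. */
struct toy_lock {
    bool held;
};

static bool toy_trylock(struct toy_lock *l)
{
    if (l->held)
        return false;
    l->held = true;
    return true;
}

static void toy_unlock(struct toy_lock *l)
{
    l->held = false;
}

/* Models the work item: it can only finish if it can take the lock. */
static bool toy_work_can_finish(struct toy_lock *rtnl)
{
    if (!toy_trylock(rtnl))
        return false;
    toy_unlock(rtnl);
    return true;
}

/*
 * Models dev_close(): take the lock, then wait for the work item.
 * Returns true if that wait could ever complete.
 */
static bool toy_close_wait_completes(struct toy_lock *rtnl, bool work_running)
{
    bool ok;

    (void)toy_trylock(rtnl);                            /* rtnl_lock() */
    ok = !work_running || toy_work_can_finish(rtnl);    /* cancel_work_sync() */
    toy_unlock(rtnl);                                   /* rtnl_unlock() */
    return ok;
}
```

With a work item in flight, toy_close_wait_completes() returns false:
the closer holds the lock the work item needs, and the closer will not
release it until the work item is done.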

I can't find a good way to solve this without moving part of the
allocation logic into alloc_netdev_mqs(), since we need the work struct
to stay alive after the device is closed. If that's really the most
reasonable solution, does this look good?

struct netif_rx_mode_ctx *rx_mode_ctx; /* dev->rx_mode_ctx */

struct netif_rx_mode_ctx {
    struct work_struct rx_mode_work;
    struct netif_rx_mode_active_ctx *active_ctx;
    int state;
};

struct netif_rx_mode_active_ctx {
    struct net_device *dev;
    struct netif_rx_mode_config *ready;
    struct netif_rx_mode_config *pending;
};

rx_mode_ctx will be allocated and freed in alloc_netdev_mqs()/free_netdev(),
while active_ctx will be allocated and freed in dev_open()/dev_close().

The core would then never call flush_work()/cancel_work_sync() on this
work item, since doing so is a guaranteed deadlock given how everything
is serialized under the rtnl lock.
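A rough userspace sketch of that split lifetime, assuming plain
calloc()/free() in place of the kernel allocators. The life_* names are
hypothetical and the structs are trimmed to the two pointers that matter:

```c
#include <stdlib.h>

/* Userspace stand-ins; the real structs carry the fields listed above. */
struct life_active_ctx {
    int placeholder;
};

struct life_rx_mode_ctx {
    struct life_active_ctx *active_ctx;   /* valid only while the device is open */
};

struct life_netdev {
    struct life_rx_mode_ctx *rx_mode_ctx; /* lives as long as the netdev */
};

/* Models alloc_netdev_mqs(): the outer context is created with the device. */
static struct life_netdev *life_alloc_netdev(void)
{
    struct life_netdev *dev = calloc(1, sizeof(*dev));

    if (!dev)
        return NULL;
    dev->rx_mode_ctx = calloc(1, sizeof(*dev->rx_mode_ctx));
    if (!dev->rx_mode_ctx) {
        free(dev);
        return NULL;
    }
    return dev;
}

/* Models dev_open(): only the active part is allocated here. */
static int life_open(struct life_netdev *dev)
{
    dev->rx_mode_ctx->active_ctx = calloc(1, sizeof(struct life_active_ctx));
    return dev->rx_mode_ctx->active_ctx ? 0 : -1;
}

/* Models dev_close(): drop the active part, keep the outer context. */
static void life_close(struct life_netdev *dev)
{
    free(dev->rx_mode_ctx->active_ctx);
    dev->rx_mode_ctx->active_ctx = NULL;
}

/* Models free_netdev(): the outer context goes away with the device. */
static void life_free_netdev(struct life_netdev *dev)
{
    free(dev->rx_mode_ctx);
    free(dev);
}
```

In this model, life_alloc_netdev() followed directly by
life_free_netdev() (the probe() then remove() case) never touches an
unallocated context, and the outer context survives life_close().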