Re: [PATCH net 5/6] net/mlx5e: Fix deadlocks between devlink and netdev instance locks

From: Jacob Keller

Date: Thu Feb 12 2026 - 17:41:39 EST

On 2/12/2026 2:32 AM, Tariq Toukan wrote:

From: Cosmin Ratiu <cratiu@xxxxxxxxxx>

In the mentioned "Fixes" commit, various work tasks triggering devlink
health reporter recovery were switched to use netdev_trylock to protect
against concurrent tear down of the channels being recovered. But this
had the side effect of introducing potential deadlocks because of
incorrect lock ordering.

The correct lock order is described by the init flow:
probe_one -> mlx5_init_one (acquires devlink lock)
-> mlx5_init_one_devl_locked -> mlx5_register_device
-> mlx5_rescan_drivers_locked -...-> mlx5e_probe -> _mlx5e_probe
-> register_netdev (acquires rtnl lock)
-> register_netdevice (acquires netdev lock)
=> devlink lock -> rtnl lock -> netdev lock.
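
As a rough illustration of the ordering rule above, the following user-space
sketch assigns each lock a rank and rejects any acquisition that goes
backwards. The ranks, struct lock_tracker and the tracker_* helpers are
hypothetical names made up for this example; they are not kernel or lockdep
APIs.

```c
#include <stdbool.h>

/* Hypothetical ranks: a lower rank must be taken before a higher one,
 * mirroring devlink lock -> rtnl lock -> netdev lock. */
enum lock_rank { RANK_DEVLINK = 1, RANK_RTNL = 2, RANK_NETDEV = 3 };

#define TRACKER_MAX_DEPTH 8

/* Stack of ranks currently held by one execution context. */
struct lock_tracker {
	int held[TRACKER_MAX_DEPTH];
	int depth;
};

/* Record an acquisition; returns false if it would violate the order
 * (or if the tracker is full). */
static bool tracker_acquire(struct lock_tracker *t, enum lock_rank rank)
{
	if (t->depth >= TRACKER_MAX_DEPTH)
		return false;
	if (t->depth > 0 && t->held[t->depth - 1] >= (int)rank)
		return false;
	t->held[t->depth++] = (int)rank;
	return true;
}

/* Drop the most recently acquired lock. */
static void tracker_release(struct lock_tracker *t)
{
	if (t->depth > 0)
		t->depth--;
}
```

With this model, taking RANK_DEVLINK, then RANK_RTNL, then RANK_NETDEV
succeeds, while taking RANK_NETDEV first and RANK_DEVLINK second is rejected.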

But in the current recovery flow, the order is wrong:
mlx5e_tx_err_cqe_work (acquires netdev lock)
-> mlx5e_reporter_tx_err_cqe -> mlx5e_health_report
-> devlink_health_report (acquires devlink lock => boom!)
-> devlink_health_reporter_recover
-> mlx5e_tx_reporter_recover -> mlx5e_tx_reporter_recover_from_ctx
-> mlx5e_tx_reporter_err_cqe_recover

The same pattern exists in:
mlx5e_reporter_rx_timeout
mlx5e_reporter_tx_ptpsq_unhealthy
mlx5e_reporter_tx_timeout

Fix these by moving the netdev_trylock calls out of the work handlers
and lower in the call stack, into the respective recovery functions,
where they are actually necessary.
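
The shape of the fix can be sketched in user space with pthread mutexes
standing in for the devlink and netdev instance locks. This is only a model
under that assumption: the model_* functions and the two mutexes are
hypothetical and do not correspond to actual mlx5e or devlink code.

```c
#include <pthread.h>

/* Hypothetical stand-ins for the devlink and netdev instance locks. */
static pthread_mutex_t devlink_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t netdev_lock = PTHREAD_MUTEX_INITIALIZER;

/* Models a recovery function. It runs with devlink_lock already held,
 * so taking the netdev lock here follows devlink -> netdev. */
static int model_recover(void)
{
	if (pthread_mutex_trylock(&netdev_lock) != 0)
		return -1; /* lock is busy (e.g. teardown), skip recovery */
	/* ... recover the channel while it cannot be torn down ... */
	pthread_mutex_unlock(&netdev_lock);
	return 0;
}

/* Models the health report path: take the devlink lock, then recover. */
static int model_health_report(void)
{
	int err;

	pthread_mutex_lock(&devlink_lock);
	err = model_recover();
	pthread_mutex_unlock(&devlink_lock);
	return err;
}

/* Models a work handler after the fix: it no longer takes the netdev
 * lock itself, so nothing is held when the devlink lock is acquired. */
static int model_err_cqe_work(void)
{
	return model_health_report();
}
```

In this model the work handler returns 0 when the netdev lock is free and -1
when another context already holds it, and the netdev lock is never held
while waiting for the devlink lock.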

Fixes: 8f7b00307bf1 ("net/mlx5e: Convert mlx5 netdevs to instance locking")
Signed-off-by: Cosmin Ratiu <cratiu@xxxxxxxxxx>
Reviewed-by: Dragos Tatulea <dtatulea@xxxxxxxxxx>
Signed-off-by: Tariq Toukan <tariqt@xxxxxxxxxx>
---

Reviewed-by: Jacob Keller <jacob.e.keller@xxxxxxxxx>