Re: [PATCH net 1/3] net/mlx5e: SD, Fix race condition in secondary device probe/remove
From: Shay Drori
Date: Thu Apr 02 2026 - 16:03:53 EST
On 02/04/2026 6:08, Jakub Kicinski wrote:
External email: Use caution opening links or attachments
On Mon, 30 Mar 2026 22:34:10 +0300 Tariq Toukan wrote:
From: Shay Drory <shayd@xxxxxxxxxx>
When utilizing Socket-Direct single netdev functionality the driver
resolves the actual auxiliary device using mlx5_sd_get_adev(). However,
the current implementation returns the primary ETH auxiliary device
without holding the device lock, leading to a potential race condition
where the ETH device could be unbound or removed concurrently during
probe, suspend, resume, or remove operations.[1]
Fix this by introducing mlx5_sd_put_adev() and updating
mlx5_sd_get_adev() so that secondaries devices would acquire the device
lock of the returned auxiliary device. After the lock is acquired, a
second devcom check is needed[2].
In addition, update The callers to pair the get operation with the new
put operation, ensuring the lock is held while the auxiliary device is
being operated on and released afterwards.
Please explain why the "primary" designation is reliable, and therefore
we can be sure there will be no ABBA deadlock here
The "primary" designation is determined once in sd_register(). It's set
before devcom is marked ready, and it never changes after that.
In Addition, The primary path never locks a secondary: When the primary
device invoke mlx5_sd_get_adev(), it sees dev == primary and returns.
no additional lock is taken.
Therefore lock ordering is always: secondary_lock → primary_lock. The
reverse never happens, so ABBA deadlock is impossible.
Does the above is the explanation you looked for?
If not, can you elaborate?
If yes, to add it to the commit message in V2?
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index b6c12460b54a..5761f655f488 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -6657,8 +6657,11 @@ static int mlx5e_resume(struct auxiliary_device *adev)
return err;
actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
- if (actual_adev)
- return _mlx5e_resume(actual_adev);
+ if (actual_adev) {
+ err = _mlx5e_resume(actual_adev);
+ mlx5_sd_put_adev(actual_adev, adev);
+ return err;
+ }
return 0;
Feels like I recently complained about similar code y'all were trying
to add. Magically and conditionally locking something in a get helper
makes for extremely confusing code.
Do you think explicit locking API is preferred here?
something like:
new_locking_api()
mlx5_sd_get_adev()
new_unlocking_api()
thanks for the review
--
pw-bot: cr