Re: [PATCH net V4 4/4] net/mlx5e: SD, Fix race condition in secondary device probe/remove
From: Shay Drori
Date: Thu Apr 30 2026 - 09:04:45 EST
On 30/04/2026 4:42, Jakub Kicinski wrote:
External email: Use caution opening links or attachments
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net/mlx5e: SD, Fix race condition in secondary device probe/remove
This commit updates secondary devices to acquire the device lock of the
primary auxiliary device during probe, suspend, resume, and remove
operations. It introduces mlx5_sd_get_adev() and mlx5_sd_put_adev() to
manage this lock ordering and add a second devcom readiness check.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
index 7a1787f15320..a43ae482a679 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
[ ... ]
@@ -563,5 +567,18 @@ struct auxiliary_device *mlx5_sd_get_adev(struct mlx5_core_dev *dev,
if (dev == primary)
return adev;
+ device_lock(&primary->priv.adev[idx]->adev.dev);
If mlx5_devcom_comp_is_ready() returned true just before this, but the
secondary thread is preempted before reaching device_lock(), could a
concurrent removal of the primary device cause primary->priv.adev[idx]
to become NULL or freed?
It looks like the removal path (such as mlx5_unregister_device) frees the
mlx5_adev memory and explicitly sets primary->priv.adev[idx] to NULL. If
the secondary thread then resumes, it might dereference a NULL or freed
pointer when attempting to acquire the lock.
Since the second readiness check happens after the device_lock() call, does
this leave a window where memory corruption or a panic can still occur?
This is also correct and will fix in next version.
Additionally, since the driver core already holds device_lock(&adev->dev)
for the secondary device during callbacks like probe() and remove(), will
acquiring the primary device's lock here trigger a lockdep warning for
possible recursive locking?
Both the primary and secondary auxiliary devices use the same generic lock
class assigned in device_initialize(). While the fixed secondary to primary
ordering avoids an actual deadlock, lockdep might not distinguish between
the two mutexes without explicit lockdep subclassing.
+ /* In case primary finish removing its adev */
+ if (!mlx5_devcom_comp_is_ready(sd->devcom)) {
+ device_unlock(&primary->priv.adev[idx]->adev.dev);
+ return NULL;
+ }
return &primary->priv.adev[idx]->adev;
}
I test this code with KASAN and LOCKDEP enable, I didn't get any
splat...