Re: [PATCH net V3 3/4] net/mlx5e: SD, Fix missing cleanup on probe/resume error
From: Shay Drori
Date: Sun Apr 26 2026 - 06:46:50 EST
On 23/04/2026 15:31, Tariq Toukan wrote:
From: Shay Drory <shayd@xxxxxxxxxx>
When _mlx5e_probe() or _mlx5e_resume() fails, the preceding
successful mlx5_sd_init() is not undone, leaving the SD group in an
UP state without a matching cleanup.
Call mlx5_sd_cleanup() on the error path so the SD state is reset.
Fixes: 381978d28317 ("net/mlx5e: Create single netdev per SD group")
Signed-off-by: Shay Drory <shayd@xxxxxxxxxx>
Signed-off-by: Tariq Toukan <tariqt@xxxxxxxxxx>
---
.../net/ethernet/mellanox/mlx5/core/en_main.c | 22 +++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 5a46870c4b74..9c340ad2fe09 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -6774,9 +6774,16 @@ static int mlx5e_resume(struct auxiliary_device *adev)
return err;
actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
- if (actual_adev)
- return _mlx5e_resume(actual_adev);
+ if (actual_adev) {
+ err = _mlx5e_resume(actual_adev);
+ if (err)
+ goto sd_cleanup;
+ }
return 0;
+
+sd_cleanup:
+ mlx5_sd_cleanup(mdev);
+ return err;
}
Sashiko:
"If _mlx5e_resume() fails when called from a secondary device's mlx5e_resume(),
mlx5_sd_cleanup() is called on the secondary device. This frees the sd struct
and marks the devcom group as not ready.
Since this is a resume failure, the secondary device remains bound to the
driver. Later, when the driver is unbound, mlx5e_remove() is invoked on the
secondary device.
Could this result in a NULL pointer dereference?
When mlx5e_remove() calls mlx5_sd_get_adev(), it returns the secondary adev
itself because the sd struct was already freed:
mlx5_sd_get_adev() {
...
if (!sd)
return adev;
...
}
This leads to _mlx5e_remove() being erroneously called on the secondary
device. Inside _mlx5e_remove(), auxiliary_get_drvdata() is called.
Because drvdata is only set on the primary device during probe, this
returns NULL. Dereferencing mlx5e_dev->netdev would then cause a kernel
panic."
That is correct. In next version mlx5_sd_cleanup() won't be called in
mlx5e_resume(), and the commit message will elaborate to explain the
different handling between probe and resume.
static int _mlx5e_suspend(struct auxiliary_device *adev, bool pre_netdev_reg)
@@ -6912,9 +6919,16 @@ static int mlx5e_probe(struct auxiliary_device *adev,
return err;
actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
- if (actual_adev)
- return _mlx5e_probe(actual_adev);
+ if (actual_adev) {
+ err = _mlx5e_probe(actual_adev);
+ if (err)
+ goto sd_cleanup;
+ }
return 0;
+
+sd_cleanup:
+ mlx5_sd_cleanup(mdev);
+ return err;
}
static void _mlx5e_remove(struct auxiliary_device *adev)