Re: [PATCH net V4 3/4] net/mlx5e: SD, Fix missing cleanup on probe error

From: Shay Drori

Date: Thu Apr 30 2026 - 08:53:18 EST




On 30/04/2026 4:42, Jakub Kicinski wrote:
External email: Use caution opening links or attachments


This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net/mlx5e: SD, Fix missing cleanup on probe error

This commit adds a missing mlx5_sd_cleanup() call to the probe error path.
The commit message notes a similar gap exists on the resume path leaving
sd == NULL on failure, but defers the fix to a follow-up series.

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 5a46870c4b74..e21affd0ffc4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
[ ... ]
@@ -6912,9 +6912,16 @@ static int mlx5e_probe(struct auxiliary_device *adev,
return err;

actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
- if (actual_adev)
- return _mlx5e_probe(actual_adev);
+ if (actual_adev) {
+ err = _mlx5e_probe(actual_adev);
+ if (err)
+ goto sd_cleanup;
+ }
return 0;
+
+sd_cleanup:
+ mlx5_sd_cleanup(mdev);
+ return err;
}

static void _mlx5e_remove(struct auxiliary_device *adev)

Will leaving the resume gap unfixed as mentioned in the commit message cause
a NULL pointer dereference when a secondary device is later removed?

If a secondary device fails resume, sd becomes NULL. Later during removal,
mlx5_sd_get_adev() will check !sd and return the secondary adev.

This causes _mlx5e_remove() to be executed on the secondary adev:

_mlx5e_remove()
{
struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
struct mlx5e_dev *mlx5e_dev = auxiliary_get_drvdata(adev);
struct net_device *netdev = mlx5e_dev->netdev;
...
}

Since _mlx5e_probe() is never called on secondary devices,
auxiliary_get_drvdata(adev) returns NULL. Does this mean accessing
mlx5e_dev->netdev will crash?

Would it be safer to add a NULL check for mlx5e_dev in _mlx5e_remove()
until the proper PM lifecycle fixes are implemented?

IMO, a proper fix is needed here. PM lifecycle is implemented, the bug
is in case of error flow...