Re: [PATCH net-next 03/13] net/mlx5: E-Switch, move devcom init from TC to eswitch layer

From: Shay Drori

Date: Thu May 28 2026 - 14:49:46 EST




On 27/05/2026 15:54, Tariq Toukan wrote:
From: Shay Drory <shayd@xxxxxxxxxx>

Move the E-Switch devcom component management from TC layer to ESW
layer.

This refactoring places devcom lifecycle management at the appropriate
layer and prepares for SD LAG which needs devcom registration
independent of the TC/representor initialization.

Signed-off-by: Shay Drory <shayd@xxxxxxxxxx>
Reviewed-by: Mark Bloch <mbloch@xxxxxxxxxx>
Signed-off-by: Tariq Toukan <tariqt@xxxxxxxxxx>
---
.../net/ethernet/mellanox/mlx5/core/en_tc.c | 20 -------------------
.../mellanox/mlx5/core/eswitch_offloads.c | 6 ++++++
2 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index a9001d1c902f..3846c16c3138 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -5394,8 +5394,6 @@ int mlx5e_tc_esw_init(struct mlx5_rep_uplink_priv *uplink_priv)
{
const size_t sz_enc_opts = sizeof(struct tunnel_match_enc_opts);
u8 mapping_id[MLX5_SW_IMAGE_GUID_MAX_BYTES];
- struct mlx5_devcom_match_attr attr = {};
- struct netdev_phys_item_id ppid;
struct mlx5e_rep_priv *rpriv;
struct mapping_ctx *mapping;
struct mlx5_eswitch *esw;
@@ -5456,14 +5454,6 @@ int mlx5e_tc_esw_init(struct mlx5_rep_uplink_priv *uplink_priv)
goto err_action_counter;
}
- err = netif_get_port_parent_id(priv->netdev, &ppid, false);
- if (!err) {
- memcpy(&attr.key.buf, &ppid.id, ppid.id_len);
- attr.flags = MLX5_DEVCOM_MATCH_FLAGS_NS;
- attr.net = mlx5_core_net(esw->dev);
- mlx5_esw_offloads_devcom_init(esw, &attr);
- }
-
return 0;
err_action_counter:
@@ -5484,16 +5474,6 @@ int mlx5e_tc_esw_init(struct mlx5_rep_uplink_priv *uplink_priv)
void mlx5e_tc_esw_cleanup(struct mlx5_rep_uplink_priv *uplink_priv)
{
- struct mlx5e_rep_priv *rpriv;
- struct mlx5_eswitch *esw;
- struct mlx5e_priv *priv;
-
- rpriv = container_of(uplink_priv, struct mlx5e_rep_priv, uplink_priv);
- priv = netdev_priv(rpriv->netdev);
- esw = priv->mdev->priv.eswitch;
-
- mlx5_esw_offloads_devcom_cleanup(esw);
-
mlx5e_tc_tun_cleanup(uplink_priv->encap);
mapping_destroy(uplink_priv->tunnel_enc_opts_mapping);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 189be11c4c39..d9683d3ea0e7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -3866,6 +3866,7 @@ bool mlx5_esw_offloads_controller_valid(const struct mlx5_eswitch *esw, u32 cont
int esw_offloads_enable(struct mlx5_eswitch *esw)
{
u8 mapping_id[MLX5_SW_IMAGE_GUID_MAX_BYTES];
+ struct mlx5_devcom_match_attr attr = {};
struct mapping_ctx *reg_c0_obj_pool;
struct mlx5_vport *vport;
unsigned long i;
@@ -3926,6 +3927,10 @@ int esw_offloads_enable(struct mlx5_eswitch *esw)
if (err)
goto err_vports;
+ memcpy(attr.key.buf, mapping_id, id_len);
+ attr.flags = MLX5_DEVCOM_MATCH_FLAGS_NS;
+ attr.net = mlx5_core_net(esw->dev);
+ mlx5_esw_offloads_devcom_init(esw, &attr);
Sashiko.dev says:
"Does this code introduce a race condition by registering the VF representor
netdevices before the devcom component is initialized?
Because mlx5_eswitch_enable_pf_vf_vports() was called just before this block,
it registers the representor netdevices and emits RTM_NEWLINK events.
Userspace tools can then start asynchronously configuring the interfaces.
If userspace attempts to add a TC rule requiring peer flow programming (like
multipath or LAG) before mlx5_esw_offloads_devcom_init() completes, the driver
will attempt to access esw->devcom while it is still NULL. Functions like
mlx5_devcom_for_each_peer_begin() will return false, causing the TC rule
addition to fail with -ENODEV."

If devcom is NULL, is_peer_flow_needed() returns false, so no peer flow
is attempted and mlx5e_add_fdb_flow() returns success rather than
-ENODEV.
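
To make the argument concrete, here is a simplified stand-alone model of
that gating. It is a sketch only, not driver code: struct eswitch_model,
struct devcom_model, peer_flow_needed() and add_fdb_flow_model() are
hypothetical stand-ins for the real mlx5 types and for
is_peer_flow_needed() / mlx5e_add_fdb_flow(), and the num_peers field is
an assumption made purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the mlx5 definitions. */
struct devcom_model {
	int num_peers;
};

struct eswitch_model {
	struct devcom_model *devcom;	/* NULL until devcom init has run */
};

/* Models the gate: without a devcom component no peer flow is needed. */
bool peer_flow_needed(const struct eswitch_model *esw)
{
	return esw->devcom != NULL && esw->devcom->num_peers > 0;
}

/*
 * Models the add path: the local rule is always programmed, and peer
 * rules are programmed only when the gate says so. Returns 0 on success
 * and reports the number of peer rules through *peer_rules.
 */
int add_fdb_flow_model(const struct eswitch_model *esw, int *peer_rules)
{
	*peer_rules = 0;

	if (peer_flow_needed(esw))
		*peer_rules = esw->devcom->num_peers;

	return 0;
}

/* Usage: a rule added before devcom init succeeds with no peer rules. */
void check_null_devcom_case(void)
{
	struct eswitch_model esw = { .devcom = NULL };
	int peer_rules = -1;

	assert(add_fdb_flow_model(&esw, &peer_rules) == 0);
	assert(peer_rules == 0);
}
```

In this model a rule added in the window before devcom init is
programmed locally only; it does not fail.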


Sashiko.dev says:
"Can this unconditional devcom component initialization cause incorrect
pairing of independent devices if the system image GUID is unavailable?
Previously, netif_get_port_parent_id() would fail and safely bypass devcom
registration if the device lacked a valid hardware GUID. Now,
mlx5_query_nic_sw_system_image_guid() is called earlier in this function
and populates mapping_id and id_len.
If the device lacks a valid hardware GUID (id_len == 0), attr.key.buf will
remain filled with zeros (due to the struct initialization). Unconditionally
registering the devcom component via mlx5_esw_offloads_devcom_init()
with this all-zero key could cause multiple independent devices on the
same host to falsely pair with each other."

mlx5_query_nic_sw_system_image_guid() can only fail with -ENOMEM or
on a FW command error. In both cases mlx5 would have failed earlier
and never reached mlx5_esw_offloads_devcom_init().
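
For reference, the key construction being discussed can be modelled in
isolation. This is a sketch under stated assumptions, not the driver's
code: struct match_attr_model, KEY_BUF_LEN and build_key_model() are
hypothetical, and the buffer size is arbitrary. It only shows that
copying id_len bytes into a zero-initialized attr leaves the key
all-zero when id_len is 0, which is why the question comes down to
whether this code can be reached without a valid GUID.

```c
#include <stddef.h>
#include <string.h>

#define KEY_BUF_LEN 32	/* hypothetical size, chosen for illustration */

/* Hypothetical stand-in for the devcom match attribute. */
struct match_attr_model {
	struct {
		unsigned char buf[KEY_BUF_LEN];
	} key;
	unsigned int flags;
};

/*
 * Zero the attr, copy id_len bytes of the id into the key buffer
 * (clamped to the buffer size) and return how many key bytes are
 * non-zero afterwards.
 */
size_t build_key_model(struct match_attr_model *attr,
		       const unsigned char *id, size_t id_len)
{
	size_t i, nonzero = 0;

	memset(attr, 0, sizeof(*attr));
	if (id_len > sizeof(attr->key.buf))
		id_len = sizeof(attr->key.buf);
	memcpy(attr->key.buf, id, id_len);

	for (i = 0; i < sizeof(attr->key.buf); i++)
		if (attr->key.buf[i])
			nonzero++;

	return nonzero;
}
```

With a non-empty id the key carries the id bytes followed by zero
padding; with id_len == 0 the key stays all-zero.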

return 0;
err_vports:
@@ -3970,6 +3975,7 @@ static int esw_offloads_stop(struct mlx5_eswitch *esw,
void esw_offloads_disable(struct mlx5_eswitch *esw)
{
+ mlx5_esw_offloads_devcom_cleanup(esw);
mlx5_eswitch_disable_pf_vf_vports(esw);
mlx5_esw_offloads_rep_unload(esw, MLX5_VPORT_UPLINK);
esw_set_passing_vport_metadata(esw, false);