Re: [PATCH net-next 00/13] net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 1/2

From: Jacob Keller

Date: Wed May 27 2026 - 18:09:04 EST


On 5/27/2026 5:54 AM, Tariq Toukan wrote:
> Hi,
>
> This series enables Socket Direct single netdev to operate in switchdev
> mode with shared FDB. See detailed feature description by Shay below.
>
> Regards,
> Tariq
>
>
> This series enables Socket Direct single netdev to operate in switchdev
> mode with shared FDB. SD single netdev combines multiple PCI functions
> behind a single netdev interface. To support switchdev offloads, these
> functions must participate in virtual LAG (shared FDB).
>
> Design
>
> Rather than introducing a separate LAG instance for SD, this series
> integrates SD secondary devices into the existing LAG structure
> (priv.lag) created at probe time. Each lag_func entry carries a
> group_id field that identifies its SD group membership (0 means not
> part of any SD group). An xarray mark (XA_MARK_PORT) distinguishes
> physical port entries from SD secondaries, enabling a single unified
> iterator that filters by group:
>
> - MLX5_LAG_FILTER_PORTS: iterate port-level entries only (existing
> behavior, used by bonding, FW LAG commands, v2p_map)
> - MLX5_LAG_FILTER_ALL: iterate all devices including SD secondaries
> (used by MPESW shared FDB across all devices)
> - specific group_id: iterate only devices in that SD group (used by
> per-group SD shared FDB operations)
>
> Existing callers use mlx5_ldev_for_each() which maps to
> MLX5_LAG_FILTER_PORTS, preserving current behavior for non-SD
> configurations.
>
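
To check my reading of the three filter modes, here is a minimal userspace
sketch. It is not the driver code: the struct, the sentinel values and the
helper names are all hypothetical, and a plain array with an is_port flag
stands in for the xarray and XA_MARK_PORT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sentinels; the real MLX5_LAG_FILTER_* values may differ. */
#define FILTER_PORTS (-1)
#define FILTER_ALL   (-2)

struct lag_func_sketch {
	const char *name;
	int group_id;  /* 0 means not part of any SD group */
	bool is_port;  /* stands in for the XA_MARK_PORT xarray mark */
};

/* Example table: one SD group (id 1) with a primary and a secondary,
 * plus one plain port that is not in any SD group.
 */
static const struct lag_func_sketch example_tbl[] = {
	{ "pf0 (SD primary)",   1, true  },
	{ "pf0 (SD secondary)", 1, false },
	{ "pf1",                0, true  },
};

/* Return true if entry f is visited under the given filter. */
static bool filter_match(const struct lag_func_sketch *f, int filter)
{
	if (filter == FILTER_ALL)
		return true;
	if (filter == FILTER_PORTS)
		return f->is_port;
	/* Otherwise the filter is a specific, non-zero SD group_id. */
	return f->group_id != 0 && f->group_id == filter;
}

/* Count how many entries an iteration with this filter would visit. */
static size_t count_matches(const struct lag_func_sketch *tbl, size_t n,
			    int filter)
{
	size_t i, count = 0;

	assert(tbl != NULL);
	for (i = 0; i < n; i++)
		if (filter_match(&tbl[i], filter))
			count++;
	return count;
}
```

With the example table, FILTER_PORTS visits the two port entries, FILTER_ALL
visits all three, and a filter of 1 visits the primary and its secondary.
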
> Lifecycle and ownership
>
> The SD LAG lifecycle is tied to the SD group, not to bonding events:
>
> 1. At PCI probe, mlx5_lag_add_mdev() creates the LAG structure
> (priv.lag) for each LAG-capable PF. e.g.: SD primary devices
>
> 2. During mlx5_sd_init(), after the SD group is fully formed (primary
> and secondaries paired), sd_lag_init() registers the secondary
> devices into the primary's existing priv.lag by calling
> mlx5_ldev_add_mdev() with the SD group_id. The primary's lag_func
> also gets its group_id set. No separate LAG instance is created.
>
> 3. After all the devices in SD group transition to switchdev,
> mlx5_lag_shared_fdb_create() is invoked with the group_id to create
> a software-only shared FDB scoped to that SD group. This sets
> sd_fdb_active on all lag_func entries in the group. No FW LAG
> commands are issued since SD devices share the same physical port.
>
> 4. If MPESW (multi-port eswitch) is enabled on top of SD groups, the
> per-group SD shared FDB is torn down first, then MPESW shared FDB is
> created spanning all devices (ports + SD secondaries) using
> MLX5_LAG_FILTER_ALL. On MPESW disable, per-group SD shared FDB is
> restored.
>
> 5. On SD teardown (mlx5_sd_cleanup or device unbind), sd_lag_cleanup()
> removes secondaries from priv.lag and clears the primary's group_id.
> The LAG structure itself is not destroyed.
>
> The sd_fdb_active flag is set on all lag_func entries in a group (not
> just the primary), so any device can detect the SD shared FDB state
> during lag_disable_change teardown without needing to look up peer
> entries.
>
> SD shared FDB is a pure software construct -- unlike regular LAG modes
> (ROCE, SRIOV, MPESW), it does not issue FW create_lag/destroy_lag
> commands. The software vport LAG for SD is implemented via eswitch
> egress ACL bounce rules, managed by the IB layer through
> mlx5_eth_lag_init(). And the software LAG demux is implemented via
> steering rules that utilize new destination, VHCA_RX.
>

I appreciate the detailed description of the lifecycle and ownership. That
made it easier to follow the patches and understand the changes.
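
For reference, this is how I read the interaction between steps 3 and 4, as
a small userspace sketch. The types and function names are hypothetical and
only model the sd_fdb_active bookkeeping. The sketch also assumes that every
SD group had its shared FDB active before MPESW was enabled, which the cover
letter does not state explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FUNCS 8

struct func_state {
	int group_id;        /* 0 means not in an SD group */
	bool sd_fdb_active;  /* set on every entry of the group */
};

struct lag_state {
	struct func_state funcs[MAX_FUNCS];
	size_t nfuncs;
	bool mpesw_active;
};

/* Set or clear sd_fdb_active on all entries of one SD group. */
static void set_group_fdb(struct lag_state *lag, int group_id, bool active)
{
	size_t i;

	for (i = 0; i < lag->nfuncs; i++)
		if (lag->funcs[i].group_id == group_id)
			lag->funcs[i].sd_fdb_active = active;
}

/* Step 3: the whole group is in switchdev, create the per-group FDB. */
static void sd_shared_fdb_create(struct lag_state *lag, int group_id)
{
	assert(group_id != 0);
	set_group_fdb(lag, group_id, true);
}

/* Step 4: tear down every per-group SD FDB first, then enable MPESW. */
static void mpesw_enable(struct lag_state *lag)
{
	size_t i;

	for (i = 0; i < lag->nfuncs; i++)
		lag->funcs[i].sd_fdb_active = false;
	lag->mpesw_active = true;
}

/* Step 4, disable path: drop MPESW, then restore the per-group SD FDBs
 * (assumes they were all active before MPESW was enabled).
 */
static void mpesw_disable(struct lag_state *lag)
{
	size_t i;

	lag->mpesw_active = false;
	for (i = 0; i < lag->nfuncs; i++)
		if (lag->funcs[i].group_id != 0)
			lag->funcs[i].sd_fdb_active = true;
}
```

If that matches the intent, a device outside any SD group never has
sd_fdb_active set, and the flag is never set while MPESW is active.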

> Patches
>
> Infrastructure (patches 1, 5-6):
> - Factor out shared FDB code into a dedicated file
> - Extend lag_func with group_id and sd_fdb_active fields;
> add XA_MARK_PORT and unified iterator with group_id filter
> - Extend shared FDB API with group_id parameter
>
> E-Switch preparation (patches 2-3):
> - Align eswitch disable sequence ordering
> - Move devcom init from TC to eswitch layer
>
> SD group management (patches 4, 7-9):
> - Replace peer count check with direct peer lookup
> - Register SD secondaries in the existing LAG at SD init time
> - Block RoCE and VF LAG for SD devices
> - Block multipath LAG for SD devices
>
> Switchdev integration (patch 10):
> - Keep netdev resources local in switchdev mode
>
> Steering (patches 11-12):
> - Track peer flow slots with bitmap for selective peer flow deletion
> - Enable TC flow steering for SD LAG
>
> Enablement (patch 13):
> - Verify unique vhca_id count for cross-VHCA RQT
>

Labeling patch 13 as "enablement" is a bit confusing to me, since I had
trouble understanding how that patch "enables" Socket Direct. The subject
does say "part 1/2", so I am guessing that's addressed in part 2?

> Shay Drory (13):
> net/mlx5: LAG, factor out shared FDB code into dedicated file
> net/mlx5: E-Switch, align disable sequence with switchdev-to-legacy
> transition
> net/mlx5: E-Switch, move devcom init from TC to eswitch layer
> net/mlx5: LAG, replace peer count check with direct peer lookup
> net/mlx5: LAG, prepare for SD device integration
> net/mlx5: LAG, extend shared FDB API with group_id filter
> net/mlx5: SD, introduce Socket Direct LAG
> net/mlx5: LAG, block RoCE and VF LAG for SD devices
> net/mlx5: LAG, block multipath LAG for SD devices
> net/mlx5: SD, keep netdev resources on same PF in switchdev mode
> net/mlx5e: TC, track peer flow slots with bitmap
> net/mlx5e: TC, enable steering for SD LAG
> net/mlx5e: Verify unique vhca_id count instead of range
>
> .../net/ethernet/mellanox/mlx5/core/Makefile | 2 +-
> .../net/ethernet/mellanox/mlx5/core/en/rqt.c | 27 +-
> .../ethernet/mellanox/mlx5/core/en/tc_priv.h | 7 +
> .../net/ethernet/mellanox/mlx5/core/en_tc.c | 83 ++--
> .../net/ethernet/mellanox/mlx5/core/eswitch.h | 11 +-
> .../mellanox/mlx5/core/eswitch_offloads.c | 26 ++
> .../net/ethernet/mellanox/mlx5/core/lag/lag.c | 429 ++++++++++--------
> .../net/ethernet/mellanox/mlx5/core/lag/lag.h | 100 +++-
> .../net/ethernet/mellanox/mlx5/core/lag/mp.c | 4 +
> .../ethernet/mellanox/mlx5/core/lag/mpesw.c | 28 +-
> .../mellanox/mlx5/core/lag/shared_fdb.c | 233 ++++++++++
> .../net/ethernet/mellanox/mlx5/core/lib/sd.c | 227 +++++++--
> .../net/ethernet/mellanox/mlx5/core/lib/sd.h | 23 +
> .../net/ethernet/mellanox/mlx5/core/main.c | 3 +-
> 14 files changed, 914 insertions(+), 289 deletions(-)
> create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c
>
>
> base-commit: aa064a614efcfa4c300609d1f01134e99a12ad10