Re: [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init

From: Mark Bloch

Date: Tue May 26 2026 - 11:18:06 EST




On 26/05/2026 17:07, Jiri Pirko wrote:
> Tue, May 26, 2026 at 11:44:46AM +0200, mbloch@xxxxxxxxxx wrote:
>>
>>
>> On 26/05/2026 10:44, Jiri Pirko wrote:
>>> Thu, May 21, 2026 at 09:24:34AM +0200, tariqt@xxxxxxxxxx wrote:
>>>> From: Mark Bloch <mbloch@xxxxxxxxxx>
>>>>
>>>> Apply devlink default eswitch mode for mlx5 devices after successful
>>>> device initialization while holding the devlink instance lock.
>>>>
>>>> At this point the devlink instance is registered and the mlx5 devlink
>>>> operations are available, so the default eswitch mode can be applied to
>>>> the matching PCI devlink handle.
>>>>
>>>> Signed-off-by: Mark Bloch <mbloch@xxxxxxxxxx>
>>>> Reviewed-by: Shay Drori <shayd@xxxxxxxxxx>
>>>> Reviewed-by: Moshe Shemesh <moshe@xxxxxxxxxx>
>>>> Signed-off-by: Tariq Toukan <tariqt@xxxxxxxxxx>
>>>> ---
>>>> drivers/net/ethernet/mellanox/mlx5/core/main.c | 17 +++++++++++++++++
>>>> 1 file changed, 17 insertions(+)
>>>>
>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>> index 0c6e4efe38c8..4528097f3d84 100644
>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>> @@ -1391,6 +1391,21 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
>>>> mlx5_free_bfreg(dev, &dev->priv.bfreg);
>>>> }
>>>>
>>>> +static void mlx5_devl_apply_default_esw_mode(struct mlx5_core_dev *dev)
>>>> +{
>>>> + struct devlink *devlink = priv_to_devlink(dev);
>>>> + int err;
>>>> +
>>>> + if (!MLX5_ESWITCH_MANAGER(dev))
>>>> + return;
>>>> +
>>>> + devl_assert_locked(devlink);
>>>> + err = devl_apply_default_esw_mode(devlink);
>>>> + if (err)
>>>> + mlx5_core_warn(dev, "Couldn't apply default eswitch mode, err %d\n",
>>>> + err);
>>>> +}
>>>> +
>>>> int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>>>> {
>>>> bool light_probe = mlx5_dev_is_lightweight(dev);
>>>> @@ -1437,6 +1452,7 @@ int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>>>> mlx5_core_err(dev, "mlx5_hwmon_dev_register failed with error code %d\n", err);
>>>>
>>>> mutex_unlock(&dev->intf_state_mutex);
>>>> + mlx5_devl_apply_default_esw_mode(dev);
>>>
>>> I wonder how we can make this work for all. I mean, other driver would
>>> silently ignore this command like arg, right? Any idea how to make all
>>> drivers follow the arg from very beginning?
>>>
>>
>> I have a follow-up series that adds the call to all drivers which support
>> setting eswitch mode. When going over the other drivers, what I found is
>> that the right point to apply the default is driver specific, drivers
>> I have patch for:
>>
>> 46e16c6d9836 net: Apply devlink esw mode defaults
>> ab4f54102ba9 bnxt_en: Apply devlink default eswitch mode during init
>> b48cce1607bb liquidio: Apply devlink default eswitch mode during init
>> 4ea54b0fe04a ice: Apply devlink default eswitch mode during init
>> b7faddaa1c90 octeontx2-af: Apply devlink default eswitch mode during init
>> 74b0c22c47b9 octeontx2-pf: Apply devlink default eswitch mode during init
>> 5000e4c3d768 nfp: Apply devlink default eswitch mode during init
>> 97a218e95e41 netdevsim: Apply devlink default eswitch mode during init
>>
>> I don't think doing this generically from devlink is realistic. devlink
>> doesn't really know when a given driver is ready to change eswitch mode.
>> Some drivers need SR-IOV state, representor setup, or other init pieces to
>> be ready first, and the locking is not identical across drivers either.
>
>
> Low hanging fruit would be just to call ops->eswitch_mode_set at the end
> of register. Multiple reasons:
>
> 1) end of devl_register is exactly the point userspace is free to issue
> the eswitch mode set. Driver should be ready to handle it.
> 2) all drivers would transparently get this functionality, without
> actually knowing this kernel command line arg ever existed, without
> odd wiring call of related exported function. I prefer that stongly.
> 3) you should add a there warning for the case this arg is passed yet
> the driver does not implement eswitch_mode_set. User should
> get a feedback like this, not silent ignore.
>
> The only loose end is see it the void return of devl_register().
> Multiple ways to handle the possibly failed eswitch_mode_set(). I would
> probably just go for pr_warn, seems to be the most correct.
>
> Make sense?

I see the point, but I don't think devl_register() (at least not the only place)
is the right place.

There is a small but important difference between userspace doing
"devlink eswitch set" after register is done, and devlink core calling
eswitch_mode_set() from inside the register flow.

Some drivers call devlink_register() while holding the device lock.
liquidio is one example. If devlink core calls ops->eswitch_mode_set() from
there, we may start the full eswitch mode change while holding that lock.
That mode change can create representors, register netdevs, take rtnl,
allocate resources, etc. I don't think we want this to become an implicit
side effect of devlink registration.

For mlx5, the placement after intf_state_mutex is also intentional:

mutex_unlock(&dev->intf_state_mutex);
mlx5_devl_apply_default_esw_mode(dev);

We can't call it while holding intf_state_mutex because the mode set path
takes it internally, and switchdev mode may also create IB representors.

Also, devl_register() only covers the first registration. The mlx5 call in
mlx5_load_one_devl_locked() is for reload/fw reset recovery kind of flows.
In those flows devlink is already registered, so devl_register() is not
called again, but the driver state was rebuilt and we may need to apply the
default again.

Same for reload, fw reset and pci recovery in general. If the driver tears
down and rebuilds eswitch related state, the place to apply the default is
in that driver's reinit flow, not in devl_register().

When I went over the other drivers, the right place was not always the same
as devlink registration. I'm not an expert in any of them, so I hope I got
the details right, but for example octeontx2 AF needs sr-iov and the
representor switch state to be initialized first. nfp can do it after
app/vNIC init while the devlink lock is already held. liquidio should do it
only after dropping the PCI device lock.

Mark

>
>
>>
>> Also, since this knob is only about eswitch mode, I don't think we need to
>> touch every devlink driver. Drivers that don't implement eswitch_mode_set()
>> would just ignore it anyway. The follow-up only wires the default into
>> drivers that actually support changing eswitch mode.
>>
>> Mark
>>
>>>
>>>> return 0;
>>>>
>>>> err_register:
>>>> @@ -1538,6 +1554,7 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)
>>>> goto err_attach;
>>>>
>>>> mutex_unlock(&dev->intf_state_mutex);
>>>> + mlx5_devl_apply_default_esw_mode(dev);
>>>> return 0;
>>>>
>>>> err_attach:
>>>> --
>>>> 2.44.0
>>>>
>>