Re: [PATCH net-next 3/3] net/mlx5: Apply devlink default eswitch mode during init

From: Jiri Pirko

Date: Wed May 27 2026 - 07:21:07 EST


Wed, May 27, 2026 at 09:03:26AM +0200, mbloch@xxxxxxxxxx wrote:
>
>
>On 27/05/2026 8:14, Jiri Pirko wrote:
>> Tue, May 26, 2026 at 07:13:46PM +0200, mbloch@xxxxxxxxxx wrote:
>>>
>>>
>>> On 26/05/2026 19:23, Jiri Pirko wrote:
>>>> Tue, May 26, 2026 at 05:03:57PM +0200, mbloch@xxxxxxxxxx wrote:
>>>>>
>>>>>
>>>>> On 26/05/2026 17:07, Jiri Pirko wrote:
>>>>>> Tue, May 26, 2026 at 11:44:46AM +0200, mbloch@xxxxxxxxxx wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 26/05/2026 10:44, Jiri Pirko wrote:
>>>>>>>> Thu, May 21, 2026 at 09:24:34AM +0200, tariqt@xxxxxxxxxx wrote:
>>>>>>>>> From: Mark Bloch <mbloch@xxxxxxxxxx>
>>>>>>>>>
>>>>>>>>> Apply devlink default eswitch mode for mlx5 devices after successful
>>>>>>>>> device initialization while holding the devlink instance lock.
>>>>>>>>>
>>>>>>>>> At this point the devlink instance is registered and the mlx5 devlink
>>>>>>>>> operations are available, so the default eswitch mode can be applied to
>>>>>>>>> the matching PCI devlink handle.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Mark Bloch <mbloch@xxxxxxxxxx>
>>>>>>>>> Reviewed-by: Shay Drori <shayd@xxxxxxxxxx>
>>>>>>>>> Reviewed-by: Moshe Shemesh <moshe@xxxxxxxxxx>
>>>>>>>>> Signed-off-by: Tariq Toukan <tariqt@xxxxxxxxxx>
>>>>>>>>> ---
>>>>>>>>> drivers/net/ethernet/mellanox/mlx5/core/main.c | 17 +++++++++++++++++
>>>>>>>>> 1 file changed, 17 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>>>>>>> index 0c6e4efe38c8..4528097f3d84 100644
>>>>>>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>>>>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>>>>>>>>> @@ -1391,6 +1391,21 @@ static void mlx5_unload(struct mlx5_core_dev *dev)
>>>>>>>>> mlx5_free_bfreg(dev, &dev->priv.bfreg);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> +static void mlx5_devl_apply_default_esw_mode(struct mlx5_core_dev *dev)
>>>>>>>>> +{
>>>>>>>>> + struct devlink *devlink = priv_to_devlink(dev);
>>>>>>>>> + int err;
>>>>>>>>> +
>>>>>>>>> + if (!MLX5_ESWITCH_MANAGER(dev))
>>>>>>>>> + return;
>>>>>>>>> +
>>>>>>>>> + devl_assert_locked(devlink);
>>>>>>>>> + err = devl_apply_default_esw_mode(devlink);
>>>>>>>>> + if (err)
>>>>>>>>> + mlx5_core_warn(dev, "Couldn't apply default eswitch mode, err %d\n",
>>>>>>>>> + err);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>>>>>>>>> {
>>>>>>>>> bool light_probe = mlx5_dev_is_lightweight(dev);
>>>>>>>>> @@ -1437,6 +1452,7 @@ int mlx5_init_one_devl_locked(struct mlx5_core_dev *dev)
>>>>>>>>> mlx5_core_err(dev, "mlx5_hwmon_dev_register failed with error code %d\n", err);
>>>>>>>>>
>>>>>>>>> mutex_unlock(&dev->intf_state_mutex);
>>>>>>>>> + mlx5_devl_apply_default_esw_mode(dev);
>>>>>>>>
>>>>>>>> I wonder how we can make this work for all. I mean, other driver would
>>>>>>>> silently ignore this command like arg, right? Any idea how to make all
>>>>>>>> drivers follow the arg from very beginning?
>>>>>>>>
>>>>>>>
>>>>>>> I have a follow-up series that adds the call to all drivers which support
>>>>>>> setting eswitch mode. When going over the other drivers, what I found is
>>>>>>> that the right point to apply the default is driver specific, drivers
>>>>>>> I have patch for:
>>>>>>>
>>>>>>> 46e16c6d9836 net: Apply devlink esw mode defaults
>>>>>>> ab4f54102ba9 bnxt_en: Apply devlink default eswitch mode during init
>>>>>>> b48cce1607bb liquidio: Apply devlink default eswitch mode during init
>>>>>>> 4ea54b0fe04a ice: Apply devlink default eswitch mode during init
>>>>>>> b7faddaa1c90 octeontx2-af: Apply devlink default eswitch mode during init
>>>>>>> 74b0c22c47b9 octeontx2-pf: Apply devlink default eswitch mode during init
>>>>>>> 5000e4c3d768 nfp: Apply devlink default eswitch mode during init
>>>>>>> 97a218e95e41 netdevsim: Apply devlink default eswitch mode during init
>>>>>>>
>>>>>>> I don't think doing this generically from devlink is realistic. devlink
>>>>>>> doesn't really know when a given driver is ready to change eswitch mode.
>>>>>>> Some drivers need SR-IOV state, representor setup, or other init pieces to
>>>>>>> be ready first, and the locking is not identical across drivers either.
>>>>>>
>>>>>>
>>>>>> Low hanging fruit would be just to call ops->eswitch_mode_set at the end
>>>>>> of register. Multiple reasons:
>>>>>>
>>>>>> 1) end of devl_register is exactly the point userspace is free to issue
>>>>>> the eswitch mode set. Driver should be ready to handle it.
>>>>>> 2) all drivers would transparently get this functionality, without
>>>>>> actually knowing this kernel command line arg ever existed, without
>>>>>> odd wiring call of related exported function. I prefer that stongly.
>>>>>> 3) you should add a there warning for the case this arg is passed yet
>>>>>> the driver does not implement eswitch_mode_set. User should
>>>>>> get a feedback like this, not silent ignore.
>>>>>>
>>>>>> The only loose end is see it the void return of devl_register().
>>>>>> Multiple ways to handle the possibly failed eswitch_mode_set(). I would
>>>>>> probably just go for pr_warn, seems to be the most correct.
>>>>>>
>>>>>> Make sense?
>>>>>
>>>>> I see the point, but I don't think devl_register() (at least not the only place)
>>>>> is the right place.
>>>>>
>>>>> There is a small but important difference between userspace doing
>>>>> "devlink eswitch set" after register is done, and devlink core calling
>>>>> eswitch_mode_set() from inside the register flow.
>>>>>
>>>>> Some drivers call devlink_register() while holding the device lock.
>>>>> liquidio is one example. If devlink core calls ops->eswitch_mode_set() from
>>>>> there, we may start the full eswitch mode change while holding that lock.
>>>>> That mode change can create representors, register netdevs, take rtnl,
>>>>> allocate resources, etc. I don't think we want this to become an implicit
>>>>> side effect of devlink registration.
>>>>
>>>> I believe your AI may untagle liquidio locking :)
>>>
>>> I didn't try to solve that one with ai. Most drivers were fairly simple
>>> so I didn't use ai at all. bnxt was the one where I needed a bit of help :)
>>>
>>>>
>>>>
>>>>>
>>>>> For mlx5, the placement after intf_state_mutex is also intentional:
>>>>>
>>>>> mutex_unlock(&dev->intf_state_mutex);
>>>>> mlx5_devl_apply_default_esw_mode(dev);
>>>>>
>>>>> We can't call it while holding intf_state_mutex because the mode set path
>>>>> takes it internally, and switchdev mode may also create IB representors.
>>>>>
>>>>> Also, devl_register() only covers the first registration. The mlx5 call in
>>>>> mlx5_load_one_devl_locked() is for reload/fw reset recovery kind of flows.
>>>>> In those flows devlink is already registered, so devl_register() is not
>>>>> called again, but the driver state was rebuilt and we may need to apply the
>>>>> default again.
>>>>
>>>> Call it from reload too, right?
>>>
>>> Yes, that was my first thought: apply it from devl_register() for the first
>>> registration and from devlink_reload() after a successful DRIVER_REINIT.
>>>
>>> That covers the clean devlink reload path but....(see bellow)
>>>
>>>>
>>>>
>>>>>
>>>>> Same for reload, fw reset and pci recovery in general. If the driver tears
>>>>> down and rebuilds eswitch related state, the place to apply the default is
>>>>> in that driver's reinit flow, not in devl_register().
>>>>>
>>>>> When I went over the other drivers, the right place was not always the same
>>>>> as devlink registration. I'm not an expert in any of them, so I hope I got
>>>>> the details right, but for example octeontx2 AF needs sr-iov and the
>>>>> representor switch state to be initialized first. nfp can do it after
>>>>> app/vNIC init while the devlink lock is already held. liquidio should do it
>>>>> only after dropping the PCI device lock.
>>>>
>>>> Idk, perhaps do it from devlink_post_register_work of some kind? That
>>>> would allow you to have the same locking ordering as a userspace cal
>>> l.
>>>
>>> I thought about a workqueue too, it was actually my first idea.
>>>
>>> The problem is that then we race with userspace. In the mlx5 version here the
>>> default is applied while the devlink lock is still held, before userspace can
>>> come in and issue its own eswitch set. If we defer it to post-register work,
>>> the devlink instance is already visible and userspace can get there first
>>> and then we might change the user configuration.
>>
>> Figure that out and expose to user by setting xa_mark only after the
>> work is done? This is doable.
>
>I agree that if devlink can keep the instance hidden/unavailable until the
>post register work is done, that solves the initial userspace race.
>
>The other part is the reinit/recovery case. For that I think devlink core
>needs some explicit indication from the driver that the device is now in
>reinit. Something like (at least that's the code I had initially, but something
>along those lines):
>
>void devl_dev_reinit_begin(struct devlink *devlink);
>void devl_dev_reinit_end(struct devlink *devlink);
>void devl_dev_reinit_abort(struct devlink *devlink);
>
>The core can then mark the instance as temporarily unavailable/in reinit
>between begin/end, and the relevant lookup/dump paths, for example
>devlink_get_from_attrs_lock() and devlink_nl_inst_iter_dumpit(), can reject
>or skip it while reinit is in progress. devlink_reload() can probably mark
>this state by itself around DRIVER_REINIT.

I believe this is orthogonal to the problem you are trying to solve in
this patchset. Not sure why you bring it in to the conversation...


>
>Then mlx5 would look more or less like:
> devl_lock(devlink);
> devl_dev_reinit_begin(devlink);
> ret = mlx5_load_one_devl_locked(dev, recovery);
> if (!ret)
> devl_dev_reinit_end(devlink);
> else
> devl_dev_reinit_abort(devlink);
> devl_unlock(devlink);
>
>This gives devlink core a way to know that the devlink instance is registered,
>but should not be used by userspace at the moment. It also allows keeping the
>default/config apply logic in devlink instead of adding driver specific calls
>to apply it in each init path.
>
>But this still means the generic solution needs some driver help. Drivers need
>to register devlink at a point where the post-register default apply is safe,
>and full reinit paths need to be marked with this begin/end API.
>
>>
>>
>>>
>>> Also, the bigger issue for mlx5 is not only initial registration or devlink
>>> reload. Some recovery paths, pci resume, and fw reset flows rebuild the driver
>>> state without going through devlink at all. I did not find a clean way for
>>> devlink core to infer all those points by itself.
>>
>> If you don't obey current configuration for example in pci resume, it is
>> bug and you should fix it. All these flows should obey current eswitch
>> mode configuration.
>>
>
>I agree that the device should come back according
>to the intended high level policy. But I don't think full reinit can be treated
>as restoring the whole previous runtime state. There may be user created
>steering rules and other objects which the driver cannot keep or replay. Today
>full reinit brings the device back to a clean initialized state, and that is
>intentional.
>
>So the split I have in mind is:
>
>- full runtime state is not preserved across full reinit;
>- high level devlink policy/configuration should be applied when the device is
> initialized again;
>- the command line default should not blindly override a later explicit
> userspace eswitch mode selection.
>
>I am not against moving this into devlink core, and I am willing to work on it.
>
>But before I rework the series, I want to make sure we agree on the direction.
>As I see it, doing this cleanly needs a devlink state like "registered but
>unavailable/in reinit", plus driver annotations for the reinit paths.
>
>If this is not the direction you want, I prefer to know now rather than spend
>time on a version that will be rejected anyway.
>
>Mark
>
>>
>>>
>>> To handle that from devlink I would still need to add some api for the driver
>>> to tell devlink "I just reinitialized, apply the default now". but nce I had
>>> that driver call , it felt simpler and clearer to let the driver call
>>> the helper directly at the points where it knows eswitch mode is safe.
>>>
>>> I agree that handling all of this inside devlink would be the better option.
>>> I just couldn't make it work in a clean way.
>>>
>>> Mark
>>>
>>>>
>>>>>
>>>>> Mark
>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Also, since this knob is only about eswitch mode, I don't think we need to
>>>>>>> touch every devlink driver. Drivers that don't implement eswitch_mode_set()
>>>>>>> would just ignore it anyway. The follow-up only wires the default into
>>>>>>> drivers that actually support changing eswitch mode.
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>>>
>>>>>>>>> return 0;
>>>>>>>>>
>>>>>>>>> err_register:
>>>>>>>>> @@ -1538,6 +1554,7 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)
>>>>>>>>> goto err_attach;
>>>>>>>>>
>>>>>>>>> mutex_unlock(&dev->intf_state_mutex);
>>>>>>>>> + mlx5_devl_apply_default_esw_mode(dev);
>>>>>>>>> return 0;
>>>>>>>>>
>>>>>>>>> err_attach:
>>>>>>>>> --
>>>>>>>>> 2.44.0
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>