Re: [RFC net-next 0/4] devlink: Add boot-time defaults

From: Jiri Pirko

Date: Sat May 09 2026 - 03:02:00 EST


Sat, May 09, 2026 at 02:52:13AM +0200, kuba@xxxxxxxxxx wrote:
>On Fri, 8 May 2026 20:07:44 +0200 Jiri Pirko wrote:
>> >I don't think switchdev by default should mean CX4+ in general. If we get
>> >there, I would expect it to be limited to the DPU/BlueField/ECPF case, where
>> >the host PF probe path can depend on the ECPF reaching switchdev. Changing the
>> >default for regular host NIC deployments feels like a much larger compatibility
>> >change.
>>
>> We can't travel through time, but if from CX5 onwards the default were
>> switchdev, nobody would feel broken in terms of compatibility. That
>> is my point. Having "legacy" as default is simply wrong for newer NIC
>> generations. That is why it is called "legacy" and it should have been
>> phased out since CX4 times.
>
>legacy vs switchdev only describes the eswitch configuration.
>As a non-SR-IOV user I really don't want to see the extra representors
>hanging around my systems, confusing all daemons. IIRC mlx5 had some
>limitations around the uplink representor. Maybe that's the disconnect.
>But for real, fully featured switchdev eswitches, having the
>PHY and PF representors on boot, always, will not make sense.

As "a non-SR-IOV user", what extra representors are you talking about?
When you have PFs only, you don't have anything extra. Just one netdev
per PF, one devlink port per PF. What's extra about it? When you don't
have VFs/SFs, everything is the same:

c-220-136-220-218:~$ sudo devlink dev eswitch show pci/0000:08:00.0
pci/0000:08:00.0: mode switchdev inline-mode none encap-mode basic
c-220-136-220-218:~$ sudo devlink dev eswitch show pci/0000:08:00.1
pci/0000:08:00.1: mode legacy inline-mode none encap-mode basic
c-220-136-220-218:~$ devlink dev
pci/0000:08:00.0: index 0
  nested_devlink:
    auxiliary/mlx5_core.eth.0
devlink_index/1: index 1
  nested_devlink:
    pci/0000:08:00.0
    pci/0000:08:00.1
auxiliary/mlx5_core.eth.0: index 2
pci/0000:08:00.1: index 3
  nested_devlink:
    auxiliary/mlx5_core.eth.1
auxiliary/mlx5_core.eth.1: index 4
c-220-136-220-218:~$ devlink port
auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false
c-220-136-220-218:~$ ip link
...
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether b8:e9:24:f2:b7:6c brd ff:ff:ff:ff:ff:ff
    altname enp8s0f0np0
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether b8:e9:24:f2:b7:6d brd ff:ff:ff:ff:ff:ff
    altname enp8s0f1np1


>
>IOW it's not a question of the generation of the card but of
>the deployment type / use case.

I don't think so, not in the case of mlx5. The only difference is when
you work with SR-IOV: you use either the legacy way (ip vf) or the new
one. Same use case.
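
To illustrate the two ways, here is a rough sketch of setting a VF MAC
address either through the PF netdev or through devlink. The function
names, the VF index, the MAC address and the representor port handle are
made-up placeholders, not taken from the system above; only the
ip/devlink subcommands themselves are the stock ones.

```shell
# Sketch only: VF index, MAC and port handle passed in are placeholders.
run() {
    # Print the command when DRY_RUN=1, otherwise execute it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Legacy way: configure the VF through the PF netdev ("ip vf").
# usage: set_vf_mac_legacy PF_NETDEV VF_INDEX MAC
set_vf_mac_legacy() {
    run ip link set dev "$1" vf "$2" mac "$3"
}

# New way: put the eswitch into switchdev mode, then configure the
# function behind the VF's devlink port.
# usage: set_vf_mac_switchdev DEVLINK_DEV PORT_HANDLE MAC
set_vf_mac_switchdev() {
    run devlink dev eswitch set "$1" mode switchdev
    run devlink port function set "$2" hw_addr "$3"
}
```

Running it with DRY_RUN=1 only prints the commands, so it can be tried
without touching a device.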


>
>> >For the ASIC/NV bit: maybe technically possible, but it feels like the wrong
>> >layer. This is boot/deployment policy, not a persistent hardware property, and
>> >storing it in NV memory would make the state persist across kernels/hosts in a
>> >surprising way.
>>
>> Well, as any other nv config, it persists across kernels/hosts. Think
>> about it as "unbreak-my-not-legacy-device" bit.
>
>For most devices the switchdev mode does not change anything
>substantial about the device. It's purely a kernel / driver config.
>It changes what objects and default rules kernel / driver installs.
>So I don't get why it would make sense to flash into the device
>nvmem a Linux SW stack specific config.

I look at it from the perspective that from some CX generation onwards,
switchdev mode should be the default. So that is a device-based decision.
I believe as such it can optionally be permanently configured (nv config)
on older devices. Why not?

[...]