Re: [RFC net-next 0/4] devlink: Add boot-time defaults

From: Jiri Pirko

Date: Mon May 11 2026 - 04:08:55 EST

Sun, May 10, 2026 at 02:31:35PM +0200, mbloch@xxxxxxxxxx wrote:
>
>
>On 09/05/2026 10:01, Jiri Pirko wrote:
>> Sat, May 09, 2026 at 02:52:13AM +0200, kuba@xxxxxxxxxx wrote:
>>> On Fri, 8 May 2026 20:07:44 +0200 Jiri Pirko wrote:
>>>>> I don't think switchdev by default should mean CX4+ in general. If we get
>>>>> there, I would expect it to be limited to the DPU/BlueField/ECPF case, where
>>>>> the host PF probe path can depend on the ECPF reaching switchdev. Changing the
>>>>> default for regular host NIC deployments feels like a much larger compatibility
>>>>> change.
>>>>
>>>> We can't travel throught time, but if from CX5 onwards the default would
>>>> be switchdev, nobody would feel broken in terms of compatibility. That
>>>> is my point. Having "legacy" as default is simply wrong for never NIC
>>>> generations. That is why it is called "legacy" and it should have been
>>>> rotten through and out since CX4 times.
>>>
>>> legacy vs switchdev only describes the eswitch configuration.
>>> As a non-SR-IOV user I really don't want to see the extra representors
>>> hanging around my systems, confusing all daemons. IIRC mlx5 had some
>>> limitations around the uplink representor. Maybe that's the disconnect.
>>> But for a real, fully featured switchdev eswitches having the
>>> PHY and PF representors on boot, always, will not make sense.
>>
>> As "a non-SR-IOV user", what extra representors you talk about? When you
>> have pfs only, you don't have anything extra. Just 1 netdev per-pf, one
>> devlink port per-pf. What's extra about it? When you don't have VFs/SFs.
>> Everyhing is the same:
>
>The netdev list looking similar is a bit misleading. What matters here is
>not only how many netdevs show up, but what that netdev actually is.
>
>In legacy mode, a PF only user can just use the PF netdev as a regular NIC
>and use ROCE on it directly.

I don't see why we have this limitation. Sounds more like a bug to me.
The netdev is still the same, capable of the same things no matter in
which mode you have it. RoCE should work on it in both modes.

>
>In switchdev mode, even if there are no VFs or SFs yet, the PF is moved into
>the switchdev model and the visible netdev is the uplink representor. That is
>not the same thing from a user point of view. The uplink representor is not a
>ROCE capable endpoint. So a user who used to boot the machine and use ROCE on
>the PF now has to create a VF or SF, use that as the roce endpoint, and also
>set up the switchdev forwarding path with tc, bridge or OVS so traffic from
>that function actually reaches the wire.
>
>That is why I don't think this is only a card generation question. It changes
>the deployment model. It may be the right default for BlueField/ECPF style
>systems, where the host is expected to sit behind a switchdev control plane,
>but it is not a safe default for every regular host NIC setup.

Yeah, the point is, not to change deployment model. The legacy/switchdev
should only change behaviour for sriov/eswitch usecase. The rest
(PF/uplink netdev and related objects) should stay the same.

>
>>
>> c-220-136-220-218:~$ sudo devlink dev eswitch show pci/0000:08:00.0
>> pci/0000:08:00.0: mode switchdev inline-mode none encap-mode basic
>> c-220-136-220-218:~$ sudo devlink dev eswitch show pci/0000:08:00.1
>> pci/0000:08:00.1: mode legacy inline-mode none encap-mode basic
>> c-220-136-220-218:~$ devlink dev
>> pci/0000:08:00.0: index 0
>> nested_devlink:
>> auxiliary/mlx5_core.eth.0
>> devlink_index/1: index 1
>> nested_devlink:
>> pci/0000:08:00.0
>> pci/0000:08:00.1
>> auxiliary/mlx5_core.eth.0: index 2
>> pci/0000:08:00.1: index 3
>> nested_devlink:
>> auxiliary/mlx5_core.eth.1
>> auxiliary/mlx5_core.eth.1: index 4
>> c-220-136-220-218:~$ devlink port
>> auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
>> auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false
>> c-220-136-220-218:~$ ip link
>> ...
>> 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>> link/ether b8:e9:24:f2:b7:6c brd ff:ff:ff:ff:ff:ff
>> altname enp8s0f0np0
>> 5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>> link/ether b8:e9:24:f2:b7:6d brd ff:ff:ff:ff:ff:ff
>> altname enp8s0f1np1
>>
>>
>>>
>>> IOW it's not a question of the generation of the card but of
>>> the deployment type / use case.
>>
>> I don't think so, not in the case of mlx5. The difference is only when
>> you work with sr-iov, you either use legacy way (ip vf) or the new one.
>> Same usecase.
>>
>>
>>>
>>>>> For the ASIC/NV bit: maybe technically possible, but it feels like the wrong
>>>>> layer. This is boot/deployment policy, not a persistent hardware property, and
>>>>> storing it in NV memory would make the state persist across kernels/hosts in a
>>>>> surprising way.
>>>>
>>>> Well, as any other nv config, it persists across kernels/hosts. Think
>>>> about it as "unbreak-my-not-legacy-device" bit.
>>>
>>> For most devices the switchdev mode does not change anything
>>> substantial about the device. It's purely a kernel / driver config.
>>> It changes what objects and default rules kernel / driver installs.
>>> So I don't get why it would make sense to flash into the device
>>> nvmem a Linux SW stack specific config.
>>
>> I look at it from the perspective that from some CX generation,
>> switchdev mode should be default. So that is a device-based decision.
>> I believe as such it can optionally be permanenty configured (nv config)
>> on older device. Why not?
>
>This is a deployment policy decision, not a permanent property of the card.
>The same adapter can be used in a regular host/RDMA setup or in a
>switchdev/offload setup. If we store this in NVM, that Linux switchdev policy
>follows the device across hosts, kernels and use cases, and can surprise the
>next deployment that just expects a normal NIC.

Yeah, from my perspective, there should be not surprise/behaviour_change
for non-sriov/eswitch user. Then switchdev can be default and everyone
is happy. Why to complicate things?

>
>I'll send another RFC v2 with support limited to:
>devlink=[...]:esw:mode:{ switchdev | switchdev_inactive | legacy }
>and let's see where we land with that.
>
>I still think a small kernel command line knob is the cleanest way to get to
>"switchdev by default" without making the interface too broad. For more
>complex boot-time configuration, I agree that a devlinkd or similar userspace
>path is probably the better direction.
>
>The "pause probing until userspace configures devlink" idea feels less clear
>to me. It is not quite the simple boot policy knob, and not quite the full
>userspace policy manager either. It would add a new probe state and require
>early userspace orchestration before the device is fully materialized. At
>least for now, I would prefer either the small cmdline option for the simple
>global/default case, or a proper devlinkd-like solution for more complex
>policy. Between those, I still prefer the cmdline option for this specific
>early eswitch mode default.
>
>Mark
>
>>
>> [...]
>