On 2/20/2025 5:45 PM, Jakub Kicinski wrote:
> On Wed, 19 Feb 2025 17:32:54 +0100 Przemek Kitszel wrote:
>> Add support for a whole-device devlink instance. It is intended as an
>> entity over all PF devices on a given physical device.
>>
>> In the case of the ice driver we have multiple PF devices (each with its
>> own devlink dev representation) that have separate driver instances
>> loaded. However, those still share lots of resources due to being on the
>> same HW. Examples include the PTP clock and the RSS LUT. Historically
>> such resources were assigned to PF0, but that was both unclear and not
>> working well. Now they are moved under struct ice_adapter, of which
>> there is just one instance per HW.
>>
>> This patch adds a devlink instance that corresponds to that ice_adapter,
>> to allow arbitration over resources (such as the RSS LUT) via it
>> (further in the series (RFC NOTE: stripped out so far)).
>> Thanks to Wojciech Drewek for the very nice naming of the devlink
>> instance:
>>   PF0:       pci/0000:00:18.0
>>   whole-dev: pci/0000:00:18
>> But I made this a param for now (the driver is free to pass just
>> "whole-dev").
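
If I'm reading the cover letter right, the adapter-level instance boils
down to something like the sketch below. This is only my rough reading,
not the actual patch: the two helpers and the adapter->devlink field are
names I made up, and I'm assuming the existing devl_nested_devlink_set()
helper is what produces the nested_devlink output quoted further down.

/* Rough sketch only, not the actual patch: one extra devlink instance
 * per struct ice_adapter (i.e. per physical device), with each PF's
 * existing devlink instance attached to it as a nested instance.
 */
#include <net/devlink.h>

static const struct devlink_ops ice_adapter_devlink_ops = {
	/* whole-device knobs (e.g. RSS LUT arbitration) would go here */
};

/* Would run once, when the first PF on the physical device probes.
 * How the "pci/0000:00:18" name gets derived is the interesting part
 * of the RFC and is not shown here.
 */
static int ice_adapter_devlink_register(struct ice_adapter *adapter,
					struct device *dev)
{
	struct devlink *dl;

	dl = devlink_alloc(&ice_adapter_devlink_ops, 0, dev);
	if (!dl)
		return -ENOMEM;

	adapter->devlink = dl;		/* hypothetical field */
	devlink_register(dl);
	return 0;
}

/* Each PF would call this after registering its own devlink instance,
 * recording the "nested_devlink:" relationship; assumes
 * devl_nested_devlink_set() with the instance lock held.
 */
static int ice_adapter_devlink_nest_pf(struct ice_adapter *adapter,
				       struct devlink *pf_devlink)
{
	int err;

	devl_lock(adapter->devlink);
	err = devl_nested_devlink_set(adapter->devlink, pf_devlink);
	devl_unlock(adapter->devlink);
	return err;
}
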
> Which only works nicely if you're talking about functions, not fully
> separate links :) When I was thinking about it a while back, my
> intuition was that we should have a single instance, just accessible
> under multiple names. But I'm not married to that direction if there
> are problems with it.
I would also prefer to see a single devlink instance + one port for each
function. I think that's the most natural fit for how devlink works, and
it gives us a natural entry point for "whole device" configuration. It
also limits the amount of duplicate data; for example, "devlink dev info"
currently reports the same data once for each function.
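
To make that concrete, a registration path on the shared instance could
look roughly like the sketch below. This is only an illustration built
on the existing devlink port API, not tested against the ice code, and
the whole_dev_dl/pf_id names are mine:

/* Not real ice code, just an illustration of "one devlink + one port
 * per PF": every PF registers a port on the shared whole-device
 * instance instead of owning its own devlink.
 */
#include <net/devlink.h>

static int ice_register_pf_port(struct devlink *whole_dev_dl,
				struct devlink_port *port,
				unsigned int pf_id)
{
	struct devlink_port_attrs attrs = {};
	int err;

	attrs.flavour = DEVLINK_PORT_FLAVOUR_PHYSICAL;
	attrs.phys.port_number = pf_id;
	devlink_port_attrs_set(port, &attrs);

	/* devl_* calls expect the instance lock to be held */
	devl_lock(whole_dev_dl);
	err = devl_port_register(whole_dev_dl, port, pf_id);
	devl_unlock(whole_dev_dl);
	return err;
}

With that layout, "devlink dev info" and similar device-wide data would
live once on the shared instance, while per-function state hangs off its
port.
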
The main things I think this causes problems for are:
1) PCIe direct assignment with IOV
This could be an issue in cases where someone assigns only one function
to a VM. The VM would only see one function, and the functions outside
the VM would not be able to interact with it. IMHO this is not a big
deal, as I think simply assigning the entire device to the VM is
preferable. We also already have this issue with ice_adapter, and we've
seen that we need that cross-function sharing in order to make the
device and software function properly. Assigning single functions does
not make much sense to me. In addition, there is SR-IOV if you want to
assign a portion of the device to a VM.
2) Locking may get complicated
If we have an entry point which needs to interact with ice_pf data, the
locking could get a little complicated, but I think this is also an
issue we can solve within ice_adapter, which is the natural place to put
whole-device functionality.
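
For example (just a sketch; the res_lock and rss_lut_share members below
do not exist today), a whole-device handler that only touches
adapter-level state never has to reach into a specific ice_pf or its
locks:

/* Hypothetical handler behind a whole-device knob (say, how the shared
 * RSS LUT is split).  Everything it touches lives in struct
 * ice_adapter, so an adapter-level lock is enough.
 */
#include <linux/mutex.h>

static int ice_adapter_set_rss_lut_share(struct ice_adapter *adapter,
					 u32 share)
{
	mutex_lock(&adapter->res_lock);		/* hypothetical member */
	adapter->rss_lut_share = share;		/* hypothetical member */
	mutex_unlock(&adapter->res_lock);

	/* Each PF re-reads adapter->rss_lut_share on its next reconfig,
	 * so this path never needs to grab any ice_pf lock.
	 */
	return 0;
}
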
I have also investigated in the past whether it was possible to make the
PCI bus subsystem wrap the functions together somehow to represent them
to the host as a sort of pseudo "single-function" device even though the
hardware is multi-function. This seemed like a natural way to prevent
direct assignment of anything less than the whole device, but I was
never able to figure out how to even start on such a path.
>> $ devlink dev # (Interesting part of output only)
>> pci/0000:af:00:
>>   nested_devlink:
>>     pci/0000:af:00.0
>>     pci/0000:af:00.1
>>     pci/0000:af:00.2
>>     pci/0000:af:00.3
>>     pci/0000:af:00.4
>>     pci/0000:af:00.5
>>     pci/0000:af:00.6
>>     pci/0000:af:00.7
> Could you go into more detail on what stays on the "nested" instances
> and what moves to the "whole-dev"? Jiri recently pointed out to y'all
> cases where stuff that should be a port attribute was an instance
> attribute.
> I suspect this is a case of "we have separate devlink instances per
> function, so we just put it in the devlink".