Re: [PATCH V4 0/5] mlx5 ConnectX control misc driver

From: Jason Gunthorpe
Date: Tue Apr 02 2024 - 14:41:13 EST


On Tue, Apr 02, 2024 at 05:32:44PM +0100, Edward Cree wrote:
> On 26/03/2024 14:57, David Ahern wrote:
> > The proposal is an attempt at a common interface and common tooling to a
> > degree but independent of any specific subsystem of which many are
> > supported by the device.
>
> [ Let me prefix this by noting that I'm speaking personally here, and
> not representing the position of my employer. ]
>
> You can't have a "common interface" and yet be "independent" of anything
> that could give semantics to that interface. What you have is a common
> *transport*, used to connect to a *vendor* interface.
> If you can't even bring yourself to be honest about that, it's no wonder
> you're getting maintainer pushback.

I think this was covered in the doc I posted: it is unapologetically a
common transport.

I have some ideas for different "common interfaces" built in
userspace. I think others will have good ideas too.

It is 'independent [of subsystems]' because it is manipulating *the
device* not the subsystem software.

> Do we need to go all the way back to operating systems 101 and point out
> that one of the fundamental jobs of a kernel is to *abstract* the
> hardware, and provide *services* to userspace rather than mere devices?

Except that isn't even true! That is a very naive and simplistic view
of what an OS should do. If we took that view, tonnes of interfaces in
Linux would have to be purged.

A fundamental thing the OS exists to do is to arbitrate, secure and
multiplex access to hardware. There are many levels where people can
put a stick and say 'common interface', and it is not some 'wrong
architecture' for the OS to delegate the 'common interface' job to
userspace, or even say direct access is fine.

This is not an exclusive situation, fwctl doesn't mean future OS side
common interfaces are somehow blocked.

> Frankly, this whole thread reads to me like certain vendors whining that
> they weren't expecting to have to get their new features *reviewed* by
> upstream — possibly they thought devlink params would just get rubber-
> stamped — and now they're finding that the kernel's quality standards
> still apply.

Oh please. "quality standards" is not the issue here. This is a
philosophical disagreement on OS design and, as Jakub pointed out, a
second argument about who gets to have power in our community.

> Complaining that devlink params "don't scale" is disingenuous.

Saeed's remarks mean the review process doesn't scale. There are
something like 600-800 configurables in mlx5, I assume other devices
are similar. You can't fight over commonality bit by bit that many
times and ever get anywhere. Everyone will get exhausted well before
it is done.

> If all the configuration of these Complex Devices™ goes through fwctl
> backdoors, where exactly is anyone going to discover the commonalities
> to underlie the generic interfaces of the next generation? What would
> configuring plain vanilla netdevs be like today if, instead of a set of

It would be *exactly the same as today* because today everyone with
these devices uses the vendor tooling to configure them. Where is the
screaming? Where have the concrete demands for common interfaces on
some of the knobs been all these years? Where has keeping blessed
support out of the kernel got us?

If there was a real industry consensus to make commonality here it
would be done. In my view there is no industry will because it is not
actually an important problem that needs solving. When you buy this
kind of complex HW you need to ensure the flash is configured for your
site. You will work with the vendor and either get devices with flash
preconfigured, or you will work with the vendor to get a suitable FW
version and configurable list for your site. Then the job is simple,
userspace needs to confirm and fix the device to have the target flash
state.

The actual benefit of common names for the individual configuration
values is pretty tiny. Unless 100% of the configuration is covered by
common names it is not going to be meaningful in practice.

In my view there is a lot of benefit in having a common tool that takes
descriptions of devices' target flash states and then enforces
them. Something that is common to many devices and could work for
everything, not just some useless subset of argued-to-death-blessed
items.
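
As a minimal sketch of the "confirm and fix" job described above (not an
existing tool): compare a device's current settings against a target
description and write only what differs. The function names and the
knob_* settings are hypothetical placeholders, not real mlx5
configurables, and a plain dict stands in for the device.

```python
# Hypothetical sketch: a dict stands in for the device's flash config.
# The knob names and values are made up for illustration.

def plan_changes(current, target):
    """Return the settings that must be written so current matches target."""
    changes = {}
    for name, want in target.items():
        if current.get(name) != want:
            changes[name] = want
    return changes


def enforce(current, target):
    """Apply the planned changes to a copy of current.

    Returns (new_state, changes). Settings not named in target are left
    untouched.
    """
    changes = plan_changes(current, target)
    new_state = dict(current)
    new_state.update(changes)
    return new_state, changes


current = {"knob_a": 1, "knob_b": 0, "knob_c": 8}
target = {"knob_a": 1, "knob_b": 1}
new_state, changes = enforce(current, target)
print(changes)
```

Running enforce() a second time on the result plans no further changes,
which is the property a site-wide enforcement tool would want.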

> These commonalities are key to allowing a product category to
> mature.

Uh no. A lot of this configuration stuff is to serve niche customer
interests where the customers have no interest in publishing and
commonizing their method of operating because it would give away their
own operator secrets. The vendors created it because their big
customers demanded it.

e.g. there are configurables in mlx5 that exist *solely* to accommodate
certain customers' pre-existing SW.

Stuff is like this because the *industry* doesn't want to fix it.
Don't dump everything on the vendors like they are the source of the
problem.

And, as before, this message is talking about configuration and
devlink, but there are many more problems fwctl will solve with these
devices including debugging/etc that inherently can't be solved in an
abstract way.

> > It is obvious to everyone that in the AI era, everyone needs
> > customization
>
> It's always possible to argue that the New Thing is qualitatively
> different from anything that went before, that these "multibillion
> gate devices" need to be able to break the rules.

What supposed "rules"? Read my post again, this is well trodden ground
in the kernel. Delegation is fairly normal Linux design, and trying to
minimize the kernel footprint is also a well accepted design axiom.

Jason