Re: [PATCH V3 5/5] misc: mlx5ctl: Add umem reg/unreg ioctl

From: Saeed Mahameed
Date: Tue Nov 21 2023 - 17:46:58 EST


On 21 Nov 14:18, David Ahern wrote:
On 11/21/23 1:04 PM, Saeed Mahameed wrote:
On 21 Nov 12:44, Jakub Kicinski wrote:
On Mon, 20 Nov 2023 23:06:19 -0800 Saeed Mahameed wrote:
high frequency diagnostic counters

So is it a debug driver or not a debug driver?


High frequency _diagnostic_ counters are a very useful tool for
debugging a high performance chip. So yes, this is for diagnostics/debug.

Because I'm pretty sure some people want to have access to high freq
counters in production, across their fleet. What's worse, David Ahern
has been pitching a way of exposing device counters which would be
common across netdev.

For context on the `what's worse ...` comment for those who have not
seen the netconf slides:
https://netdev.bots.linux.dev/netconf/2023/david.pdf

and I am having a hard time parsing Kuba's intent with that comment here
(knowing you did not like the pitch I made at netconf :-))



This is not netdev. This driver is meant to support ConnectX chips and
SoCs with any stack (netdev/rdma/vdpa/virtio), internal chip units and
acceleration engines, plus ARM core diagnostics in the case of
BlueField DPUs.
I am not looking to count netdev Ethernet packets in this driver.

I am also pretty sure David will want an interface to access counters
other than the netdev ones, to get more visibility into how a specific
chip is behaving.

Yes, and h/w counters were part of the proposal. One thought is to
leverage userspace registered memory with the device vs mapping BAR
space, but we have not moved beyond a theoretical discussion at this point.


Definite nack on this patch.

Based on what?

Is it a generic interface argument?


For this driver, the diagnostic counters are only a small part of the debug
utilities the driver provides, so it is not fair to nack this patch based
on one use case. We need this driver to also dump other things like
core dumps, FW contexts, internal objects, register dumps, resource dumps,
etc.

This patch's original purpose was to allow core dumps. Since a core dump
can take up to 2MB of memory, without this patch we won't have core dump
ability, which is more important for debugging than diagnostic counters.

You can find more here:
https://github.com/saeedtx/mlx5ctl#mlx5ctl-userspace-linux-debug-utilities-for-mlx5-connectx-devices

For diagnostic counters, we can continue the discussion about a generic
interface. I am all for it, but it is irrelevant to this submission.