Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

From: David Ahern
Date: Thu Dec 07 2023 - 10:56:44 EST


On 12/5/23 9:48 PM, Jakub Kicinski wrote:
> On Tue, 5 Dec 2023 11:11:00 -0600 Aron Silverton wrote:
>> 1. As mentioned already, we recently faced a complex problem with RDMA
>> in KVM and were getting nowhere trying to debug using the usual methods.
>> Mellanox support was able to use this debug interface to see what was
>> happening on the PCI bus and prove that the issue was caused by
>> corrupted PCIe transactions. This finally put the investigation on the
>> correct path. The debug interface was used consistently and extensively
>> to test theories about what was happening in the system and, ultimately,
>> allowed the problem to be solved.
>
> You hit on an important point, and what is also my experience working
> at Meta. I may have even mentioned it in this thread already.
> If there is a serious issue with a complex device, there are two ways
> you can get support - dump all you can and send the dump to the vendor
> or get on a live debugging session with their engineers. Users' ability
> to debug those devices is practically non-existent. The idea that we
> need access to FW internals is predicated on the assumption that we
> have an ability to make sense of those internals.
>
> Once you're on a support call with the vendor - just load a custom
> kernel, module, whatever, it's already extremely expensive manual labor.

You rail against out of tree drivers and vendor proprietary tools, and
now you argue for just that. There is no reason debugging capabilities
cannot be built into the OS and used when needed. That means anything
needed - from kernel modules to userspace tools.

The Meta data point is not representative of the world at large -
different scale, different needs, different expertise on staff (OS and
H/W). Getting S/W installed (especially anything requiring a compiler)
in a production server (and VMs) is not an easy request and in many
cases not even possible.

When a customer hits a problem, the standard steps are to run a script,
generate a tar file and ship it to the OS vendor. Engineers at the OS
vendor go through it and may need other data - like getting detailed
dumps from individual pieces of H/W. Every time those requests require
going to a vendor web site to pull down vendor tools, get permission to
install them, schedule the run of said tool ... it only serves to drag
out the debugging process. I.e., this open-ended stance only serves to
hurt Linux users.