Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

From: Aron Silverton
Date: Tue Dec 05 2023 - 12:11:31 EST


Hi Jakub,

On Mon, Dec 04, 2023 at 06:52:10PM -0800, Jakub Kicinski wrote:
> On Mon, 4 Dec 2023 15:37:56 -0600 Aron Silverton wrote:
> > > To Jakub's point, no, we don't care about enterprise distros, they are a
> > > consumer of our releases and while some of them pay the salaries of our
> > > developers, they really don't have much influence over our development
> > > rules as they are just so far behind in releases that their releases
> > > look nothing like what we do in places (i.e. Linux "like" just like many
> > > Android systems.)
> > >
> > > If the enterprise distros pop in here and give us their opinions of the
> > > patchset, I would GREATLY appreciate it, as having more people review
> > > code at this point in time would be most helpful instead of having to
> > > hear about how the interfaces are broken 2 years from now.
> >
> > We will be happy to test and review v4 of this series.
> >
> > Fully interactive debugging has become essential to getting to the root
> > cause of complex issues that arise between the types of devices being
> > discussed and their interactions with various software layers. Turning
> > knobs and dumping registers just isn't sufficient, and I wish we'd had
> > this capability long ago.
>
> Could you shed some light on what issues you were debugging, broadly?
>

I'll do my best, with two recent instances:

1. As mentioned already, we recently faced a complex problem with RDMA
in KVM and were getting nowhere trying to debug using the usual methods.
Mellanox support was able to use this debug interface to see what was
happening on the PCI bus and prove that the issue was caused by
corrupted PCIe transactions. This finally put the investigation on the
correct path. The debug interface was used consistently and extensively
to test theories about what was happening in the system and, ultimately,
allowed the problem to be solved.

2. We've faced RDMA issues related to lost EQ doorbells, requiring
complex debug, and ultimately root-caused as a defective CPU. Without
interactive access to the device allowing us to test theories like,
"what if we manually restart the EQ", we could not have proven this
definitively.

> > Our customers have already benefited from the interactive debugging
> > capability that these patches provide, but the full potential won't be
> > realized until this is upstream.
>
> Can you elaborate on why "full potential won't be realized until this
> is upstream"? The driver looks very easy to ship as a standalone module.

Firstly, We believe in working upstream and all of the advantages that
that brings to all the distros as well as to us and our customers.

Secondly, Our cloud business offers many types of machine instances,
some with bare metal/vfio mlx5 devices, that require customer driven
debug and we want our customers to have the freedom to choose which OS
they want to use.

Aron