Re: [PATCH 00/15] Adding GAUDI NIC code to habanalabs driver

From: Florian Fainelli
Date: Thu Sep 10 2020 - 17:24:05 EST

On 9/10/2020 2:15 PM, Oded Gabbay wrote:
On Fri, Sep 11, 2020 at 12:05 AM Florian Fainelli <f.fainelli@xxxxxxxxx> wrote:



On 9/10/2020 1:32 PM, Oded Gabbay wrote:
On Thu, Sep 10, 2020 at 11:28 PM Jakub Kicinski <kuba@xxxxxxxxxx> wrote:

On Thu, 10 Sep 2020 23:16:22 +0300 Oded Gabbay wrote:
On Thu, Sep 10, 2020 at 11:01 PM Jakub Kicinski <kuba@xxxxxxxxxx> wrote:
On Thu, 10 Sep 2020 19:11:11 +0300 Oded Gabbay wrote:
create mode 100644 drivers/misc/habanalabs/gaudi/gaudi_nic.c
create mode 100644 drivers/misc/habanalabs/gaudi/gaudi_nic.h
create mode 100644 drivers/misc/habanalabs/gaudi/gaudi_nic_dcbnl.c
create mode 100644 drivers/misc/habanalabs/gaudi/gaudi_nic_debugfs.c
create mode 100644 drivers/misc/habanalabs/gaudi/gaudi_nic_ethtool.c
create mode 100644 drivers/misc/habanalabs/gaudi/gaudi_phy.c
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic0_qm0_masks.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic0_qm0_regs.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic0_qm1_regs.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic0_qpc0_masks.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic0_qpc0_regs.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic0_qpc1_regs.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic0_rxb_regs.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic0_rxe0_masks.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic0_rxe0_regs.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic0_rxe1_regs.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic0_stat_regs.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic0_tmr_regs.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic0_txe0_masks.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic0_txe0_regs.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic0_txe1_regs.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic0_txs0_masks.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic0_txs0_regs.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic0_txs1_regs.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic1_qm0_regs.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic1_qm1_regs.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic2_qm0_regs.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic2_qm1_regs.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic3_qm0_regs.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic3_qm1_regs.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic4_qm0_regs.h
create mode 100644 drivers/misc/habanalabs/include/gaudi/asic_reg/nic4_qm1_regs.h
create mode 100644 drivers/misc/habanalabs/include/hw_ip/nic/nic_general.h

The relevant code needs to live under drivers/net/(ethernet/).
For one thing, our automation won't trigger for drivers in a random
(/misc) part of the tree.

Can you please elaborate on how to do this with a single driver that
is already in misc?
As I mentioned in the cover letter, we are not developing a
stand-alone NIC. We have a deep-learning accelerator with a NIC
interface.
Therefore, we don't have a separate PCI physical function for the NIC
and I can't have a second driver registering to it.

Is it not possible to move the files and still build them into a single
module?
hmm...
I actually didn't try that as I thought it would be very strange and
I'm not familiar with other drivers that build as a single ko but have
files spread out in different subsystems.
I don't feel it is a better option than what we did here.

Will I need to split pull requests to different subsystem maintainers?
For the same driver?
Sounds to me this is not going to fly.

Not necessarily, you can post your patches to all relevant lists and
seek maintainer review/acked-by tags from the relevant maintainers. This
is not unheard of with mlx5 for instance.
Yeah, I see what you are saying, the problem is that sometimes,
because everything is tightly integrated in our SOC, the patches
contain code from common code (common to ALL our ASICs, even those that
don't have a NIC at all), GAUDI specific code which is not NIC related
and the NIC code itself.
But I guess that as a last resort if this is a *must* I can do that.
Though I would like to hear Greg's opinion on this as he is my current
maintainer.

Personally I do want to send relevant patches to netdev because I want
to get your expert reviews on them, but still keep the code in a
single location.

We do have network drivers sprinkled across the kernel tree already, but I would agree that from a networking maintainer perspective this makes auditing code harder: you would naturally grep for net/ and drivers/net and easily miss arch/um/, for instance. When you do treewide changes, having all your ducklings in the same pond is a lot easier.

There is a possible "risk" with posting a patch series for the habanalabs driver to netdev: people will wonder what this is about and completely miss that it is about the networking bits. If there is a NIC driver under drivers/net, then people will start to filter or pay attention based on the directory.

Have you considered using notifiers to get your NIC driver registered
while the NIC code lives in a different module?
Yes, and I preferred to keep it simple. I didn't want to start sending
notifications to the NIC driver every time, for example, I needed to
reset the SOC because a compute engine got stuck. Or vice-versa - when
some error happened in the NIC to start sending notifications to the
common driver.

In addition, from my AMD days, we had a very tough time managing two
drivers that "talk" to each other and manage the same H/W. I'm talking
about amdgpu for graphics and amdkfd for compute (of which I was the
maintainer). AMD has been working in the past years to unite those two
drivers to get out of that mess. That's why I didn't want to go down
that road.

You are trading an indirect call for a direct call. A notifier does provide a nice interface, but it can be challenging to work with, given that the context in which the notifier is called can be problematic. You could still have direct module references then, and that would avoid the need for notifiers.
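
To make the indirect versus direct distinction concrete, here is a minimal userspace sketch in plain C. It is not kernel code and not taken from the habanalabs driver: the event codes, struct notifier, chain_register(), chain_notify(), nic_handle_reset() and run_demo() are all hypothetical names made up for illustration, loosely modeled on how a callback chain works. In the indirect path the common code only knows about a chain of callbacks; in the direct path it calls the NIC handler by name, which is roughly what a direct module reference would give you.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event codes; not taken from the habanalabs driver. */
enum soc_event { SOC_EVENT_RESET = 1, SOC_EVENT_NIC_ERROR = 2 };

/* Simplified userspace stand-in for a notifier block (assumption). */
struct notifier {
	int (*call)(struct notifier *self, enum soc_event ev, void *data);
	struct notifier *next;
};

/* Singly linked chain of subscribers. */
struct notifier_chain {
	struct notifier *head;
};

/* Add a subscriber at the head of the chain. */
static void chain_register(struct notifier_chain *chain, struct notifier *n)
{
	assert(chain && n && n->call);
	n->next = chain->head;
	chain->head = n;
}

/* Indirect path: invoke every callback; returns how many were called. */
static int chain_notify(struct notifier_chain *chain, enum soc_event ev,
			void *data)
{
	int calls = 0;

	for (struct notifier *n = chain->head; n; n = n->next) {
		n->call(n, ev, data);
		calls++;
	}
	return calls;
}

/* State kept by the hypothetical NIC part. */
struct nic_state {
	int resets_seen;
};

/* Direct path: the common code calls this function by name. */
static void nic_handle_reset(struct nic_state *nic)
{
	nic->resets_seen++;
}

/* Notifier callback that forwards reset events to the same handler. */
static int nic_notifier_cb(struct notifier *self, enum soc_event ev,
			   void *data)
{
	(void)self;
	if (ev == SOC_EVENT_RESET)
		nic_handle_reset(data);
	return 0;
}

/* Exercise both paths once; returns the number of resets the NIC saw. */
int run_demo(void)
{
	struct nic_state nic = { 0 };
	struct notifier_chain chain = { NULL };
	struct notifier nb = { .call = nic_notifier_cb, .next = NULL };

	chain_register(&chain, &nb);
	chain_notify(&chain, SOC_EVENT_RESET, &nic);	/* indirect */
	nic_handle_reset(&nic);				/* direct */
	return nic.resets_seen;
}
```

Both paths end up in the same handler. What differs is who knows about whom, and, in a real kernel, which context the callback ends up running in. That last part is what this sketch cannot show.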

You are the driver maintainer, so you definitely have a bigger say in the matter than most of us drive-by contributors.
--
Florian