Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver

From: Williams, Dan J
Date: Wed Sep 16 2020 - 19:04:55 EST

On Wed, 2020-09-16 at 10:22 +0200, Greg Kroah-Hartman wrote:
> On Wed, Sep 16, 2020 at 11:02:39AM +0300, Oded Gabbay wrote:
> > On Wed, Sep 16, 2020 at 10:41 AM Greg Kroah-Hartman
> > wrote:
> > > On Wed, Sep 16, 2020 at 09:36:23AM +0300, Oded Gabbay wrote:
> > > > On Wed, Sep 16, 2020 at 9:25 AM Greg Kroah-Hartman
> > > > wrote:
> > > > > On Tue, Sep 15, 2020 at 11:49:12PM +0300, Oded Gabbay wrote:
> > > > > > On Tue, Sep 15, 2020 at 11:42 PM David Miller
> > > > > > wrote:
> > > > > > > From: Oded Gabbay
> > > > > > > Date: Tue, 15 Sep 2020 20:10:08 +0300
> > > > > > >
> > > > > > > > This is the second version of the patch-set to upstream
> > > > > > > > the GAUDI NIC code
> > > > > > > > into the habanalabs driver.
> > > > > > > >
> > > > > > > > The only modification from v2 is in the ethtool patch
> > > > > > > > (patch 12). Details
> > > > > > > > are in that patch's commit message.
> > > > > > > >
> > > > > > > > Link to v2 cover letter:
> > > > > > > >
> > > > > > >
> > > > > > > I agree with Jakub, this driver definitely can't go-in as
> > > > > > > it is currently
> > > > > > > structured and designed.
> > > > > > Why is that ?
> > > > > > Can you please point to the things that bother you or not
> > > > > > working correctly?
> > > > > > I can't really fix the driver if I don't know what's wrong.
> > > > > >
> > > > > > In addition, please read my reply to Jakub with the
> > > > > > explanation of why
> > > > > > we designed this driver as is.
> > > > > >
> > > > > > And because of the RDMA'ness of it, the RDMA
> > > > > > > folks have to be CC:'d and have a chance to review this.
> > > > > > As I said to Jakub, the driver doesn't use the RDMA
> > > > > > infrastructure in
> > > > > > the kernel and we can't connect to it due to the lack of
> > > > > > H/W support
> > > > > > we have
> > > > > > Therefore, I don't see why we need to CC linux-rdma.
> > > > > > I understood why Greg asked me to CC you because we do
> > > > > > connect to the
> > > > > > netdev and standard eth infrastructure, but regarding the
> > > > > > RDMA, it's
> > > > > > not really the same.
> > > > >
> > > > > Ok, to do this "right" it needs to be split up into separate
> > > > > drivers,
> > > > > hopefully using the "virtual bus" code that some day Intel
> > > > > will resubmit
> > > > > again that will solve this issue.
> > > > Hi Greg,
> > > > Can I suggest an alternative for the short/medium term ?
> > > >
> > > > In an earlier email, Jakub said:
> > > > "Is it not possible to move the files and still build them into
> > > > a single
> > > > module?"
> > > >
> > > > I thought maybe that's a good way to progress here ?
> > >
> > > Cross-directory builds of a single module are crazy. Yes, they
> > > work,
> > > but really, that's a mess, and would never suggest doing that.
> > >
> > > > First, split the content to Ethernet and RDMA.
> > > > Then move the Ethernet part to drivers/net but build it as part
> > > > of
> > > > habanalabs.ko.
> > > > Regarding the RDMA code, upstream/review it in a different
> > > > patch-set
> > > > (maybe they will want me to put the files elsewhere).
> > > >
> > > > What do you think ?
> > >
> > > I think you are asking for more work there than just splitting
> > > out into
> > > separate modules :)
> > >
> > > thanks,
> > >
> > > greg k-h
> > Hi Greg,
> >
> > If cross-directory building is out of the question, what about
> > splitting into separate modules ? And use cross-module
> > notifiers/calls
> > ? I did that with amdkfd and amdgpu/radeon a couple of years back.
> > It
> > worked (that's the best thing I can say about it).
> That's fine with me.
> > The main problem with this "virtual bus" thing is that I'm not
> > familiar with it at all and from my experience I imagine it would
> > take
> > a considerable time and effort to upstream this infrastructure
> > work.
> It shouldn't be taking that long, but for some unknown reason, the
> original author of that code is sitting on it and not resending
> it. Go
> poke them through internal Intel channels to find out what the
> problem
> is, as I have no clue why a 200-300 line bus module is taking so long
> to
> get "right" :(

It turns out that they were caught between being deeply respectful of
your request to get another senior kernel developer to look at it
before sending it out, and deeply respectful of not disclosing that I
was out on bonding leave.

It just happened that I left before they could
get the latest version over to review.

> I'm _ALMOST_ at the point where I would just do that work myself, but
> due to my current status with Intel, I'll let them do it as I have
> enough other things on my plate...

I'm back now, let's get this thing moving. /me goes to review.