Re: [PATCH rdma-next] RDMA/rdmavt: Decouple QP and SGE lists allocations

From: Jason Gunthorpe
Date: Fri May 14 2021 - 10:35:22 EST

On Fri, May 14, 2021 at 10:07:43AM -0400, Dennis Dalessandro wrote:
> > IMHO if hf1 has a performance need here it should chain a sub
> > allocation since promoting node awareness to the core code looks
> > not nice..
> That's part of what I want to understand. Why is it "not nice"? Is it
> because there is only 1 driver that needs it or something else.

Because node allocation is extremely situational. Currently the kernel
has some tunable automatic heuristic, overriding it should only be
done in cases where the code knows absolutely that a node is the
correct thing, for instance because an IRQ pinned to a specific node
is the main consumer of the data.

hfi1 might have some situation where putting the QP on the device's
node makes sense, while another driver might work better with the QP
on the user thread that owns it. Who knows, it depends on the access

How do I sort this out in a generic way without making a big mess?

And why are you so sure that node allocation is the right thing for
hfi?? The interrupts can be rebalanced by userspace and user threads
can be pinned to other nodes. Why is choosing the device node
unconditionally the right choice?

This feels like something that was designed to benifit a very
constrained use case and harm everything else.

> As far as chaining a sub allocation, I'm not sure I follow. Isn't that kinda
> what Leon is doing here? Or will do, in other words move the qp allocation
> to the core and leave the SGE allocation in the driver per node. I can't say
> for any certainty one way or the other this is OK. I just know it would
> really suck to end up with a performance regression for something that was
> easily avoided by not changing the code behavior. A regression in code that
> has been this way since day 1 would be really bad. I'd just really rather
> not take that chance.

It means put the data you know is performance sensitive in a struct
and then allocate that struct and related on the node that is
guarenteed to be touching that data. For instance if you have a pinned
workqueue or IRQ or something.

The core stuff in ib_qp is not performance sensitive and has no
obvious node affinity since it relates primarily to simple control

> I would love to be able to go back in our code reviews and bug tracking and
> tell you exactly why this line of code was changed to be per node.
> Unfortunately that level of information has not passed on to Cornelis.

Wow, that is remarkable