Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

From: Christian König
Date: Fri May 04 2018 - 10:28:00 EST


Am 03.05.2018 um 20:43 schrieb Logan Gunthorpe:

On 03/05/18 11:29 AM, Christian König wrote:
Ok, that is the point where I'm stuck. Why do we need that in one
function call in the PCIe subsystem?

The problem at least with GPUs is that we seriously don't have that
information here, cause the PCI subsystem might not be aware of all the
interconnections.

For example it isn't uncommon to put multiple GPUs on one board. To the
PCI subsystem that looks like separate devices, but in reality all GPUs
are interconnected and can access each other's memory directly without
going over the PCIe bus.

I seriously don't want to model that in the PCI subsystem, but rather
the driver. That's why it feels like a mistake to me to push all that
into the PCI function.
Huh? I'm lost. If you have a bunch of PCI devices you can send them as a
list to this API, if you want. If the driver is _sure_ they are all the
same, you only have to send one. In your terminology, you'd just have to
call the interface with:

pci_p2pdma_distance(target, [initiator, target])

Ok, I expected that something like that would do it.

So just to confirm: When I have a bunch of GPUs which could be the initiator, I only need to do "pci_p2pdma_distance(target, [first GPU, target]);" and not "pci_p2pdma_distance(target, [first GPU, second GPU, third GPU, fourth...., target])"?

Why can't we model that as two separate transactions?
You could, but this is more convenient for users of the API that need to
deal with multiple devices (and manage devices that may be added or
removed at any time).

Are you sure that this is more convenient? At least at first glance it feels overly complicated.

I mean what's the difference between the two approaches?

    sum = pci_p2pdma_distance(target, [A, B, C, target]);

and

    sum = pci_p2pdma_distance(target, A);
    sum += pci_p2pdma_distance(target, B);
    sum += pci_p2pdma_distance(target, C);
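
To make the comparison concrete, here is a rough user-space sketch of both approaches. Everything in it is an assumption for illustration only: struct toy_dev, example_topo, toy_distance() and toy_distance_many() are hypothetical names, the hop-count metric is a toy model, and reporting an unreachable pair as -1 is a convention I picked. It is not the interface or the implementation from the patch set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy topology: each node stores the index of its upstream
 * bridge, or -1 for a root port. Not the kernel's data structure. */
struct toy_dev {
    int parent;
};

/* Example layout (assumed):
 *   0: root port      1: switch below 0
 *   2, 3, 4: endpoints below switch 1 (2 is the target)
 *   5: second root port    6: endpoint below 5 */
static const struct toy_dev example_topo[] = {
    { -1 }, { 0 }, { 1 }, { 1 }, { 1 }, { -1 }, { 5 },
};

/* Single-pair form: hops from a up to the nearest common upstream node
 * plus hops from b up to it. Returns -1 if no upstream node is shared. */
static int toy_distance(const struct toy_dev *topo, int a, int b)
{
    int da = 0;

    for (int x = a; x != -1; x = topo[x].parent, da++) {
        int db = 0;

        for (int y = b; y != -1; y = topo[y].parent, db++) {
            if (x == y)
                return da + db;
        }
    }
    return -1;
}

/* List form: one call over all clients. It fails as a whole as soon as
 * one client is unreachable, instead of folding -1 into the sum. */
static int toy_distance_many(const struct toy_dev *topo, int target,
                             const int *clients, size_t n)
{
    int sum = 0;

    assert(clients != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        int d = toy_distance(topo, target, clients[i]);

        if (d < 0)
            return -1;
        sum += d;
    }
    return sum;
}
```

Under these assumptions both approaches give the same sum as long as every client is reachable. They only differ in where the error handling lives: with the list form the callee checks each pair, with separate calls the caller has to check every return value before adding it up.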

Yeah, same for me. If Bjorn is ok with those specialized NVM functions,
then I'm fine with that as well.

I think it would just be more convenient when we can come up with
functions which can handle all use cases, cause there still seems to be
a lot of similarities.
The way it's implemented is more general and can handle all use cases.
You are arguing for a function that can handle your case (albeit with a
bit more fuss) but can't handle mine and is therefore less general.
Calling my interface specialized is wrong.

Well at the end of the day you only need to convince Bjorn of the interface, so I'm perfectly fine with it as long as it serves my use case as well :)

But I still would like to understand your intention, cause that really helps not to accidentally break something in the long term.

Now when I take a look at the pure PCI hardware level, what I have is a transaction between an initiator and a target, and not multiple devices in one operation.

I mean you must have a very good reason for wanting to deal with multiple devices in the software layer, but that reason is not obvious to me from either the code or your explanation.

Thanks,
Christian.


Logan