Re: [PATCH 06/12] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()

From: Christoph Hellwig
Date: Mon Jan 08 2018 - 13:34:42 EST


On Mon, Jan 08, 2018 at 11:09:17AM -0700, Jason Gunthorpe wrote:
> > As usual we implement what actually has a consumer. On top of that the
> > R/W API is the only core RDMA API that actually does DMA mapping for the
> > ULP at the moment.
>
> Well again the same can be said for dma_map_page vs dma_map_sg...

I don't understand this comment.

>
> > For SENDs and everything else dma maps are done by the ULP (I'd like
> > to eventually change that, though - e.g. sends through that are
> > inline to the workqueue don't need a dma map to start with).
>
>
> > That's because the initial design was to let the ULPs do the DMA
> > mappings, which fundamentally is wrong. I've fixed it for the R/W
> > API when adding it, but no one has started work on SENDs and atomics.
>
> Well, you know why it is like this, and it is very complicated to
> unwind - the HW driver does not have enough information during CQ
> processing to properly do any unmaps, let alone serious error tear
> down unmaps, so we'd need a bunch of new APIs developed first, like RW
> did. :\

Yes, if it was trivial we would have done it already.

> > > And on that topic, does this scheme work with HFI?
> >
> > No, and I guess we need an opt-out. HFI generally seems to be
> > extremely weird.
>
> This series needs some kind of fix so HFI, QIB, rxe, etc don't get
> broken, and it shouldn't be 'fixed' at the RDMA level.

I don't think rxe is a problem as it won't show up as a PCI device.
HFI and QIB do show up as PCI devices, and could be used for P2P transfers
from the PCI point of view. It's just that they have a layer of
software indirection between their hardware and what is exposed at
the RDMA layer.

So I very much disagree about where to place that workaround - the
RDMA code is exactly the right place.

> > > This is why P2P must fit in to the common DMA framework somehow, we
> > > rely on these abstractions to work properly and fully in RDMA.
> >
> > Moving P2P up to common RDMA code isn't going to fix this. For that
> > we need to stop pretending that something that isn't DMA can abuse the
> > dma mapping framework, and until then opt them out of behavior that
> > assumes actual DMA like P2P.
>
> It could, if we had a DMA op for p2p then the drivers that provide
> their own ops can implement it appropriately or not at all.
>
> Eg the correct implementation for rxe to support p2p memory is
> probably somewhat straightforward.

But P2P is _not_ a factor of the dma_ops implementation at all,
it is something that happens behind the dma_map implementation.

Think about what the dma mapping routines do:

(a) translate from host address to bus addresses

and

(b) flush caches (in non-coherent architectures)

Both are obviously not needed for P2P transfers, as they never reach
the host.
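
To make the two steps concrete, here is a minimal userspace sketch. It is a
toy model, not kernel code: struct toy_dev, toy_map_host() and toy_map_p2p()
are hypothetical names invented for illustration, and the fixed bus offset
and the flush counter are simplifying assumptions. It also assumes the peer
address handed to the P2P path is already a bus address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: none of these names are kernel APIs. */
struct toy_dev {
	uint64_t bus_offset;	/* assumed fixed host-to-bus offset */
	bool coherent;		/* false: CPU caches need flushing */
	unsigned int flushes;	/* counts simulated cache flushes */
};

/* Host memory path: (b) flush if non-coherent, then (a) translate. */
static uint64_t toy_map_host(struct toy_dev *dev, uint64_t host_addr)
{
	assert(dev);
	if (!dev->coherent)
		dev->flushes++;		/* stand-in for a cache flush */
	return host_addr + dev->bus_offset;
}

/*
 * P2P path: the transfer never reaches host memory, so in this model
 * there is nothing to translate and nothing to flush.
 */
static uint64_t toy_map_p2p(struct toy_dev *dev, uint64_t peer_bus_addr)
{
	(void)dev;
	return peer_bus_addr;
}
```

In this model the P2P path is an identity function that touches no device
state, which is the point: neither step of the mapping does any work for it.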

> Very long term the IOMMUs under the ops will need to care about this,
> so the wrapper is not an optimal place to put it - but I wouldn't
> object if it gets it out of RDMA :)

Unless you have an IOMMU on your PCIe switch and not before/inside
the root complex, that is not correct.