Re: Enabling peer to peer device transactions for PCIe devices

From: Jason Gunthorpe
Date: Wed Nov 23 2016 - 17:32:17 EST


On Wed, Nov 23, 2016 at 02:11:29PM -0700, Logan Gunthorpe wrote:
> > As I said, there is no possible special handling. Standard IB hardware
> > does not support changing the DMA address once a MR is created. Forget
> > about doing that.
>
> Yeah, that's essentially the point I was trying to make. Not to mention
> all the other unrelated hardware that can't DMA to an address that might
> disappear mid-transfer.

Right, it is impossible to ask for generic page migration with ongoing
DMA. That is simply not supported by any of the hardware at all.

> > Only ODP hardware allows changing the DMA address on the fly, and it
> > works at the page table level. We do not need special handling for
> > RDMA.
>
> I am aware of ODP but, noted by others, it doesn't provide a general
> solution to the points above.

How do you mean?

Perhaps I am not following what Serguei is asking for, but I
understood the desire was for a complex GPU allocator that could
migrate pages between GPU and CPU memory under control of the GPU
driver, among other things. The desire is for DMA to continue to work
even after these migrations happen.

Page table mirroring *is* the general solution for this problem. The
GPU driver controls the VMA and the DMA driver mirrors that VMA.
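
To make the mirroring idea concrete, here is a toy model, not kernel
code. Every name in it (OwnerPageTable, DeviceMirror, map_page and so
on) is made up for illustration. It assumes the owner of the mapping
notifies each mirror before and after changing an entry, so the mirror
never translates through a stale entry.

```python
class DeviceMirror:
    """Hypothetical DMA-side copy of the owner's page table."""

    def __init__(self):
        self.entries = {}  # virtual page number -> physical location

    def invalidate(self, vpn):
        # Drop the entry so the device cannot use a stale translation.
        self.entries.pop(vpn, None)

    def update(self, vpn, location):
        self.entries[vpn] = location

    def translate(self, vpn):
        return self.entries[vpn]


class OwnerPageTable:
    """Hypothetical stand-in for the mapping the GPU driver controls."""

    def __init__(self):
        self.entries = {}
        self.mirrors = []

    def register_mirror(self, mirror):
        # Bring a new mirror up to date with the current mappings.
        self.mirrors.append(mirror)
        for vpn, location in self.entries.items():
            mirror.update(vpn, location)

    def map_page(self, vpn, location):
        # Invalidate every mirror first, then publish the new location.
        for mirror in self.mirrors:
            mirror.invalidate(vpn)
        self.entries[vpn] = location
        for mirror in self.mirrors:
            mirror.update(vpn, location)


owner = OwnerPageTable()
mirror = DeviceMirror()
owner.map_page(0, ("cpu", 0x1000))
owner.register_mirror(mirror)
owner.map_page(0, ("gpu", 0x2000))  # the owner migrates the page
```

After the last call the mirror translates page 0 to the GPU location
without the DMA side having decided anything itself.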

Do you know of another option that doesn't just degenerate to page
table mirroring?

Remember, there are two facets to the RDMA ODP implementation; I feel
there is some confusion here.

The crucial part for this discussion is the ability to fence and block
DMA for a specific range. This is the hardware capability that lets
page migration happen: fence&block DMA, migrate page, update page
table in HCA, unblock DMA.
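
As a rough sketch of that sequence, here is a single-threaded toy
model. ToyHCA and migrate_page are hypothetical names, not any real
driver or verbs API, and the model assumes a blocked DMA is simply held
and replayed after the unblock. Real hardware would also have to drain
DMA already in flight before the fence completes.

```python
class ToyHCA:
    """Hypothetical HCA with a per-page fence&block capability."""

    def __init__(self):
        self.page_table = {}  # virtual page number -> backing bytearray
        self.blocked = set()  # pages currently fenced
        self.pending = []     # DMA writes held while their page is blocked

    def update_page_table(self, vpn, buf):
        self.page_table[vpn] = buf

    def dma_write(self, vpn, offset, data):
        if vpn in self.blocked:
            # Hold the write instead of touching memory that may move.
            self.pending.append((vpn, offset, data))
            return False
        self.page_table[vpn][offset:offset + len(data)] = data
        return True

    def fence_and_block(self, vpn):
        # In this toy every dma_write finishes before it returns, so
        # there is nothing in flight to wait for.
        self.blocked.add(vpn)

    def unblock(self, vpn):
        self.blocked.discard(vpn)
        held, self.pending = self.pending, []
        for entry in held:
            # Writes to pages that are still blocked get held again.
            self.dma_write(*entry)


def migrate_page(hca, vpn, during=None):
    """fence&block DMA, migrate page, update page table, unblock DMA."""
    hca.fence_and_block(vpn)
    if during is not None:
        during()  # e.g. a DMA that arrives in the middle of the migration
    new_buf = bytearray(hca.page_table[vpn])  # copy to the new location
    hca.update_page_table(vpn, new_buf)
    hca.unblock(vpn)
    return new_buf


hca = ToyHCA()
old_buf = bytearray(8)
hca.update_page_table(0, old_buf)
hca.dma_write(0, 0, b"ab")
new_buf = migrate_page(hca, 0, during=lambda: hca.dma_write(0, 2, b"cd"))
```

The write that arrives during the migration lands only in the new
buffer; the old one is never touched after the fence.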

Without that hardware support the DMA address must be unchanging, and
there is nothing we can do about it. This is why standard IB hardware
must have fixed MRs - it lacks the fence&block capability.

The other part is the page faulting implementation, but that is not
required, and to Serguei's point, is not desired for GPU anyhow.

> > To me this means at least items #1 and #3 should be removed from
> > Alexander's list.
>
> It's also worth noting that #4 makes use of ZONE_DEVICE (#2) so they are
> really the same option. iopmem is really just one way to get BAR
> addresses to user-space while inside the kernel it's ZONE_DEVICE.

Seems fine for RDMA?

Didn't we just strike off everything on the list except #2? :\

Jason