Re: Enabling peer to peer device transactions for PCIe devices

From: Dan Williams
Date: Wed Nov 23 2016 - 17:42:30 EST


On Wed, Nov 23, 2016 at 1:55 PM, Jason Gunthorpe
<jgunthorpe@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Wed, Nov 23, 2016 at 02:11:29PM -0700, Logan Gunthorpe wrote:
>> > As I said, there is no possible special handling. Standard IB hardware
>> > does not support changing the DMA address once a MR is created. Forget
>> > about doing that.
>>
>> Yeah, that's essentially the point I was trying to make. Not to mention
>> all the other unrelated hardware that can't DMA to an address that might
>> disappear mid-transfer.
>
> Right, it is impossible to ask for generic page migration with ongoing
> DMA. That is simply not supported by any of the hardware at all.
>
>> > Only ODP hardware allows changing the DMA address on the fly, and it
>> > works at the page table level. We do not need special handling for
>> > RDMA.
>>
>> I am aware of ODP but, noted by others, it doesn't provide a general
>> solution to the points above.
>
> How do you mean?
>
> Perhaps I am not following what Serguei is asking for, but I
> understood the desire was for a complex GPU allocator that could
> migrate pages between GPU and CPU memory under control of the GPU
> driver, among other things. The desire is for DMA to continue to work
> even after these migrations happen.
>
> Page table mirroring *is* the general solution for this problem. The
> GPU driver controls the VMA and the DMA driver mirrors that VMA.
>
> Do you know of another option that doesn't just degenerate to page
> table mirroring??
>
> Remember, there are two facets to the RDMA ODP implementation, I feel
> there is some confusion here..
>
> The crucial part for this discussion is the ability to fence and block
> DMA for a specific range. This is the hardware capability that lets
> page migration happen: fence&block DMA, migrate page, update page
> table in HCA, unblock DMA.

Wait, ODP requires migratable pages, ZONE_DEVICE pages are not
migratable. You can't replace a PCIe mapping with just any other
System RAM physical address, right? At least not without a filesystem
recording where things went, but at point we're no longer talking
about the base P2P-DMA mapping mechanism and are instead talking about
something like pnfs-rdma to a DAX filesystem.