Re: Enabling peer to peer device transactions for PCIe devices

From: Logan Gunthorpe
Date: Thu Nov 24 2016 - 13:11:45 EST

On 24/11/16 09:42 AM, Jason Gunthorpe wrote:
> There are three cases to worry about:
> - Coherent long lived page table mirroring (RDMA ODP MR)
> - Non-coherent long lived page table mirroring (RDMA MR)
> - Short lived DMA mapping (everything else)
> Like you say below we have to handle short lived in the usual way, and
> that covers basically every device except IB MRs, including the
> command queue on a NVMe drive.

Yes, this makes sense to me. Though I thought regular IB MRs with
regular memory currently pin the pages (despite being long lived),
which is why we can run up against the "max locked memory" limit. It
doesn't seem so terrible if GPU memory had a similar restriction until
ODP-like solutions get implemented.
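
For reference, the "max locked memory" limit mentioned above is the
RLIMIT_MEMLOCK resource limit (what `ulimit -l` reports). A minimal
user-space sketch of checking it before registering a large region
might look like the following. The helper names memlock_soft_limit()
and fits_under_memlock() are made up for illustration, and the
"already locked" figure is something the caller would have to track
itself; this is not how the RDMA core does its accounting.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper: fetch the soft RLIMIT_MEMLOCK value.
 * Returns 0 on success and stores the limit in *soft, -1 on error.
 */
int memlock_soft_limit(rlim_t *soft)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;
    *soft = rl.rlim_cur;
    return 0;
}

/*
 * Hypothetical helper: would pinning `bytes` more stay within `limit`,
 * given that the caller believes `already_locked` bytes are pinned?
 * Returns 1 if it fits, 0 otherwise.
 */
int fits_under_memlock(rlim_t limit, rlim_t already_locked, rlim_t bytes)
{
    if (limit == RLIM_INFINITY)
        return 1;               /* no limit configured */
    if (already_locked > limit)
        return 0;               /* already over the limit */
    return bytes <= limit - already_locked;
}
```

A caller would fetch the limit once with memlock_soft_limit() and then
use fits_under_memlock() as a cheap pre-check; the registration itself
can of course still fail for other reasons.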

>> Yeah, we've had RDMA and O_DIRECT transfers to PCIe backed ZONE_DEVICE
>> memory working for some time. I'd say it's a good fit. The main question
>> we've had is how to expose PCIe bars to userspace to be used as MRs and
>> such.

> Is there any progress on that?

Well, I guess there's some consensus building to do. The existing
options are:

* Device DAX: which could work, but the problem I see with it is that
it only allows one application to do these transfers. Or there would
have to be some user-space coordination to figure out which application
gets what.

* Regular DAX in the FS doesn't work at this time because the FS can
move the file you think you're transferring to out from under you.
Though I understand there's been some work with XFS to solve that issue.

Though, we've been considering that the backing memory would be
non-volatile, which adds some of this complexity. If the memory were
volatile the kernel would just need to do some relatively
straightforward allocation to user-space when asked. For example, with
NVMe, the kernel could give chunks of the CMB buffer to userspace via an
mmap call to /dev/nvmeX. Though I think there's been some pushback
against things like that as well.
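
To make the mmap idea concrete, here is a rough user-space sketch of
what the calling side might look like. Note the assumptions: no such
mmap interface on /dev/nvmeX exists as far as this thread is concerned,
the helper name map_device_chunk() is invented, and how the kernel
would pick or account for the chunk behind a given offset is exactly
the open question. The code itself is just plain open() + mmap() and
works on any mappable file.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `len` bytes at `offset` of `path` as a
 * shared read/write mapping. For the scheme discussed above, `path`
 * would be something like /dev/nvmeX and the driver's mmap handler
 * would hand out a chunk of the CMB (assumed, not an existing API).
 * Returns MAP_FAILED on any error.
 */
void *map_device_chunk(const char *path, size_t len, off_t offset)
{
    void *p;
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return MAP_FAILED;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);

    /* The mapping stays valid after the descriptor is closed. */
    close(fd);
    return p;
}
```

The returned pointer is what an application would then pass on as an
MR or as an O_DIRECT buffer, and it would release the chunk with
munmap() when done.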

> I still don't quite get what iopmem was about.. I thought the
> objection to uncachable ZONE_DEVICE & DAX made sense, so running DAX
> over iopmem and still ending up with uncacheable mmaps still seems
> like a non-starter to me...

The latest incarnation of iopmem simply created a block device backed by
ZONE_DEVICE memory on a PCIe BAR. We then put a DAX FS on it and
user-space could mmap the files and send them to other devices to do
P2P transfers.

I don't think there was a hard objection to uncacheable ZONE_DEVICE and
DAX. We did try our experimental hardware with cached ZONE_DEVICE and it
did work, but the performance was beyond unusable (which may be a
hardware issue). In the end I feel the driver would have to decide the
most appropriate caching for the hardware, and I don't understand why WC
or UC wouldn't work with ZONE_DEVICE.