Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

From: Jerome Glisse
Date: Thu Jan 31 2019 - 14:35:22 EST


On Thu, Jan 31, 2019 at 07:02:15PM +0000, Jason Gunthorpe wrote:
> On Thu, Jan 31, 2019 at 09:13:55AM +0100, Christoph Hellwig wrote:
> > On Wed, Jan 30, 2019 at 03:52:13PM -0700, Logan Gunthorpe wrote:
> > > > *shrug* so what if the special GUP called a VMA op instead of
> > > > traversing the VMA PTEs today? Why does it really matter? It could
> > > > easily change to a struct page flow tomorrow..
> > >
> > > Well it's so that it's composable. We want the SGL->DMA side to work for
> > > APIs from kernel space and not have to run a completely different flow
> > > for kernel drivers than from userspace memory.
> >
> > Yes, I think that is the important point.
> >
> > All the other struct page discussion is not about anyone of us wanting
> > struct page - heck it is a pain to deal with, but then again it is
> > there for a reason.
> >
> > In the typical GUP flows we have three uses of a struct page:
> >
> > (1) to carry a physical address. This is mostly through
> > struct scatterlist and struct bio_vec. We could just store
> > a magic PFN-like value that encodes the physical address
> > and allow looking up a page if it exists, and we had at least
> > two attempts at it. In some way I think that would actually
> > make the interfaces cleaner, but Linus has NACKed it in the
> > past, so we'll have to convince him first that this is the
> > way forward
>
> Something like this (and more) has always been the roadblock with
> trying to mix BAR memory into SGL. I think it is such a big problem as
> to be unsolvable in one step..
>
> Struct page doesn't even really help anything beyond dma_map as we
> still can't pretend that __iomem is normal memory for general SGL
> users.
>
> > (2) to keep a reference to the memory so that it doesn't go away
> > under us due to swapping, process exit, unmapping, etc.
> > No idea how we want to solve this, but I guess you have
> > some smart ideas?
>
> Jerome, how does this work anyhow? Did you do something to make the
> VMA lifetime match the p2p_map/unmap? Or can we get into a situation
> were the VMA is destroyed and the importing driver can't call the
> unmap anymore?
>
> I know in the case of notifiers the VMA liftime should be strictly
> longer than the map/unmap - but does this mean we can never support
> non-notifier users via this scheme?

So in this version the requirement is that the importer also have a mmu
notifier registered and that's what all GPU driver do already. Any
driver that map some range of vma to a device should register itself as
a mmu notifier listener to do something when vma goes away. I posted a
patchset a while ago to allow listener to differentiate when the vma is
going away from other type of invalidation [1]

With that in place you can easily handle the pin case. Driver really
need to do something when the vma goes away with GUP or not. As the
device is then writing/reading to/from something that does not match
anything in the process address space.

So user that want pin would register notifier, call p2p_map with pin
flag and ignore all notifier callback except the unmap one when the
unmap one happens they have the vma and they should call p2p_unmap
from their invalidate callback and update their device to either some
dummy memory or program it in a way that the userspace application
will notice.

This can all be handled by some helper so that driver do not have to
write more than 5 lines of code and function to update their device
mapping to something of their choosing.


>
> > (3) to make the PTEs dirty after writing to them. Again no sure
> > what our preferred interface here would be
>
> This need doesn't really apply to BAR memory..
>
> > If we solve all of the above problems I'd be more than happy to
> > go with a non-struct page based interface for BAR P2P. But we'll
> > have to solve these issues in a generic way first.
>
> I still think the right direction is to build on what Logan has done -
> realize that he created a DMA-only SGL - make that a formal type of
> the kernel and provide the right set of APIs to work with this type,
> without being forced to expose struct page.
>
> Basically invert the API flow - the DMA map would be done close to
> GUP, not buried in the driver. This absolutely doesn't work for every
> flow we have, but it does enable the ones that people seem to care
> about when talking about P2P.

This does not work for GPU really i do not want to have to rewrite GPU
driver for this. Struct page is a burden and it does not bring anything
to the table. I rather provide an all in one stop for driver to use
this without having to worry between regular vma and special vma.

Note that in this patchset i reuse chunk of Logan works and intention is
to also allow PCI struct page to work too. But they should not be the
only mechanisms.

>
> To get to where we are today we'd need a few new IB APIs, and some
> nvme change to work with DMA-only SGL's and so forth, but that doesn't
> seem so bad. The API also seems much more safe and understandable than
> todays version that is trying to hope that the SGL is never touched by
> the CPU.
>
> It also does present a path to solve some cases of the O_DIRECT
> problems if the block stack can develop some way to know if an IO will
> go down a DMA-only IO path or not... This seems less challenging that
> auditing every SGL user for iomem safety??

So what is this O_DIRECT thing that keep coming again and again here :)
What is the use case ? Note that bio will always have valid struct page
of regular memory as using PCIE BAR for filesystem is crazy (you do not
have atomic or cache coherence and many CPU instruction have _undefined_
effect so what ever the userspace would do might do nothing.

Now if you want to use BAR address as destination or source of directIO
then let just update the directIO code to handle this. There is no need
to go hack every single place in the kernel that might deal with struct
page or sgl. Just update the place that need to understand this. We can
even update directIO to work on weird platform. The change to directIO
will be small, couple hundred line of code at best.

Cheers,
Jérôme

[1] https://lore.kernel.org/linux-fsdevel/20190123222315.1122-1-jglisse@xxxxxxxxxx/T/#m69e8f589240e18acbf196a1c8aa1d6fc97bd3565