Re: [HMM v16 04/15] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory v2

From: Jerome Glisse
Date: Mon Jan 16 2017 - 21:00:54 EST


On Mon, Jan 16, 2017 at 04:58:24PM -0800, Dan Williams wrote:
> On Mon, Jan 16, 2017 at 12:13 PM, Jerome Glisse <jglisse@xxxxxxxxxx> wrote:
> > On Mon, Jan 16, 2017 at 11:31:39AM -0800, Dan Williams wrote:
> [..]
> >> >> dev_pagemap is only meant for get_user_pages() to do lookups of ptes
> >> >> with _PAGE_DEVMAP and take a reference against the hosting device..
> >> >
> >> > And i want to build on top of that to extend _PAGE_DEVMAP to support
> >> > a new usecase for unaddressable device memory.
> >> >
> >> >>
> >> >> Why can't HMM use the typical vm_operations_struct fault path and push
> >> >> more of these details to a driver rather than the core?
> >> >
> >> > Because the vm_operations_struct has nothing to do with the device.
> >> > We are talking about regular vma here. Think malloc, mmap, share
> >> > memory, ... not about mmap(/dev/thedevice,...)
> >> >
> >> > So the vm_operations_struct is never under device control and we can
> >> > not, nor want to, rely on that.
> >>
> >> Can you explain more what's behind that "can not, nor want to"
> >> statement? It seems to me that any awkwardness of moving to a
> >> standalone device file interface is less than a maintaining a new /
> >> parallel mm fault path through dev_pagemap.
> >
> > The whole point of HMM is to allow transparent usage of process address
> > space on to a device like GPU. So it imply any vma (vm_area_struct) that
> > result from usual mmap (ie any mmap either PRIVATE or SHARE as long as it
> > is not a an mmap of a device file).
> >
> > It means that application can use malloc or the usual memory allocation
> > primitive of the langage (c++, rust, python, ...) and directly use the
> > memory it gets from that with the device.
>
> So you need 100% support of all these mm paths for this hardware to be
> useful at all? Does a separate device-driver and a userpace helper
> library get you something like 80% of the functionality and then we
> can debate the core mm changes to get the final 20%? Or am I just
> completely off base with how people want to use this hardware?

Can't do that. Think library want to use GPU but you do not want to update
every single program that use that library and library get its memory from
the application. This is just one scenario. Then you have mmaped file, or
share memory, ...

Transparent address space is where the industry is moving and sadly on some
platform (like Intel) we can not rely on hardware to solve it for us.


> > Device like GPU have a large pool of device memory that is not accessible
> > by the CPU. This device memory has 10 times more bandwidth than system
> > memory and has better latency then PCIE. Hence for the whole thing to
> > make sense you need to allow to use it.
> >
> > For that you need to allow migration from system memory to device memory.
> > Because you can not rely on special userspace allocator you have to
> > assume that the vma (vm_area_struct) is a regular one. So we are left
> > with having struct page for the device memory to allow migration to
> > work without requiring too much changes to existing mm.
> >
> > Because device memory is not accessible by the CPU, you can not allow
> > anyone to pin it and thus get_user_page* must trigger a migration back
> > as CPU page fault would.
> >
> >
> >> > So what we looking for here is struct page that can behave mostly
> >> > like anyother except that we do not want to allow GUP to take a
> >> > reference almost exactly what ZONE_DEVICE already provide.
> >> >
> >> > So do you have any fundamental objections to this patchset ? And if
> >> > so, how do you propose i solve the problem i am trying to address ?
> >> > Because hardware exist today and without something like HMM we will
> >> > not be able to support such hardware.
> >>
> >> My pushback stems from it being a completely different use case for
> >> devm_memremap_pages(), as evidenced by it growing from 4 arguments to
> >> 9, and the ongoing maintenance overhead of understanding HMM
> >> requirements when updating the pmem usage of ZONE_DEVICE.
> >
> > I rather reuse something existing and modify it to support more use case
> > than try to add ZONE_DEVICE2 or ZONE_DEVICE_I_AM_DIFFERENT. I have made
> > sure that my modifications to ZONE_DEVICE can be use without HMM. It is
> > just a generic interface to support page fault and to allow to track last
> > user of a device page. Both can be use indepentently from each other.
> >
> > To me the whole point of kernel is trying to share infrastructure accross
> > as many hardware as possible and i am doing just that. I do not think HMM
> > should be block because something that use to be for one specific use case
> > now support 2 use cases. I am not breaking anything existing. Is it more
> > work for you ? Maybe, but at Red Hat we intend to support it for as long
> > as it is needed so you always have some one to talk to if you want to
> > update ZONE_DEVICE.
>
> Sharing infrastructure should not come at the expense of type safety
> and clear usage rules.

And where exactly do i violate that ?

> For example the pmem case, before exposing ZONE_DEVICE memory to other
> parts of the kernel, introduced the pfn_t type to distinguish DMA
> capable pfns from other raw pfns. All programmatic ways of discovering
> if a pmem range can support DMA use this type and explicit flags.

I am protected from this because i do not allow GUP. GUP trigger migration
back to regular system memory.

>
> While we may not need ZONE_DEVICE2 we obviously need a different
> wrapper around arch_add_memory() than devm_memremap_pages() for HMM
> and likely a different physical address radix than pgmap_radix because
> they are servicing 2 distinct purposes. For example, I don't think HMM
> should be using unmodified arch_add_memory(). We shouldn't add
> unaddressable memory to the linear address mappings when we know there
> is nothing behind it, especially when it seems all you need from
> arch_add_memory() is pfn_to_page() to be valid.

And my patchset does just that, i do not add the device pfn to the linear
mapping because there is nothing there. In arch_add_memory() x86, ppc, arm
do barely more than setting up linear mapping and adding struct page. So
instead of splitting in two this function i just made the linear mapping
conditional.

I can split HMM from devm_memremap_pages() and thus from a different
pgmap_radix. You have to understand that this will not change most of
my patchset.


Cheers,
Jérôme