Re: Enabling peer to peer device transactions for PCIe devices

From: Dan Williams
Date: Tue Nov 22 2016 - 16:21:55 EST


On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter <daniel@xxxxxxxx> wrote:
> On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch
> <serguei.sagalovitch@xxxxxxx> wrote:
>>
>> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
>>>
>>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams <dan.j.williams@xxxxxxxxx>
>>> wrote:
>>>>
>>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
>>>> <serguei.sagalovitch@xxxxxxx> wrote:
>>>>>
>>>>> I personally like "device-DAX" idea but my concerns are:
>>>>>
>>>>> - How well it will co-exists with the DRM infrastructure /
>>>>> implementations
>>>>> in part dealing with CPU pointers?
>>>>
>>>> Inside the kernel a device-DAX range is "just memory" in the sense
>>>> that you can perform pfn_to_page() on it and issue I/O, but the vma is
>>>> not migratable. To be honest I do not know how well that co-exists
>>>> with drm infrastructure.
>>>>
>>>>> - How well we will be able to handle case when we need to
>>>>> "move"/"evict"
>>>>> memory/data to the new location so CPU pointer should point to the
>>>>> new
>>>>> physical location/address
>>>>> (and may be not in PCI device memory at all)?
>>>>
>>>> So, device-DAX deliberately avoids support for in-kernel migration or
>>>> overcommit. Those cases are left to the core mm or drm. The device-dax
>>>> interface is for cases where all that is needed is a direct-mapping to
>>>> a statically-allocated physical-address range be it persistent memory
>>>> or some other special reserved memory range.
>>>
>>> For some of the fancy use-cases (e.g. to be comparable to what HMM can
>>> pull off) I think we want all the magic in core mm, i.e. migration and
>>> overcommit. At least that seems to be the very strong drive in all
>>> general-purpose gpu abstractions and implementations, where memory is
>>> allocated with malloc, and then mapped/moved into vram/gpu address
>>> space through some magic,
>>
>> It is possible that there is other way around: memory is requested to be
>> allocated and should be kept in vram for performance reason but due
>> to possible overcommit case we need at least temporally to "move" such
>> allocation to system memory.
>
> With migration I meant migrating both ways of course. And with stuff
> like numactl we can also influence where exactly the malloc'ed memory
> is allocated originally, at least if we'd expose the vram range as a
> very special numa node that happens to be far away and not hold any
> cpu cores.

I don't think we should be using numa distance to reverse engineer a
certain allocation behavior. The latency data should be truthful, but
you're right we'll need a mechanism to keep general purpose
allocations out of that range by default. Btw, strict isolation is
another design point of device-dax, but I think in this case we're
describing something between the two extremes of full isolation and
full compatibility with existing numactl apis.