Re: [PATCH V3 0/4] Define coherent device memory node

From: Anshuman Khandual
Date: Mon Mar 06 2017 - 00:14:56 EST


On 02/23/2017 09:27 PM, Mel Gorman wrote:
> On Tue, Feb 21, 2017 at 06:39:17PM +0530, Anshuman Khandual wrote:
>>>>> In itself, the series does very little and as Vlastimil already pointed
>>>>> out, it's not a good idea to try merge piecemeal when people could not
>>>>> agree on the big picture (I didn't dig into it).
>>>>
>>>> With the proposed kernel changes and an associated driver, the framework
>>>> is complete enough to drive user space based CPU/device hybrid compute
>>>> interchangeably on an mmap() allocated memory buffer, transparently and
>>>> effectively.
>>>
>>> How is the device informed that data is available for processing?
>>
>> It will through a call to the driver from user space which can take
>> the required buffer address as an argument.
>>
>
> Which goes back to tending towards what HMM intended but Jerome has covered
> all relevant point there so I won't repeat any of them in this response. It
> did sound that in part HMM was not used because it was missing some
> small steps which could have been included instead of proposing something
> different that did not meet their requirements but requires special casing.

Hmm. As I mentioned in reply to Jerome on the HMM thread, at this point
we really don't know how much work is required to get the missing pieces
of support (file mapping on all file systems, LRU support, etc.) done in
HMM before it can support a ZONE_DEVICE based CDM implementation. But
let's discuss these things on the other thread that Balbir started.

>
>>> What prevents an application modifying the data on the device while it's
>>> being processed?
>>
>> Nothing in software. The application should take care of that but access
>> from both sides are coherent. It should wait for the device till it
>> finishes the compute it had asked for earlier to prevent override and
>> eventual corruption.
>>
>
> Which adds the caveat that applications must be fully CDM aware so if
> there are additional calls related to policies or administrative tasks
> for cpusets then it follows the application can also be aware of them.

Yeah, got your point.

>
>>> Why can this not be expressed with cpusets and memory policies
>>> controlled by a combination of administrative steps for a privileged
>>> application and an application that is CDM aware?
>>
>> Hmm, that can be done but having an in kernel infrastructure has the
>> following benefits.
>>
>> * Administrator does not have to listen to node add notifications
>> and keep the isolation/allowed cpusets upto date all the time.
>> This can be a significant overhead on the admin/userspace which
>> have a number of separate device memory nodes.
>>
>
> Could be handled with udev triggers potentially or if udev events are not
> raised by the memory hot-add then it could still be polled.

Okay.
>
>> * With the cpuset solution, tasks which are part of a CDM allowed cpuset
>> can have all their VMAs allocate from CDM memory, which may not be
>> something the user wants. For example the user may not want to have
>> the text segments, libraries allocate from CDM. To achieve this
>> the user will have to explicitly block allocation access from CDM
>> through mbind(MPOL_BIND) memory policy setups. This negative setup
>> is a big overhead. But with in kernel CDM framework, isolation is
>> enabled by default. For CDM allocations the application just has
>> to setup memory policy with CDM node in the allowed nodemask.
>>
>
> Then distinguish between task-wide policies that forbid CDM nodes and
> per-VMA policies that allow the CDM nodes. Migration between system

So the application first calls set_mempolicy() to deny itself the CDM
nodes task-wide (though the cpuset allows them), and then calls mbind()
with the CDM nodes for each buffer it wants placed in CDM memory? Is
that what you are suggesting?
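
To make that sequence concrete, here is a minimal user-space sketch of
the two steps, under stated assumptions. It uses the raw set_mempolicy()
and mbind() system calls so that it does not depend on libnuma. The
helper names and the single-word nodemask are illustrative only, and
which node numbers are CDM on a given system is left to the caller; none
of this is code from the series.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Policy mode value from the kernel UAPI header <linux/mempolicy.h>. */
#ifndef MPOL_BIND
#define MPOL_BIND 2
#endif

/* Assumption: one unsigned long is enough to describe every node. */
#define MASK_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Bit for one node in a single-word nodemask. */
unsigned long node_bit(unsigned int node)
{
    assert(node < MASK_BITS);
    return 1UL << node;
}

/* Nodes the task may use by default: everything allowed minus CDM. */
unsigned long system_ram_mask(unsigned long allowed, unsigned long cdm)
{
    return allowed & ~cdm;
}

/* Step 1: task-wide policy that keeps default allocations off CDM. */
long deny_cdm_task_wide(unsigned long allowed, unsigned long cdm)
{
    unsigned long mask = system_ram_mask(allowed, cdm);

    /* maxnode is passed as "bits + 1", the convention libnuma uses. */
    return syscall(SYS_set_mempolicy, MPOL_BIND, &mask, MASK_BITS + 1);
}

/*
 * Step 2: per-VMA policy that allows CDM nodes for one buffer.
 * buf must be page aligned (for example, returned by mmap()).
 */
long bind_buffer_to_cdm(void *buf, size_t len, unsigned long cdm)
{
    return syscall(SYS_mbind, buf, len, MPOL_BIND, &cdm,
                   MASK_BITS + 1, 0);
}
```

Both wrappers return -1 with errno set on failure, which is expected if
the chosen nodes are not online or not permitted by the cpuset.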

> memory and devices remains a separate problem but migration would also
> not be covered by special casing the allocator.

If the CDM node is allowed for a VMA along with other system RAM nodes,
then the driver can migrate pages between them whenever it wants. What
would be the problem?

>
>> Even with cpuset solution, applications still need to know which nodes
>> are CDM on the system at given point of time. So we will have to store
>> it in a nodemask and export them on sysfs some how.
>>
>
> Which in itself is not too bad and doesn't require special casing the
> allocator.

Right, the first patch in the series would do.
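
For the user space side of this, a rough sketch of how an application
could turn such an exported node list into a mask is shown here. It
assumes the CDM nodes would be exported in the same "0-1,8" list format
that existing files under /sys/devices/system/node/ (such as has_memory)
already use; the name of the CDM attribute itself is not assumed here,
and parse_nodelist() is a hypothetical helper, not part of the series.

```c
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a sysfs-style node list such as "0-1,8" into a single-word
 * bitmask. Parsing stops at the end of the string or at a newline.
 * Returns 0 on success, -1 on malformed input or if a node number
 * does not fit into one unsigned long.
 */
int parse_nodelist(const char *s, unsigned long *mask)
{
    const unsigned long nbits = sizeof(unsigned long) * CHAR_BIT;
    char *end;

    *mask = 0;
    while (*s != '\0' && *s != '\n') {
        unsigned long lo, hi, n;

        if (*s < '0' || *s > '9')
            return -1;
        lo = strtoul(s, &end, 10);
        hi = lo;

        /* Optional "-hi" part of a range. */
        if (*end == '-') {
            s = end + 1;
            if (*s < '0' || *s > '9')
                return -1;
            hi = strtoul(s, &end, 10);
        }
        if (lo > hi || hi >= nbits)
            return -1;

        for (n = lo; n <= hi; n++)
            *mask |= 1UL << n;

        /* Entries are separated by commas. */
        if (*end == ',')
            end++;
        else if (*end != '\0' && *end != '\n')
            return -1;
        s = end;
    }
    return 0;
}
```

The caller would read one line from the sysfs file (for example with
fgets()) and pass it in; the resulting mask can then be used when
building the nodemask for mbind().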

>
>>>
>>>> I had also
>>>> mentioned these points on the last posting in response to a comment from
>>>> Vlastimil.
>>>>
>>>> From this response (https://lkml.org/lkml/2017/2/14/50).
>>>>
>>>> * User space using mbind() to get CDM memory is an additional benefit
>>>> we get by making the CDM plug in as a node and be part of the buddy
>>>> allocator. But the over all idea from the user space point of view
>>>> is that the application can allocate any generic buffer and try to
>>>> use the buffer either from the CPU side or from the device without
>>>> knowing about where the buffer is really mapped physically. That
>>>> gives a seamless and transparent view to the user space where CPU
>>>> compute and possible device based compute can work together. This
>>>> is not possible through a driver allocated buffer.
>>>>
>>>
>>> Which can also be done with cpusets that prevents use of CDM memory and
>>> place all non-CDM processes into that cpuset with a separate cpuset for
>>> CDM-aware applications that allow access to CDM memory.
>>
>> Right, but with additional overheads as explained above.
>>
>
> The application must already be aware of the CDM nodes.

Absolutely.

>
>>>> * If an application has not used the CDM memory backing its buffer for
>>>> a long time, and another application is forced to fall back on system
>>>> RAM when what it really wanted was CDM, the driver can detect these kinds
>>>> of situations through memory access patterns on the device HW and
>>>> take necessary migration decisions.
>>>>
>>>> I hope this explains the rationale of the framework. In fact these
>>>> four patches give logically complete CPU/Device operating framework.
>>>> Other parts of the bigger picture are VMA management, KSM, Auto NUMA
>>>> etc which are improvements on top of this basic framework.
>>>>
>>>
>>> Automatic NUMA balancing is a particular oddity as that is about
>>> CPU->RAM locality and not RAM->device considerations.
>>
>> Right. But when there are migrations happening between system RAM and
>> device memory. Auto NUMA with its CPU fault information can migrate
>> between system RAM nodes which might not be necessary and can lead to
>> conflict or overhead. Hence Auto NUMA needs to be switched off at times
>> for the VMAs of concern but its not addressed in the patch series. As
>> mentioned before, it will be in the follow up work as improvements on
>> this series.
>
> Ensure the policy settings for CDM-backed VMAs do not set MPOL_F_MOF and
> automatic NUMA balancing will skip them. It does not require special casing
> of the allocator or specific CDM-awareness.

Though changes to automatic NUMA balancing were not proposed in this
series, I will take this into account and check whether these internal
policy flags of the VMAs (which cannot be set from user space) can be
set from the driver, so that the VMAs don't participate in automatic
NUMA balancing.

>
>>> The memblock is to only avoid bootmem allocations from that area. It can
>>> be managed in the arch layer to first pass in all the system ram,
>>> teardown the bootmem allocator, setup the nodelists, set system
>>> nodemask, init CDM, init the allocator for that, and then optionally add
>>> it to the system CDM for userspace to do the isolation or provide.
>>>
>>> For that matter, the driver could do the discovery and then fake a
>>> memory hot-add.
>>
>> Not sure I got this correctly. Could you please explain more.
>>
>
> Discover the device, and online the memory later as memory hotplug generally
> does. If the faked memory hot-add operation raised an event that udev
> can detect then the administrative functions could also be triggered
> in userspace.

Yes, that is possible. On POWER we don't support hot add of a node,
but the node can be detected during boot as a memoryless node, and
memory can later be hot plugged into it, which can be detected as a
udev event as you have mentioned. As Balbir has started another sub
thread discussing the merits and disadvantages of the various
approaches, we will discuss them on that thread.