Re: [PATCH V3 0/4] Define coherent device memory node

From: Anshuman Khandual
Date: Thu Feb 23 2017 - 03:53:50 EST

On 02/22/2017 02:59 PM, Michal Hocko wrote:
> On Tue 21-02-17 18:39:17, Anshuman Khandual wrote:
>> On 02/17/2017 07:02 PM, Mel Gorman wrote:
> [...]
>>> Why can this not be expressed with cpusets and memory policies
>>> controlled by a combination of administrative steps for a privileged
>>> application and an application that is CDM aware?
>> Hmm, that can be done but having an in kernel infrastructure has the
>> following benefits.
>> * Administrator does not have to listen to node add notifications
>> and keep the isolation/allowed cpusets up to date all the time.
>> This can be a significant overhead on the admin/userspace which
>> have a number of separate device memory nodes.
> But the application has to communicate with the device so why it cannot
> use a device specific allocation as well? I really fail to see why
> something this special should hide behind a generic API to spread all
> the special casing into the kernel instead.

Eventually both the memory and the compute parts of a hybrid
CPU/device scheme should be implemented through a generic API.
The scheduler should be able to take input from user space
to schedule a device specific compute function on a device
compute thread for a single run. The scheduler should then have
callbacks, registered by the device, which it can call
to schedule the function on the device. But right now
we are not there yet. So we are going half the way and
trying to do it only for memory for now.

>> * With cpuset solution, tasks which are part of CDM allowed cpuset
>> can have all its VMAs allocate from CDM memory which may not be
>> something the user wants. For example, the user may not want to have
>> the text segments, libraries allocate from CDM. To achieve this
>> the user will have to explicitly block allocation access from CDM
>> through mbind(MPOL_BIND) memory policy setups. This negative setup
>> is a big overhead. But with the in-kernel CDM framework, isolation is
>> enabled by default. For CDM allocations the application just has
>> to set up a memory policy with the CDM node in the allowed nodemask.
> Which makes cpusets vs. mempolicies even bigger mess, doesn't it? So say

Hence I am trying to defend the CDM framework in comparison to the
cpuset + mbind() solution from user space as suggested by Mel.

> that you have an application which wants to benefit from CDM and use
> mbind to have an access to this memory for particular buffer. Now you
> try to run this application in a cpuset which doesn't include this node
> and now what? Cpuset will override the application policy so the buffer
> will never reach the requested node. At least not without even more
> hacks to cpuset handling. I really do not like that!

Right, it will not reach the CDM. The cpuset based solution was to
have the applications which want CDM in a cpuset that includes CDM
and all other applications/tasks in a cpuset that excludes CDM. CDM
aware applications can then set their own memory policies which *may*
include CDM, and that will be allowed as their cpuset contains the
nodes. But these two cpusets, one containing all the CDM nodes and
one without them, have to be maintained all the time. No kernel
cpuset hacks will be required.
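
To make the "set their own memory policies" step concrete, here is a
minimal sketch of how a CDM aware application might bind a single
buffer to one node with mbind(MPOL_BIND), leaving text, libraries and
every other mapping alone. It is only an illustration: nodemask_set()
and alloc_on_node() are made-up helper names, the node id is whatever
the CDM node turns out to be on a given system (how it is discovered is
not covered here), and the call is made through syscall(2) so the
example does not depend on libnuma. MPOL_BIND is defined locally with
its uapi value in case the header is not installed.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MPOL_BIND
#define MPOL_BIND 2	/* value from the kernel's uapi mempolicy header */
#endif

#define MASK_BITS  (8 * sizeof(unsigned long))
#define MASK_WORDS 16	/* hypothetical upper bound on the mask size */

/* Set bit 'node' in a nodemask stored as an array of unsigned long. */
int nodemask_set(unsigned long *mask, size_t words, unsigned int node)
{
	if (node / MASK_BITS >= words)
		return -1;
	mask[node / MASK_BITS] |= 1UL << (node % MASK_BITS);
	return 0;
}

/*
 * Map an anonymous buffer and bind only that range to 'node'.
 * Returns NULL if the mapping or the policy setup fails, for example
 * when the node is not allowed by the task's cpuset.
 */
void *alloc_on_node(size_t len, unsigned int node)
{
	unsigned long mask[MASK_WORDS] = { 0 };
	void *buf;

	if (nodemask_set(mask, MASK_WORDS, node))
		return NULL;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;

	/* mbind(addr, len, mode, nodemask, maxnode, flags) */
	if (syscall(SYS_mbind, buf, len, MPOL_BIND, mask,
		    MASK_WORDS * MASK_BITS, 0)) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}
```

An application would call something like alloc_on_node(size, cdm_node)
for the buffers it wants in device memory and use plain malloc()/mmap()
for everything else. In the cpuset based scheme this only works while
the task sits in the cpuset that contains the CDM nodes.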

> [...]
>> These are the reasons which prohibit the use of HMM for coherent
>> addressable device memory purpose.
> [...]
>> (3) Application cannot directly allocate into device memory from user
>> space using existing memory related system calls like mmap() and mbind()
>> as the device memory hides away in ZONE_DEVICE.
> Why cannot the application simply use mmap on the device file?

Yeah, that's possible, but then it does not go through the core VM any more.

>> Apart from that, CDM framework provides a different approach to device
>> memory representation which does not require special device memory kind
>> of handling and associated call backs as implemented by HMM. It provides
>> NUMA node based visibility to the user space which can be extended to
>> support new features.
> What do you mean by new features and how users will use/request those
> features (aka what is the API)?

I don't have plans for this right now. But what I meant was that once
the core VM understands CDM, even though it is represented as a NUMA
node, the existing APIs can be modified to accommodate special
functions for CDM memory.