Re: [PATCH V3 0/4] Define coherent device memory node
From: Anshuman Khandual
Date: Fri Feb 17 2017 - 06:42:20 EST
On 02/15/2017 11:50 PM, Mel Gorman wrote:
> On Wed, Feb 15, 2017 at 05:37:22PM +0530, Anshuman Khandual wrote:
>> This four patches define CDM node with HugeTLB & Buddy allocation
>> isolation. Please refer to the last RFC posting mentioned here for more
> Always include the background with the changelog itself. Do not assume that
> people are willing to trawl through a load of past postings to assemble
> the picture. I'm only taking a brief look because of the page allocator
> impact but it does not appear that previous feedback was addressed.
Sure, I made a mistake. Will include the complete background from my
previous RFCs in the next version which will show the entire context
of this patch series. I have addressed the previous feedback regarding
cpuset enabled allocation leaks into CDM memory as pointed out by
Vlastimil Babka on the last version. Did I miss anything else inside
the Buddy allocator apart from that ?
> In itself, the series does very little and as Vlastimil already pointed
> out, it's not a good idea to try merge piecemeal when people could not
> agree on the big picture (I didn't dig into it).
With the proposed kernel changes and an associated driver, it is complete
enough to drive user space based CPU/device hybrid compute interchangeably
on an mmap() allocated memory buffer, transparently and effectively. I had also
mentioned these points on the last posting in response to a comment from
From this response (https://lkml.org/lkml/2017/2/14/50).
* User space using mbind() to get CDM memory is an additional benefit
we get by making CDM plug in as a node and be part of the buddy
allocator. But the overall idea from the user space point of view
is that the application can allocate any generic buffer and try to
use the buffer either from the CPU side or from the device without
knowing where the buffer is really mapped physically. That
gives a seamless and transparent view to user space where CPU
compute and possible device based compute can work together. This
is not possible through a driver allocated buffer.
* The placement of the buffer's memory can happen on system memory
when the CPU faults while accessing it. But a driver can manage the
migration between system RAM and CDM memory once the buffer is being
used from the CPU and the device interchangeably. As you have mentioned,
the driver will have more information about where each part of the buffer
should be placed at any point in time and it can make it happen with
migration. So both allocation and placement are decided by the driver
at runtime. CDM provides the framework for this kind of device
assisted compute and driver managed memory placement.
* If an application has not used the CDM memory placed on its buffer
for a long time and another application is forced to fall back on system
RAM when what it really wanted is CDM, the driver can detect this kind
of situation through memory access patterns on the device HW and
take the necessary migration decisions.
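To illustrate the user space side of the points above, here is a minimal sketch, not code from this series. The helper names alloc_generic_buffer() and cpu_touch() are hypothetical. It only shows that the application asks for a plain anonymous mmap() buffer without naming any node or device, and that pages get placed when the CPU first touches them. Any later migration to or from CDM would be the driver's job and is not shown.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: allocate a plain anonymous buffer.
 * Nothing here names a NUMA node, a device, or a driver.
 */
static void *alloc_generic_buffer(size_t len)
{
	void *buf;

	assert(len > 0);
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return buf == MAP_FAILED ? NULL : buf;
}

/*
 * Hypothetical helper: write one byte per page from the CPU so that
 * every page is faulted in. Returns the number of pages touched.
 */
static size_t cpu_touch(void *buf, size_t len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *p = buf;
	size_t touched = 0;

	for (size_t off = 0; off < len; off += page) {
		p[off] = 1;
		touched++;
	}
	return touched;
}
```

A caller would allocate with alloc_generic_buffer(), run cpu_touch() over the buffer, and release it with munmap() when done. Where the pages physically live stays invisible to the application in this model.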
I hope this explains the rationale of the framework. In fact, these
four patches give a logically complete CPU/device operating framework.
Other parts of the bigger picture are VMA management, KSM, Auto NUMA
etc., which are improvements on top of this basic framework.
> The only reason I'm commenting at all is to say that I am extremely opposed
> to the changes made to the page allocator paths that are specific to
> CDM. It's been continual significant effort to keep the cost there down
> and this is a mess of special cases for CDM. The changes to hugetlb to
> identify "memory that is not really memory" with special casing is also
> quite horrible.
We have already removed the O(n^2) search during zonelist iteration as
pointed out by Vlastimil and the current overhead is linear for the CDM
special case. We do similar checks for the cpuset function as well. Then
how is this horrible? On HugeTLB, we isolate CDM based on a resultant
(MEMORY - CDM) node_states element which identifies system memory
instead of all of the accessible memory and keep HugeTLB limited to
that nodemask. But if you feel there is a better approach, we
can definitely try it out.
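The (MEMORY - CDM) mask mentioned above can be modelled in user space as a simple bitmask subtraction. This is only a sketch under stated assumptions: the model_* names are hypothetical, one bit stands for one node, and at most 64 nodes are assumed. It is not the code from the patches, which operate on the kernel's nodemask_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model: one bit per NUMA node, at most 64 nodes. */
#define MODEL_MAX_NODES 64
typedef uint64_t model_nodemask_t;

/* Return a copy of mask with the given node set. */
static model_nodemask_t model_node_set(model_nodemask_t mask, unsigned int node)
{
	assert(node < MODEL_MAX_NODES);
	return mask | ((model_nodemask_t)1 << node);
}

/* Return 1 if the given node is present in mask, 0 otherwise. */
static int model_node_isset(model_nodemask_t mask, unsigned int node)
{
	assert(node < MODEL_MAX_NODES);
	return (int)((mask >> node) & 1u);
}

/* System memory nodes = all nodes with memory, minus the CDM nodes. */
static model_nodemask_t model_sysmem_nodes(model_nodemask_t memory,
					    model_nodemask_t cdm)
{
	return memory & ~cdm;
}
```

For example, with memory on nodes 0, 1 and 2 and node 2 being CDM, the resulting mask contains only nodes 0 and 1, and that is the set HugeTLB would be limited to in this model.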
> It's completely unclear that even if one was to assume that CDM memory
> should be expressed as nodes why such systems do not isolate all processes
> from CDM nodes by default and then allow access via memory policies or
> cpusets instead of special casing the page allocator fast path. It's also
> completely unclear what happens if a device should then access the CDM
> and how that should be synchronised with the core, if that is even possible.
I think Balbir has already commented on the cpuset part. Device and CPU
can consistently work on the common allocated buffer and HW takes care of
the access coherency.
> It's also unclear if this is even usable by an application in userspace
> at this point in time. If it is and the special casing is needed then the
Yeah, with the current CDM approach it's usable from user space as
> regions should be isolated from early mem allocations in the arch layer
> that is CDM aware, initialised late, and then setup userspace to isolate
> all but privileged applications from the CDM nodes. Do not litter the core
> with is_cdm_whatever checks.
I guess you are referring to reserving the entire CDM memory node with
memblock_reserve() and then having the arch manage the memory when user
space wants to use it through some sort of mmap, vm_ops methods. That defeats
the whole purpose of integrating CDM memory with core VM. I am afraid it
will also make migration between CDM memory and system memory difficult,
which is essential in making the whole hybrid compute operation
transparent to user space.
> At best this is incomplete because it does not look as if it could be used
> by anything properly and the fast path alterations are horrible even if
> it could be used. As it is, it should not be merged in my opinion.
I have mentioned in detail above how this much code change enables
us to use CDM in a transparent way from user space. Please do
let me know if it still does not make sense, and I will try again.
On the fast path changes issue, I can really understand your concern
from the performance point of view, as it has been achieved over a long
time. It would be great if you could suggest how to improve from here.