Re: [PATCH V3 0/4] Define coherent device memory node

From: Balbir Singh
Date: Mon Feb 20 2017 - 21:58:19 EST


On Fri, 2017-02-17 at 09:33 +0000, Mel Gorman wrote:
> On Fri, Feb 17, 2017 at 09:14:44AM +1100, Balbir Singh wrote:
>
>
> > On 16/02/17 05:20, Mel Gorman wrote:
> > > On Wed, Feb 15, 2017 at 05:37:22PM +0530, Anshuman Khandual wrote:
> > > >Â This four patches define CDM node with HugeTLB & Buddy allocation
> > > > isolation. Please refer to the last RFC posting mentioned here for more
> > >Â
> > > Always include the background with the changelog itself. Do not assume that
> > > people are willing to trawl through a load of past postings to assemble
> > > the picture. I'm only taking a brief look because of the page allocator
> > > impact but it does not appear that previous feedback was addressed.
> > >Â
> > > In itself, the series does very little and as Vlastimil already pointed
> > > out, it's not a good idea to try merge piecemeal when people could not
> > > agree on the big picture (I didn't dig into it).
> > >Â
>
> > The idea of CDM is independent of how some of the other problems related
> > to AutoNUMA balancing are handled.
>
> What has Automatic NUMA balancing got to do with CDM?
>

The idea is to have a policy to determine (based on the RFC discussion) whether
CDM nodes should participate in NUMA balancing.

> Even if you're trying to draw a comparison between how the patches were
> developed in comparison to CDM, it's a poor example. Regardless of which
> generation of NUMA balancing implementation considered (there were three
> contenders), each of them was a working implementation that had a measurable
> impact on a number of workloads. In many cases, performance data was
> included. The instructions on how workloads could use it were clear even
> if there were disagreements on exactly what the tuning options should be.
> While the feature evolved over time and improved for different classes of
> workload, the first set of patches merged were functional.
>
> > The idea of this patchset was to introduce
> > the concept of memory that is not necessarily system memory, but is coherent
> > in terms of visibility/access with some restrictions
>
> Which should be done without special casing the page allocator, cpusets and
> special casing how cpusets are handled. It's not necessary for any other
> mechanism used to restrict access to portions of memory such as cpusets,
> mempolicies or even memblock reservations.

Agreed, I mentioned a limitation that we see with cpusets. I do agree that
we should reuse any infrastructure we have, but cpusets are more static,
in both their nature and their inheritance model, than CDM requires.

>
> > > The only reason I'm commenting at all is to say that I am extremely opposed
> > > to the changes made to the page allocator paths that are specific to
> > > CDM. It's been continual significant effort to keep the cost there down
> > > and this is a mess of special cases for CDM. The changes to hugetlb to
> > > identify "memory that is not really memory" with special casing is also
> > > quite horrible.
> > >Â
> > > It's completely unclear that even if one was to assume that CDM memory
> > > should be expressed as nodes why such systems do not isolate all processes
> > > from CDM nodes by default and then allow access via memory policies or
> > > cpusets instead of special casing the page allocator fast path. It's also
> > > completely unclear what happens if a device should then access the CDM
> > > and how that should be synchronised with the core, if that is even possible.
> > >Â
>
> > A big part of this is driven by the need to special case what allocations
> > go there. The idea being that an allocation should get there only when
> > explicitly requested.
>
> cpuset, mempolicy or mmap of a device file that mediates whether device
> or system memory is used. For the last option, I don't know the specifics
> but given that HMM worked on this for years, there should be ables of
> the considerations and complications that arise. I'm not familiar with
> the specifics.
>
> > Unfortunately, IIUC node distance is not a good
> > isolation metric.
>
> I don't recall suggesting that.

True, I am just saying :)

>
> > CPUsets are heavily driven by user space and we
> > believe that setting up CDM is not an administrative operation, it's
> > going to be hard for an administrator or user space application to set
> > up the right policy or an installer to figure it out.
>
> So by this design, an application is expected to know nothing about how
> to access CDM yet be CDM-aware?

A higher layer abstracts what/where the memory is. The memory is coherent
(CDM), but for performance it may be migrated. In some special cases
an aware application may request explicit allocation, in other cases an
unaware application may use it seamlessly and have its memory migrated.
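
As a rough illustration of the "only when explicitly requested" rule, here
is a minimal user-space sketch in C. It is not kernel code and is not taken
from the patch series; nodemask_t as a 64-bit mask, struct task_policy and
allowed_nodes() are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: one bit per NUMA node, up to 64 nodes. */
typedef uint64_t nodemask_t;
#define NODE(n) ((nodemask_t)1 << (n))

/* Hypothetical per-task policy; not a kernel structure. */
struct task_policy {
    bool explicit_request;   /* true only for a CDM-aware caller */
    nodemask_t requested;    /* nodes named in the explicit request */
};

/*
 * Nodes an allocation may be placed on. CDM nodes are excluded by
 * default and become eligible only through an explicit request.
 */
static nodemask_t allowed_nodes(nodemask_t online, nodemask_t cdm,
                                const struct task_policy *pol)
{
    assert((cdm & ~online) == 0);      /* CDM nodes must be online nodes */

    if (pol != NULL && pol->explicit_request)
        return pol->requested & online;

    return online & ~cdm;              /* system memory only */
}
```

With nodes 0-2 online and node 2 marked as CDM, an unaware task (NULL
policy) is limited to nodes 0 and 1, while a task that explicitly requests
node 2 gets it. Migration of an unaware task's memory is not modelled here.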

> The application is either aware of CDM or
> it isn't. It's either known how to access it or it does not.
>
> Even if it was a case that the arch layer provides hooks to alter the global
> nodemask and expose a special file of the CDM nodemask to userspace then
> it would still avoid special casing in the various allocators. It would
> not address the problem at all of how devices are meant to be informed
> that there is CDM memory with work to do but that has been raised elsewhere.
>
> > It does not help
> > that CPUSets assume inheritance from the root hierarchy. As far as the
> > overheads go, one could consider using STATIC_KEYS if that is worthwhile.
>
> Hiding the overhead in static keys could not change the fact that the various
> allocator paths should not need to be CDM-aware or special casing CDM when
> there already are existing mechanisms for avoiding regions of memory.
>


We don't want to hide things, but to make it zero-overhead for non-users. There
might be better ways of doing it. Thanks for the review!

Balbir Singh.