Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process

From: Anshuman Khandual
Date: Mon Jan 30 2017 - 20:36:45 EST

On 01/30/2017 11:04 PM, Dave Hansen wrote:
> On 01/29/2017 07:35 PM, Anshuman Khandual wrote:
>> * CDM node's zones are not part of any other node's FALLBACK zonelist
>> * CDM node's FALLBACK list contains it's own memory zones followed by
>> all system RAM zones in regular order as before
>> * CDM node's zones are part of it's own NOFALLBACK zonelist
> This seems like a sane policy for the system that you're describing.
> But, it's still a policy, and it's rather hard-coded into the kernel.

Right. In the original RFC which I had posted in October, I had thought
about this issue and created 'pglist_data->coherent_device' as a u64
element where each bit in the mask can indicate a specific policy request
for the hot plugged coherent device. But it looked too complicated in
for the moment in absence of other potential coherent memory HW which
really requires anything other than isolation and explicit allocation

> Let's say we had a CDM node with 100x more RAM than the rest of the
> system and it was just as fast as the rest of the RAM. Would we still
> want it isolated like this? Or would we want a different policy?

Though in this particular case this CDM can be hot plugged into the
system as a normal NUMA node (I dont see any reason why it should
not be treated as normal NUMA node) but I do understand the need
for different policy requirements for different kind of coherent

But then the other argument being, dont we want to keep this 100X more
memory isolated for some special purpose to be utilized by specific
applications ?

There is a sense that if the non system RAM memory is coherent and
similar there cannot be much differences between what they would
expect from the kernel.

> Why do we need this hard-coded along with the cpuset stuff later in the
> series. Doesn't taking a node out of the cpuset also take it out of the
> fallback lists?

There are two mutually exclusive approaches which are described in
this patch series.

(1) zonelist modification based approach
(2) cpuset restriction based approach

As mentioned in the cover letter,

NOTE: These two set of patches mutually exclusive of each other and
represent two different approaches. Only one of these sets should be
applied at any point of time.

mm: Change generic FALLBACK zonelist creation process
mm: Change mbind(MPOL_BIND) implementation for CDM nodes

cpuset: Add cpuset_inc() inside cpuset_init()
mm: Exclude CDM nodes from task->mems_allowed and root cpuset
mm: Ignore cpuset enforcement when allocation flag has __GFP_THISNODE

>> while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
>> + /*
>> + * CDM node's own zones should not be part of any other
>> + * node's fallback zonelist but only it's own fallback
>> + * zonelist.
>> + */
>> + if (is_cdm_node(node) && (pgdat->node_id != node))
>> + continue;
>> +#endif
> On a superficial note: Isn't that #ifdef unnecessary? is_cdm_node() has
> a 'return 0' stub when the config option is off anyway.

Right, will fix it up.