Re: [RFC V2 03/12] mm: Change generic FALLBACK zonelist creation process

From: Anshuman Khandual
Date: Wed Feb 01 2017 - 01:41:18 EST


On 01/31/2017 07:27 AM, Dave Hansen wrote:
> On 01/30/2017 05:36 PM, Anshuman Khandual wrote:
>>> Let's say we had a CDM node with 100x more RAM than the rest of the
>>> system and it was just as fast as the rest of the RAM. Would we still
>>> want it isolated like this? Or would we want a different policy?
>>
>> But then the other argument being, dont we want to keep this 100X more
>> memory isolated for some special purpose to be utilized by specific
>> applications ?
>
> I was thinking that in this case, we wouldn't even want to bother with
> having "system RAM" in the fallback lists. A device who got its memory

System RAM is in the fallback list of the CDM node for the following
purpose.

If the user asks explicitly through mbind() and there is insufficient
memory on the CDM node to fulfill the request. Then it is better to
fallback on a system RAM memory node than to fail the request. This is
in line with expectations from the mbind() call. There are other ways
for the user space like /proc/pid/numa_maps to query about from where
exactly a given page has come from in the runtime.

But keeping options open I have noted down this in the cover letter.

"
FALLBACK zonelist creation:

CDM node's FALLBACK zonelist can also be changed to accommodate other CDM
memory zones along with system RAM zones in which case they can be used as
fallback options instead of first depending on the system RAM zones when
it's own memory falls insufficient during allocation.
"

> usage off by 1% could start to starve the rest of the system. A sane

Did not get this point. Could you please elaborate more on this ?

> policy in this case might be to isolate the "system RAM" from the device's.

Hmm.

>
>>> Why do we need this hard-coded along with the cpuset stuff later in the
>>> series. Doesn't taking a node out of the cpuset also take it out of the
>>> fallback lists?
>>
>> There are two mutually exclusive approaches which are described in
>> this patch series.
>>
>> (1) zonelist modification based approach
>> (2) cpuset restriction based approach
>>
>> As mentioned in the cover letter,
>
> Well, I'm glad you coded both of them up, but now that we have them how
> to we pick which one to throw to the wolves? Or, do we just merge both
> of them and let one bitrot? ;)

I am just trying to see how each isolation method stack up from benefit
and cost point of view, so that we can have informed debate about their
individual merit. Meanwhile I have started looking at if the core buddy
allocator __alloc_pages_nodemask() and its interaction with nodemask at
various stages can also be modified to implement the intended solution.