Re: [RFC 4/4] mm: Ignore cpuset enforcement when allocation flag has __GFP_THISNODE

From: Anshuman Khandual
Date: Wed Nov 30 2016 - 06:17:30 EST


On 11/29/2016 10:22 PM, Dave Hansen wrote:
> On 11/28/2016 10:51 PM, Anshuman Khandual wrote:
>> On 11/29/2016 02:42 AM, Dave Hansen wrote:
>>>> On 11/22/2016 06:19 AM, Anshuman Khandual wrote:
>>>>>> --- a/mm/page_alloc.c
>>>>>> +++ b/mm/page_alloc.c
>>>>>> @@ -3715,7 +3715,7 @@ struct page *
>>>>>> .migratetype = gfpflags_to_migratetype(gfp_mask),
>>>>>> };
>>>>>>
>>>>>> - if (cpusets_enabled()) {
>>>>>> + if (cpusets_enabled() && !(alloc_mask & __GFP_THISNODE)) {
>>>>>> alloc_mask |= __GFP_HARDWALL;
>>>>>> alloc_flags |= ALLOC_CPUSET;
>>>>>> if (!ac.nodemask)
>>>>
>>>> This means now that any __GFP_THISNODE allocation can "escape" the
>>>> cpuset. That seems like a pretty major change to how cpusets work. Do
>>>> we know that *ALL* __GFP_THISNODE allocations are truly lacking in a
>>>> cpuset context that can be enforced?
>> Right, I know it's a very blunt change. The cpuset based isolation of
>> the coherent device node for user space tasks has the side effect
>> that a driver or even the kernel cannot allocate memory from the coherent
> ...
>
> Well, we have __GFP_HARDWALL:
>
> * __GFP_HARDWALL enforces the cpuset memory allocation policy.
>
> which you can clear in the places where you want to do an allocation but
> want to ignore cpusets. But, __cpuset_node_allowed() looks like it gets
> a little funky if you do that since it would probably be falling back to
> the root cpuset that also would not have the new node in mems_allowed.

Right, but what is the rationale behind this? This is what the in-code
documentation for __cpuset_node_allowed() says:

* GFP_KERNEL - any node in enclosing hardwalled cpuset ok

If the allocation has requested GFP_KERNEL, should it not look at the
entire system for memory? Does the cpuset still have to be enforced?
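
For context, the relevant logic in __cpuset_node_allowed() is roughly
the following (a paraphrased sketch of kernel/cpuset.c, with locking
and the OOM/exiting-task special cases elided):

	/* Paraphrased sketch of __cpuset_node_allowed(), kernel/cpuset.c */
	bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
	{
		struct cpuset *cs;

		if (in_interrupt())
			return true;
		if (node_isset(node, current->mems_allowed))
			return true;
		if (gfp_mask & __GFP_HARDWALL)	/* GFP_USER style: stop here */
			return false;

		/*
		 * !__GFP_HARDWALL (e.g. GFP_KERNEL): scan up to the nearest
		 * mem_exclusive/mem_hardwall ancestor, not the whole system.
		 */
		cs = nearest_hardwall_ancestor(task_cs(current));
		return node_isset(node, cs->mems_allowed);
	}

So even GFP_KERNEL only widens the search to the nearest hardwalled
ancestor's mems_allowed; if that ancestor does not carry the CDM node
either, the allocation still fails, which I guess is the funky fallback
you are describing.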

>
> What exactly are the kernel-internal places that need to allocate from
> the coherent device node? When would this be done out of the context of
> an application *asking* for memory in the new node?

The primary user right now is a driver that wants to move mapped pages
of an application between system RAM and CDM nodes. If the application
has requested it through an ioctl(), the destination pages will be
allocated on the CDM *in* the task's context during the migration.
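
Roughly along these lines (a minimal sketch; cdm_new_page() and
cdm_migrate_to_node() are hypothetical driver glue, while
migrate_pages(), __alloc_pages_node() and __GFP_THISNODE are the
existing interfaces):

	#include <linux/gfp.h>
	#include <linux/migrate.h>

	/*
	 * Hypothetical destination allocator passed to migrate_pages().
	 * __GFP_THISNODE pins the allocation to the CDM node and, with
	 * this patch, would also skip the cpuset hardwall check.
	 */
	static struct page *cdm_new_page(struct page *page,
					 unsigned long private, int **result)
	{
		int cdm_node = (int)private;

		return __alloc_pages_node(cdm_node,
					  GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
					  0);
	}

	/* Called from the driver's ioctl handler, i.e. in task context */
	static int cdm_migrate_to_node(struct list_head *pagelist, int cdm_node)
	{
		return migrate_pages(pagelist, cdm_new_page, NULL,
				     (unsigned long)cdm_node, MIGRATE_SYNC,
				     MR_SYSCALL);
	}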

The driver could also have scheduled migration chunks on a work queue
which execute later on. IIUC, those executions and the corresponding
allocations into the CDM node will be *out* of the task's context.
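
Schematically something like this (again, all the cdm_* names are
hypothetical, reusing cdm_new_page() from the sketch above):

	#include <linux/workqueue.h>
	#include <linux/slab.h>

	/* Hypothetical deferred-migration request queued by the driver */
	struct cdm_migrate_work {
		struct work_struct work;
		struct list_head pagelist;
		int cdm_node;
	};

	static void cdm_migrate_workfn(struct work_struct *work)
	{
		struct cdm_migrate_work *mw =
			container_of(work, struct cdm_migrate_work, work);

		/*
		 * Runs in a kworker thread, so current->mems_allowed here
		 * belongs to the worker, not to the application that owns
		 * the pages -- the allocation is *out* of task context.
		 */
		migrate_pages(&mw->pagelist, cdm_new_page, NULL,
			      (unsigned long)mw->cdm_node, MIGRATE_SYNC,
			      MR_SYSCALL);
		kfree(mw);
	}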

Ideally we are looking for both scenarios to work, which they don't
right now.