Re: [PATCH 2/2] mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask
From: Zi Yan
Date: Thu Oct 04 2018 - 17:51:47 EST
On 4 Oct 2018, at 16:17, David Rientjes wrote:
> On Wed, 26 Sep 2018, Kirill A. Shutemov wrote:
>
>> On Tue, Sep 25, 2018 at 02:03:26PM +0200, Michal Hocko wrote:
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index c3bc7e9c9a2a..c0bcede31930 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -629,21 +629,40 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>>>   * available
>>>   * never: never stall for any thp allocation
>>>   */
>>> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>>> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>>>  {
>>>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>>> +	gfp_t this_node = 0;
>>> +
>>> +#ifdef CONFIG_NUMA
>>> +	struct mempolicy *pol;
>>> +	/*
>>> +	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
>>> +	 * specified, to express a general desire to stay on the current
>>> +	 * node for optimistic allocation attempts. If the defrag mode
>>> +	 * and/or madvise hint requires the direct reclaim then we prefer
>>> +	 * to fallback to other node rather than node reclaim because that
>>> +	 * can lead to excessive reclaim even though there is free memory
>>> +	 * on other nodes. We expect that NUMA preferences are specified
>>> +	 * by memory policies.
>>> +	 */
>>> +	pol = get_vma_policy(vma, addr);
>>> +	if (pol->mode != MPOL_BIND)
>>> +		this_node = __GFP_THISNODE;
>>> +	mpol_cond_put(pol);
>>> +#endif
>>
>> I'm not very good with NUMA policies. Could you explain in more detail how
>> the code above is equivalent to the code below?
>>
>
> It breaks mbind() because new_page() is now using numa_node_id() to
> allocate migration targets instead of using the mempolicy. I'm not
> sure that this patch was tested for mbind().

I do not see that mbind() is broken. With both patches applied, I ran
"numactl -N 0 memhog -r1 4096m membind 1" and saw that all pages were
allocated on Node 1, not on Node 0, which is the node numa_node_id() returns.

From the source code, in alloc_pages_vma(), the nodemask is generated
from the memory policy (i.e. mbind in the case above), so it contains only
the nodes specified by mbind(). Then, __alloc_pages_nodemask() only uses
the zones from that nodemask. The numa_node_id() return value is therefore
ignored in the actual page allocation when an mbind policy is applied.

Let me know if I missed anything.

--
Best Regards
Yan Zi