Re: [PATCH 2/2] mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask
From: Michal Hocko
Date: Mon Oct 22 2018 - 09:31:03 EST
On Mon 22-10-18 15:15:38, Vlastimil Babka wrote:
> On 9/26/18 4:22 PM, Michal Hocko wrote:
> > On Wed 26-09-18 16:17:08, Michal Hocko wrote:
> >> On Wed 26-09-18 16:30:39, Kirill A. Shutemov wrote:
> >>> On Tue, Sep 25, 2018 at 02:03:26PM +0200, Michal Hocko wrote:
> >>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>>> index c3bc7e9c9a2a..c0bcede31930 100644
> >>>> --- a/mm/huge_memory.c
> >>>> +++ b/mm/huge_memory.c
> >>>> @@ -629,21 +629,40 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
> >>>> * available
> >>>> * never: never stall for any thp allocation
> >>>> */
> >>>> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> >>>> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
> >>>> {
> >>>> const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> >>>> + gfp_t this_node = 0;
> >>>> +
> >>>> +#ifdef CONFIG_NUMA
> >>>> + struct mempolicy *pol;
> >>>> + /*
> >>>> + * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
> >>>> + * specified, to express a general desire to stay on the current
> >>>> + * node for optimistic allocation attempts. If the defrag mode
> >>>> + * and/or madvise hint requires the direct reclaim then we prefer
> >>>> + * to fallback to other node rather than node reclaim because that
> >>>> + * can lead to excessive reclaim even though there is free memory
> >>>> + * on other nodes. We expect that NUMA preferences are specified
> >>>> + * by memory policies.
> >>>> + */
> >>>> + pol = get_vma_policy(vma, addr);
> >>>> + if (pol->mode != MPOL_BIND)
> >>>> + this_node = __GFP_THISNODE;
> >>>> + mpol_cond_put(pol);
> >>>> +#endif
> >>>
> >>> I'm not very good with NUMA policies. Could you explain in more detail how
> >>> the code above is equivalent to the code below?
> >>
> >> MPOL_PREFERRED is handled by policy_node() before we call
> >> __alloc_pages_nodemask. __GFP_THISNODE is applied only when we are not
> >> using __GFP_DIRECT_RECLAIM, which is now handled in
> >> alloc_hugepage_direct_gfpmask. Lastly, MPOL_BIND wasn't handled
> >> explicitly, but in the end the removed late check would drop
> >> __GFP_THISNODE for it as well. So in the end we are doing the same
> >> thing, unless I am missing something.
> >
> > Forgot to add. One notable exception would be that the previous code
> > could hit
> > WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> > in policy_node() if the requested node (e.g. the CPU-local one) was
> > outside of the mbind nodemask. This is not possible now. We haven't
> > heard about any such warning yet, so it is unlikely to happen though.
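
Just for context, policy_node() with that warning looks roughly like this
(paraphrased from memory, so the exact wording may differ from the source):

	static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
	{
		if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL)) {
			nd = policy->v.preferred_node;
		} else {
			/*
			 * __GFP_THISNODE makes no sense together with
			 * MPOL_BIND, hence the warning.
			 */
			WARN_ON_ONCE(policy->mode == MPOL_BIND &&
				     (gfp & __GFP_THISNODE));
		}

		return nd;
	}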
>
> I don't think the previous code could hit the warning, as the hugepage
> path that would add __GFP_THISNODE didn't call policy_node() (containing
> the warning) at all. IIRC an early version of your patch did hit the
> warning though, which is why you added the MPOL_BIND policy check.
Are you sure? What prevents node_isset(node, policy_nodemask()) == false
and a fallback to the !huge allocation path? alloc_pages_vma is usually
called with the local node, and processes shouldn't run off their bound
NUMA mask, but is that guaranteed? Moreover, do_huge_pmd_wp_page_fallback
uses the former NUMA binding, and that might be outside of the policy
mask.
In any case, as I've said, this is highly unlikely to hit, which is
underlined by the lack of reports.
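
To make it easier to see what I mean, the hugepage special case in
alloc_pages_vma() that the patch removes looked roughly like this (quoting
from memory, so it may not match the source exactly):

	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
		int hpage_node = node;

		if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
			hpage_node = pol->v.preferred_node;

		nmask = policy_nodemask(gfp, pol);
		/*
		 * Only the __GFP_THISNODE attempt is done here. If the
		 * local/preferred node is not in the policy nodemask we
		 * fall through to the regular (!huge) allocation path.
		 */
		if (!nmask || node_isset(hpage_node, *nmask)) {
			mpol_cond_put(pol);
			page = __alloc_pages_node(hpage_node,
						  gfp | __GFP_THISNODE, order);
			goto out;
		}
	}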
--
Michal Hocko
SUSE Labs