Re: [PATCH v3 09/14] mm: thp: check pmd migration entry in common path

From: Zi Yan
Date: Thu Feb 09 2017 - 12:37:02 EST


On 9 Feb 2017, at 3:16, Naoya Horiguchi wrote:

> On Sun, Feb 05, 2017 at 11:12:47AM -0500, Zi Yan wrote:
>> From: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
>>
>> If one of callers of page migration starts to handle thp,
>> memory management code start to see pmd migration entry, so we need
>> to prepare for it before enabling. This patch changes various code
>> point which checks the status of given pmds in order to prevent race
>> between thp migration and the pmd-related works.
>>
>> ChangeLog v1 -> v2:
>> - introduce pmd_related() (I know the naming is not good, but can't
>> think up no better name. Any suggesntion is welcomed.)
>>
>> Signed-off-by: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
>>
>> ChangeLog v2 -> v3:
>> - add is_swap_pmd()
>> - a pmd entry should be is_swap_pmd(), pmd_trans_huge(), pmd_devmap(),
>> or pmd_none()
>
> (nitpick) ... or normal pmd pointing to pte pages?

Sure, I will add it.

>
>> - use pmdp_huge_clear_flush() instead of pmdp_huge_get_and_clear()
>> - flush_cache_range() while set_pmd_migration_entry()
>> - pmd_none_or_trans_huge_or_clear_bad() and pmd_trans_unstable() return
>> true on pmd_migration_entry, so that migration entries are not
>> treated as pmd page table entries.
>>
>> Signed-off-by: Zi Yan <zi.yan@xxxxxxxxxxxxxx>
>> ---
>> arch/x86/mm/gup.c | 4 +--
>> fs/proc/task_mmu.c | 22 ++++++++-----
>> include/asm-generic/pgtable.h | 71 ----------------------------------------
>> include/linux/huge_mm.h | 21 ++++++++++--
>> include/linux/swapops.h | 74 +++++++++++++++++++++++++++++++++++++++++
>> mm/gup.c | 20 ++++++++++--
>> mm/huge_memory.c | 76 ++++++++++++++++++++++++++++++++++++-------
>> mm/madvise.c | 2 ++
>> mm/memcontrol.c | 2 ++
>> mm/memory.c | 9 +++--
>> mm/memory_hotplug.c | 13 +++++++-
>> mm/mempolicy.c | 1 +
>> mm/mprotect.c | 6 ++--
>> mm/mremap.c | 2 +-
>> mm/pagewalk.c | 2 ++
>> 15 files changed, 221 insertions(+), 104 deletions(-)
>>
>> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
>> index 0d4fb3ebbbac..78a153d90064 100644
>> --- a/arch/x86/mm/gup.c
>> +++ b/arch/x86/mm/gup.c
>> @@ -222,9 +222,9 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>> pmd_t pmd = *pmdp;
>>
>> next = pmd_addr_end(addr, end);
>> - if (pmd_none(pmd))
>> + if (!pmd_present(pmd))
>> return 0;
>> - if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
>> + if (unlikely(pmd_large(pmd))) {
>> /*
>> * NUMA hinting faults need to be handled in the GUP
>> * slowpath for accounting purposes and so that they
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 6c07c7813b26..1e64d6898c68 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -596,7 +596,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>
>> ptl = pmd_trans_huge_lock(pmd, vma);
>> if (ptl) {
>> - smaps_pmd_entry(pmd, addr, walk);
>> + if (pmd_present(*pmd))
>> + smaps_pmd_entry(pmd, addr, walk);
>> spin_unlock(ptl);
>> return 0;
>> }
>> @@ -929,6 +930,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>> goto out;
>> }
>>
>> + if (!pmd_present(*pmd))
>> + goto out;
>> +
>> page = pmd_page(*pmd);
>>
>> /* Clear accessed and referenced bits. */
>> @@ -1208,19 +1212,19 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
>> if (ptl) {
>> u64 flags = 0, frame = 0;
>> pmd_t pmd = *pmdp;
>> + struct page *page;
>>
>> if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
>> flags |= PM_SOFT_DIRTY;
>>
>> - /*
>> - * Currently pmd for thp is always present because thp
>> - * can not be swapped-out, migrated, or HWPOISONed
>> - * (split in such cases instead.)
>> - * This if-check is just to prepare for future implementation.
>> - */
>> - if (pmd_present(pmd)) {
>> - struct page *page = pmd_page(pmd);
>> + if (is_pmd_migration_entry(pmd)) {
>> + swp_entry_t entry = pmd_to_swp_entry(pmd);
>>
>> + frame = swp_type(entry) |
>> + (swp_offset(entry) << MAX_SWAPFILES_SHIFT);
>> + page = migration_entry_to_page(entry);
>> + } else if (pmd_present(pmd)) {
>> + page = pmd_page(pmd);
>> if (page_mapcount(page) == 1)
>> flags |= PM_MMAP_EXCLUSIVE;
>>
>> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
>> index b71a431ed649..6cf9e9b5a7be 100644
>> --- a/include/asm-generic/pgtable.h
>> +++ b/include/asm-generic/pgtable.h
>> @@ -726,77 +726,6 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
>> #ifndef arch_needs_pgtable_deposit
>> #define arch_needs_pgtable_deposit() (false)
>> #endif
>> -/*
>> - * This function is meant to be used by sites walking pagetables with
>> - * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
>> - * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd
>> - * into a null pmd and the transhuge page fault can convert a null pmd
>> - * into an hugepmd or into a regular pmd (if the hugepage allocation
>> - * fails). While holding the mmap_sem in read mode the pmd becomes
>> - * stable and stops changing under us only if it's not null and not a
>> - * transhuge pmd. When those races occurs and this function makes a
>> - * difference vs the standard pmd_none_or_clear_bad, the result is
>> - * undefined so behaving like if the pmd was none is safe (because it
>> - * can return none anyway). The compiler level barrier() is critically
>> - * important to compute the two checks atomically on the same pmdval.
>> - *
>> - * For 32bit kernels with a 64bit large pmd_t this automatically takes
>> - * care of reading the pmd atomically to avoid SMP race conditions
>> - * against pmd_populate() when the mmap_sem is hold for reading by the
>> - * caller (a special atomic read not done by "gcc" as in the generic
>> - * version above, is also needed when THP is disabled because the page
>> - * fault can populate the pmd from under us).
>> - */
>> -static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
>> -{
>> - pmd_t pmdval = pmd_read_atomic(pmd);
>> - /*
>> - * The barrier will stabilize the pmdval in a register or on
>> - * the stack so that it will stop changing under the code.
>> - *
>> - * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
>> - * pmd_read_atomic is allowed to return a not atomic pmdval
>> - * (for example pointing to an hugepage that has never been
>> - * mapped in the pmd). The below checks will only care about
>> - * the low part of the pmd with 32bit PAE x86 anyway, with the
>> - * exception of pmd_none(). So the important thing is that if
>> - * the low part of the pmd is found null, the high part will
>> - * be also null or the pmd_none() check below would be
>> - * confused.
>> - */
>> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> - barrier();
>> -#endif
>> - if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
>> - return 1;
>> - if (unlikely(pmd_bad(pmdval))) {
>> - pmd_clear_bad(pmd);
>> - return 1;
>> - }
>> - return 0;
>> -}
>> -
>> -/*
>> - * This is a noop if Transparent Hugepage Support is not built into
>> - * the kernel. Otherwise it is equivalent to
>> - * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
>> - * places that already verified the pmd is not none and they want to
>> - * walk ptes while holding the mmap sem in read mode (write mode don't
>> - * need this). If THP is not enabled, the pmd can't go away under the
>> - * code even if MADV_DONTNEED runs, but if THP is enabled we need to
>> - * run a pmd_trans_unstable before walking the ptes after
>> - * split_huge_page_pmd returns (because it may have run when the pmd
>> - * become null, but then a page fault can map in a THP and not a
>> - * regular page).
>> - */
>> -static inline int pmd_trans_unstable(pmd_t *pmd)
>> -{
>> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> - return pmd_none_or_trans_huge_or_clear_bad(pmd);
>> -#else
>> - return 0;
>> -#endif
>> -}
>>
>> #ifndef CONFIG_NUMA_BALANCING
>> /*
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 83a8d42f9d55..c2e5a4eab84a 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -131,7 +131,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> #define split_huge_pmd(__vma, __pmd, __address) \
>> do { \
>> pmd_t *____pmd = (__pmd); \
>> - if (pmd_trans_huge(*____pmd) \
>> + if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd) \
>> || pmd_devmap(*____pmd)) \
>> __split_huge_pmd(__vma, __pmd, __address, \
>> false, NULL); \
>> @@ -162,12 +162,18 @@ extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
>> struct vm_area_struct *vma);
>> extern spinlock_t *__pud_trans_huge_lock(pud_t *pud,
>> struct vm_area_struct *vma);
>> +
>> +static inline int is_swap_pmd(pmd_t pmd)
>> +{
>> + return !pmd_none(pmd) && !pmd_present(pmd);
>> +}
>> +
>> /* mmap_sem must be held on entry */
>> static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
>> struct vm_area_struct *vma)
>> {
>> VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
>> - if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
>> + if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
>> return __pmd_trans_huge_lock(pmd, vma);
>> else
>> return NULL;
>> @@ -192,6 +198,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
>> pmd_t *pmd, int flags);
>> struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>> pud_t *pud, int flags);
>> +static inline int hpage_order(struct page *page)
>> +{
>> + if (unlikely(PageTransHuge(page)))
>> + return HPAGE_PMD_ORDER;
>> + return 0;
>> +}
>>
>> extern int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
>>
>> @@ -232,6 +244,7 @@ static inline bool thp_migration_supported(void)
>> #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
>>
>> #define hpage_nr_pages(x) 1
>> +#define hpage_order(x) 0
>>
>> #define transparent_hugepage_enabled(__vma) 0
>>
>> @@ -274,6 +287,10 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
>> long adjust_next)
>> {
>> }
>> +static inline int is_swap_pmd(pmd_t pmd)
>> +{
>> + return 0;
>> +}
>> static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
>> struct vm_area_struct *vma)
>> {
>> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
>> index 6625bea13869..50e4aa7e7ff9 100644
>> --- a/include/linux/swapops.h
>> +++ b/include/linux/swapops.h
>> @@ -229,6 +229,80 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>> }
>> #endif
>>
>> +/*
>> + * This function is meant to be used by sites walking pagetables with
>> + * the mmap_sem hold in read mode to protect against MADV_DONTNEED and
>> + * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd
>> + * into a null pmd and the transhuge page fault can convert a null pmd
>> + * into an hugepmd or into a regular pmd (if the hugepage allocation
>> + * fails). While holding the mmap_sem in read mode the pmd becomes
>> + * stable and stops changing under us only if it's not null and not a
>> + * transhuge pmd. When those races occurs and this function makes a
>> + * difference vs the standard pmd_none_or_clear_bad, the result is
>> + * undefined so behaving like if the pmd was none is safe (because it
>> + * can return none anyway). The compiler level barrier() is critically
>> + * important to compute the two checks atomically on the same pmdval.
>> + *
>> + * For 32bit kernels with a 64bit large pmd_t this automatically takes
>> + * care of reading the pmd atomically to avoid SMP race conditions
>> + * against pmd_populate() when the mmap_sem is hold for reading by the
>> + * caller (a special atomic read not done by "gcc" as in the generic
>> + * version above, is also needed when THP is disabled because the page
>> + * fault can populate the pmd from under us).
>> + */
>> +static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
>> +{
>> + pmd_t pmdval = pmd_read_atomic(pmd);
>> + /*
>> + * The barrier will stabilize the pmdval in a register or on
>> + * the stack so that it will stop changing under the code.
>> + *
>> + * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
>> + * pmd_read_atomic is allowed to return a not atomic pmdval
>> + * (for example pointing to an hugepage that has never been
>> + * mapped in the pmd). The below checks will only care about
>> + * the low part of the pmd with 32bit PAE x86 anyway, with the
>> + * exception of pmd_none(). So the important thing is that if
>> + * the low part of the pmd is found null, the high part will
>> + * be also null or the pmd_none() check below would be
>> + * confused.
>> + */
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> + barrier();
>> +#endif
>> + if (pmd_none(pmdval) || pmd_trans_huge(pmdval)
>> + || is_pmd_migration_entry(pmdval))
>> + return 1;
>> + if (unlikely(pmd_bad(pmdval))) {
>> + pmd_clear_bad(pmd);
>> + return 1;
>> + }
>> + return 0;
>> +}
>> +
>> +/*
>> + * This is a noop if Transparent Hugepage Support is not built into
>> + * the kernel. Otherwise it is equivalent to
>> + * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
>> + * places that already verified the pmd is not none and they want to
>> + * walk ptes while holding the mmap sem in read mode (write mode don't
>> + * need this). If THP is not enabled, the pmd can't go away under the
>> + * code even if MADV_DONTNEED runs, but if THP is enabled we need to
>> + * run a pmd_trans_unstable before walking the ptes after
>> + * split_huge_page_pmd returns (because it may have run when the pmd
>> + * become null, but then a page fault can map in a THP and not a
>> + * regular page).
>> + */
>> +static inline int pmd_trans_unstable(pmd_t *pmd)
>> +{
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> + return pmd_none_or_trans_huge_or_clear_bad(pmd);
>> +#else
>> + return 0;
>> +#endif
>> +}
>> +
>> +
>
> These functions are page table or thp matter, so putting them in swapops.h
> looks weird to me. Maybe you can avoid this code transfer by using !pmd_present
> instead of is_pmd_migration_entry?
> And we have to consider renaming pmd_none_or_trans_huge_or_clear_bad(),
> I like a simple name like __pmd_trans_unstable(), but if you have an idea,
> that's great.

Yes. I will move it back.

I am not sure if it is OK to only use !pmd_present. We may miss some pmd_bad.

Kirill and Andrea, can you give some insight on this?


>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 19b460acb5e1..9cb4c83151a8 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>
> Changes on mm/memory_hotplug.c should be with patch 14/14?
> # If that's right, definition of hpage_order() also should go to 14/14.

Got it. I will move it.


--
Best Regards
Yan Zi

Attachment: signature.asc
Description: OpenPGP digital signature