Re: [RFC PATCH v2] mm: support multi-size THP numa balancing

From: Huang, Ying
Date: Mon Mar 18 2024 - 02:19:08 EST


Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx> writes:

> Anonymous page allocation already supports multi-size THP (mTHP), but
> NUMA balancing still prohibits mTHP migration even when the mapping is
> exclusive, which is unreasonable. Thus let's start by supporting NUMA
> balancing for exclusively mapped mTHP.
>
> Allow scanning mTHP:
> Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section
> pages") skips NUMA migration of shared CoW pages to avoid migrating shared
> data segments. In addition, commit 80d47f5de5e3 ("mm: don't try to
> NUMA-migrate COW pages that have other uses") changed to using page_count()
> to avoid migrating GUP pages, which also skips mTHP NUMA scanning.
> Theoretically, we can use folio_maybe_dma_pinned() to detect GUP pages
> instead; although there is still a GUP race window, the issue seems to
> have been resolved by commit 80d47f5de5e3. Meanwhile, use
> folio_estimated_sharers() to skip shared CoW pages, though this is not a
> precise sharers count. To check whether the folio is shared, ideally we
> want to make sure every page is mapped by the same process, but doing
> that seems expensive; using the estimated mapcount seems to work when
> running the autonuma benchmark.
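>
> As a reminder of why this is only an estimate: folio_estimated_sharers()
> just samples the mapcount of the folio's first page (sketch of the
> current definition in include/linux/mm.h, modulo mm-unstable churn):
>
>     static inline int folio_estimated_sharers(struct folio *folio)
>     {
>             return page_mapcount(folio_page(folio, 0));
>     }
>
> So a folio whose first page is mapped once, but whose other pages are
> shared, can still be reported as exclusive.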
>
> Allow migrating mTHP:
> As mentioned in the previous thread[1], large folios are more susceptible
> to false sharing issues, leading to pages ping-ponging back and forth
> during NUMA balancing, which is currently hard to resolve. Therefore, as
> a start to support mTHP NUMA balancing, only exclusive mappings are
> allowed to perform NUMA migration, to avoid the false sharing issues with
> large folios. Similarly, use the estimated mapcount to skip shared
> mappings, which seems to work in most cases (?); we already use
> folio_estimated_sharers() to skip shared mappings in
> migrate_misplaced_folio() for NUMA balancing, with no real complaints so
> far, as the existing check below shows.
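>
> For reference, the existing check in migrate_misplaced_folio() looks
> roughly like this (paraphrased from mm/migrate.c; the exact condition
> may differ in mm-unstable):
>
>     /*
>      * Don't migrate file folios that are mapped in multiple processes
>      * with execute permissions as they are probably shared libraries.
>      */
>     if (folio_estimated_sharers(folio) != 1 && folio_is_file_lru(folio) &&
>         (vma->vm_flags & VM_EXEC))
>             goto out;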

IIUC, folio_estimated_sharers() cannot identify multi-threaded
applications. If an mTHP is shared by multiple threads in one
process, how do we deal with that?

For example, I think that we should avoid migrating on the first fault
for mTHP in should_numa_migrate_memory(), something like the sketch
below.
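
A completely untested sketch against v6.8 kernel/sched/fair.c, placed
after last_cpupid has been read via folio_xchg_last_cpupid():

	/*
	 * Don't migrate a large folio on its very first hint fault,
	 * i.e. when no previous faulting CPU/PID has been recorded
	 * for it yet.
	 */
	if (folio_test_large(folio) && cpupid_pid_unset(last_cpupid))
		return false;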

More thoughts? Can we add a field in struct folio for mTHP to count
hint page faults from the same node, along the lines of the (purely
illustrative) sketch below?
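
Just to illustrate the idea; neither the fields nor the threshold
exist today:

	/*
	 * Hypothetical helper: only migrate a large folio once enough
	 * hint faults have been seen from one node. "numa_fault_nid",
	 * "numa_fault_count" and the threshold are all made up here.
	 */
	static bool mthp_hint_faults_settled(struct folio *folio, int nid)
	{
		if (folio->numa_fault_nid != nid) {
			folio->numa_fault_nid = nid;
			folio->numa_fault_count = 1;
			return false;
		}
		return ++folio->numa_fault_count >= folio_nr_pages(folio) / 2;
	}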

--
Best Regards,
Huang, Ying

> Performance data:
> Machine environment: 2 nodes, 128 cores Intel(R) Xeon(R) Platinum
> Base: 2024-3-15 mm-unstable branch
> Enable mTHP=64K to run autonuma-benchmark, via the sysfs knob shown below.
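>
> Assuming the standard per-size sysfs interface from the mTHP series,
> that is:
>
>     echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled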
>
> Base without the patch (runtime in seconds, lower is better):
> numa01:               222.97
> numa01_THREAD_ALLOC:  115.78
> numa02:                13.04
> numa02_SMT:            14.69
>
> Base with the patch:
> numa01:               125.36
> numa01_THREAD_ALLOC:   44.58
> numa02:                 9.22
> numa02_SMT:             7.46
>
> [1] https://lore.kernel.org/all/20231117100745.fnpijbk4xgmals3k@xxxxxxxxxxxxxxxxxxx/
> Signed-off-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
> ---
> Changes from RFC v1:
> - Add some performance data per Huang, Ying.
> - Allow mTHP scanning per David Hildenbrand.
> - Avoid NUMA balancing on shared mappings to avoid false sharing.
> - Expand the commit message.
> ---
>  mm/memory.c   | 9 +++++----
>  mm/mprotect.c | 3 ++-
>  2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f2bc6dd15eb8..b9d5d88c5a76 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5059,7 +5059,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	int last_cpupid;
>  	int target_nid;
>  	pte_t pte, old_pte;
> -	int flags = 0;
> +	int flags = 0, nr_pages = 0;
>  
>  	/*
>  	 * The pte cannot be used safely until we verify, while holding the page
> @@ -5089,8 +5089,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	if (!folio || folio_is_zone_device(folio))
>  		goto out_map;
>  
> -	/* TODO: handle PTE-mapped THP */
> -	if (folio_test_large(folio))
> +	/* Avoid large folio false sharing */
> +	if (folio_test_large(folio) && folio_estimated_sharers(folio) > 1)
>  		goto out_map;
>  
>  	/*
> @@ -5112,6 +5112,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  		flags |= TNF_SHARED;
>  
>  	nid = folio_nid(folio);
> +	nr_pages = folio_nr_pages(folio);
>  	/*
>  	 * For memory tiering mode, cpupid of slow memory page is used
>  	 * to record page access time. So use default value.
> @@ -5148,7 +5149,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  
>  out:
>  	if (nid != NUMA_NO_NODE)
> -		task_numa_fault(last_cpupid, nid, 1, flags);
> +		task_numa_fault(last_cpupid, nid, nr_pages, flags);
>  	return 0;
>  out_map:
>  	/*
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index f8a4544b4601..f0b9c974aaae 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -129,7 +129,8 @@ static long change_pte_range(struct mmu_gather *tlb,
>  
>  				/* Also skip shared copy-on-write pages */
>  				if (is_cow_mapping(vma->vm_flags) &&
> -				    folio_ref_count(folio) != 1)
> +				    (folio_maybe_dma_pinned(folio) ||
> +				     folio_estimated_sharers(folio) > 1))
>  					continue;
>  
>  				/*