Re: [RFC PATCH v2] mm: support multi-size THP numa balancing

From: David Hildenbrand
Date: Mon Mar 18 2024 - 06:16:14 EST


On 18.03.24 11:13, Baolin Wang wrote:


On 2024/3/18 17:48, David Hildenbrand wrote:
On 18.03.24 10:42, Baolin Wang wrote:


On 2024/3/18 14:16, Huang, Ying wrote:
Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx> writes:

Anonymous page allocation already supports multi-size THP (mTHP), but
NUMA balancing still prohibits mTHP migration even for an exclusive
mapping, which is unreasonable. Thus let's support NUMA balancing for
exclusive mTHP first.

Allow scanning mTHP:
Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section
pages") skips NUMA page migration for shared CoW pages to avoid migrating
shared data segments. In addition, commit 80d47f5de5e3 ("mm: don't try to
NUMA-migrate COW pages that have other uses") changed to use page_count()
to avoid migrating GUP pages, which also causes mTHP to be skipped during
NUMA scanning. Theoretically, we can use folio_maybe_dma_pinned() to detect
the GUP issue; although there is still a GUP race, the issue seems to have
been resolved by commit 80d47f5de5e3. Meanwhile, use
folio_estimated_sharers() to skip shared CoW pages, though this is not a
precise sharers count. To check whether the folio is shared, ideally we
want to make sure every page is mapped to the same process, but doing that
seems expensive, and using the estimated mapcount seems to work when
running the autonuma benchmark.
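
To illustrate why the estimate is cheap but imprecise, here is a minimal
userspace sketch. It assumes the estimate is derived from the mapcount of
the first subpage only, which is my understanding of
folio_estimated_sharers() at the time of this thread. The model_* names and
the per-subpage mapcount array are hypothetical stand-ins, not kernel
structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a large folio: one mapcount per subpage. */
struct model_folio {
    size_t nr_pages;
    const int *mapcount;    /* mapcount[i]: number of mappings of subpage i */
};

/* Cheap estimate (assumption): look at the first subpage only. */
static int model_estimated_sharers(const struct model_folio *f)
{
    return f->mapcount[0];
}

/* Precise but O(nr_pages) check: is any subpage mapped more than once? */
static bool model_any_subpage_shared(const struct model_folio *f)
{
    for (size_t i = 0; i < f->nr_pages; i++)
        if (f->mapcount[i] > 1)
            return true;
    return false;
}
```

A folio whose third subpage is mapped twice while the first is mapped once
looks exclusive to the estimate but shared to the full scan, which is the
imprecision the commit message accepts in exchange for a constant-time check.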

Allow migrating mTHP:
As mentioned in the previous thread[1], large folios are more susceptible
to false sharing issues, leading to pages ping-ponging back and forth
during NUMA balancing, which is currently hard to resolve. Therefore, as a
start to supporting mTHP NUMA balancing, only exclusive mappings are
allowed to perform NUMA migration, to avoid the false sharing issues with
large folios. Similarly, use the estimated mapcount to skip shared
mappings, which seems to work in most cases (?); we already use
folio_estimated_sharers() to skip shared mappings in
migrate_misplaced_folio() for NUMA balancing, and there seem to be no real
complaints.

IIUC, folio_estimated_sharers() cannot identify multi-thread
applications.  If some mTHP is shared by multiple threads in one

Right.


Wasn't this "false sharing" previously raised/described by Mel in this
context?

Yes, I got confused with the process's false sharing.

process, how to deal with that?

IMHO, it seems should_numa_migrate_memory() already does something to
help?

......
    if (!cpupid_pid_unset(last_cpupid) &&
                cpupid_to_nid(last_cpupid) != dst_nid)
        return false;

    /* Always allow migrate on private faults */
    if (cpupid_match_pid(p, last_cpupid))
        return true;
......

If the node of the CPU that last accessed the mTHP differs from the node of
the current access, there is some contention for that mTHP among threads,
so migration is not allowed.
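
The two quoted checks can be modelled in userspace roughly as follows. This
is only a sketch: the struct, the enum and the function name are
hypothetical, the "pid" is treated as a plain task ID, and the code elided
by "......" in the quote is represented by a fall-through verdict rather
than guessed at.

```c
#include <stdbool.h>

/* Outcome of the two quoted checks; FALLTHROUGH stands for the elided code. */
enum model_verdict { MODEL_DENY, MODEL_ALLOW, MODEL_FALLTHROUGH };

/* Hypothetical stand-in for the information packed into last_cpupid. */
struct model_last_access {
    bool recorded;  /* false models cpupid_pid_unset() */
    int nid;        /* node of the CPU that faulted last time */
    int pid;        /* task that faulted last time */
};

static enum model_verdict
model_quoted_checks(const struct model_last_access *last, int cur_pid, int dst_nid)
{
    /* Last access came from another node: contention, do not migrate. */
    if (last->recorded && last->nid != dst_nid)
        return MODEL_DENY;

    /* Same task faulted twice in a row: private fault, always migrate. */
    if (last->recorded && last->pid == cur_pid)
        return MODEL_ALLOW;

    /* Everything else is decided by code not shown in the quote. */
    return MODEL_FALLTHROUGH;
}
```

In this model, two threads on different nodes that alternate faults on the
same mTHP keep hitting the first check, so the folio stays where it is.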

If the contention for the mTHP among threads is light or the accessing
is relatively stable, then we can allow migration?

For example, I think that we should avoid migrating on the first fault
for mTHP in should_numa_migrate_memory().

More thoughts?  Can we add a field in struct folio for mTHP to count
hint page faults from the same node?

IIUC, we do not need to add a new field to the folio; it seems we can reuse
the ->_flags_2a field. But how to use it? If there are multiple consecutive
NUMA faults from the same node, then allow migration?
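
One way to read the "multiple consecutive NUMA faults from the same node"
idea is a small streak counter. The sketch below is a userspace model only:
the state struct, where it would live in struct folio, and the threshold
are all hypothetical and not something proposed concretely in this thread.

```c
#include <stdbool.h>

#define MODEL_NO_NODE (-1)  /* hypothetical "no fault seen yet" marker */

/* Hypothetical per-folio state: last faulting node and current streak. */
struct model_fault_state {
    int last_nid;
    unsigned int streak;
};

/*
 * Record one NUMA hint fault from node nid. Returns true once at least
 * `threshold` consecutive faults have come from the same node.
 */
static bool model_record_fault(struct model_fault_state *s, int nid,
                               unsigned int threshold)
{
    if (s->last_nid == nid) {
        s->streak++;
    } else {
        /* A fault from a different node restarts the streak. */
        s->last_nid = nid;
        s->streak = 1;
    }
    return s->streak >= threshold;
}
```

Starting from { MODEL_NO_NODE, 0 } with a threshold of 2, the first fault
never permits migration, which also matches the suggestion above to avoid
migrating on the first fault.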

_flags_2a cannot be used. You could place something after _deferred_list.

Could you be more explicit? I didn't see _flags_2 currently being used,
did I miss something?

Yes: we use it implicitly via page->flags on subpages (for example, some
flags are still per-subpage and not per-folio).
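
The overlap can be shown with a small userspace model. The assumption here
is that the folio's third flags word sits at the same offset as the flags
word of the third subpage; the model_* types and the 8-word page size are
hypothetical simplifications, not the real struct page or struct folio
layouts.

```c
#include <stddef.h>

/* Hypothetical simplified page descriptor: a flags word plus padding. */
struct model_page {
    unsigned long flags;
    unsigned long other[7];
};

/* Hypothetical simplified folio: laid out over three consecutive pages. */
struct model_folio {
    unsigned long flags;         unsigned long pad0[7];
    unsigned long model_flags_1; unsigned long pad1[7];
    unsigned long model_flags_2; unsigned long pad2[7];
};

/* The same memory viewed either as three pages or as one folio. */
union model_compound {
    struct model_page pages[3];
    struct model_folio folio;
};

/* Set a per-subpage flag bit and return the folio's third flags word. */
static unsigned long model_set_subpage_flag(union model_compound *c,
                                            size_t idx, unsigned long bit)
{
    c->pages[idx].flags |= bit;
    return c->folio.model_flags_2;
}
```

Setting a bit through pages[2].flags changes what folio.model_flags_2 reads
back, so in this model the word is not free for a new counter even though
nothing names it directly.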

--
Cheers,

David / dhildenb