Re: [PATCH -V10 1/9] mm/numa: automatically generate node migration order

From: Zi Yan
Date: Thu Jul 15 2021 - 13:53:08 EST


On 15 Jul 2021, at 1:51, Huang Ying wrote:

> From: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>
> Prepare for the kernel to auto-migrate pages to other memory nodes
> with a node migration table. This allows creating a single migration
> target for each NUMA node to enable the kernel to do NUMA page
> migrations instead of simply discarding colder pages. A node with no
> target is a "terminal node", so reclaim acts normally there. The
> migration target does not fundamentally _need_ to be a single node,
> but this implementation starts there to limit complexity.
>
> When memory fills up on a node, memory contents can be
> automatically migrated to another node. The biggest problems are
> knowing when to migrate and to where the migration should be
> targeted.
>
> The most straightforward way to generate the "to where" list would
> be to follow the page allocator fallback lists. Those lists
> already tell us, when memory is full, where to look next. It would
> also be logical to move memory in that order.
>
> But, the allocator fallback lists have a fatal flaw: most nodes
> appear in all the lists. This would potentially lead to migration
> cycles (A->B, B->A, A->B, ...).
>
> Instead of using the allocator fallback lists directly, keep a
> separate node migration ordering. But, reuse the same data used
> to generate page allocator fallback in the first place:
> find_next_best_node().
>
> This means that the firmware data used to populate node distances
> essentially dictates the ordering for now. It should also be
> architecture-neutral since all NUMA architectures have a working
> find_next_best_node().
>
> RCU is used to allow lock-less reads of node_demotion[] and to prevent
> demotion cycles from being observed. If multiple reads of node_demotion[]
> are performed, a single rcu_read_lock() must be held over all reads to
> ensure no cycles are observed. Details are as follows.
>
> === What does RCU provide? ===
>
> Imaginge a simple loop which walks down the demotion path looking

s/Imaginge/Imagine

> for the last node:
>
> terminal_node = start_node;
> while (node_demotion[terminal_node] != NUMA_NO_NODE) {
> terminal_node = node_demotion[terminal_node];
> }
>
> The initial values are:
>
> node_demotion[0] = 1;
> node_demotion[1] = NUMA_NO_NODE;
>
> and are updated to:
>
> node_demotion[0] = NUMA_NO_NODE;
> node_demotion[1] = 0;
>
> What guarantees that the cycle is not observed:
>
> node_demotion[0] = 1;
> node_demotion[1] = 0;
>
> and would loop forever?
>
> With RCU, a rcu_read_lock/unlock() can be placed around the
> loop. Since the write side does a synchronize_rcu(), the loop
> that observed the old contents is known to be complete before the
> synchronize_rcu() has completed.
>
> RCU, combined with disable_all_migrate_targets(), ensures that
> the old migration state is not visible by the time
> __set_migration_target_nodes() is called.
>
> === What does READ_ONCE() provide? ===
>
> READ_ONCE() forbids the compiler from merging or reordering
> successive reads of node_demotion[]. This ensures that any
> updates are *eventually* observed.
>
> Consider the above loop again. The compiler could theoretically
> read the entirety of node_demotion[] into local storage
> (registers) and never go back to memory, and *permanently*
> observe bad values for node_demotion[].
>
> Note: RCU does not provide any universal compiler-ordering
> guarantees:
>
> https://lore.kernel.org/lkml/20150921204327.GH4029@xxxxxxxxxxxxxxxxxx/
>
> This code is unused for now. It will be called later in the
> series.
>
> Signed-off-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
> Signed-off-by: "Huang, Ying" <ying.huang@xxxxxxxxx>
> Reviewed-by: Yang Shi <shy828301@xxxxxxxxx>
> Reviewed-by: Oscar Salvador <osalvador@xxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxx>
> Cc: Wei Xu <weixugc@xxxxxxxxxx>
> Cc: Zi Yan <ziy@xxxxxxxxxx>
> Cc: David Rientjes <rientjes@xxxxxxxxxx>
> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
> Cc: David Hildenbrand <david@xxxxxxxxxx>
>
> --
>
> Changes from 20210618:
> * Merge patches for data structure definition and initialization
> * Move RCU usage from the next patch in series per Zi's comments
>
> Changes from 20210302:
> * Fix typo in node_demotion[] comment
>
> Changes since 20200122:
> * Make node_demotion[] __read_mostly
> * Add big node_demotion[] comment
>
> Changes in July 2020:
> - Remove loop from next_demotion_node() and get_online_mems().
> This means that the node returned by next_demotion_node()
> might now be offline, but the worst case is that the
> allocation fails. That's fine since it is transient.
> ---
> mm/internal.h | 5 ++
> mm/migrate.c | 216 ++++++++++++++++++++++++++++++++++++++++++++++++
> mm/page_alloc.c | 2 +-
> 3 files changed, 222 insertions(+), 1 deletion(-)

LGTM. Reviewed-by: Zi Yan <ziy@xxxxxxxxxx>



Best Regards,
Yan, Zi
