Re: [PATCH v5 9/9] mm/demotion: Update node_is_toptier to work with memory tiers

From: Aneesh Kumar K V
Date: Mon Jun 06 2022 - 04:39:02 EST


On 6/6/22 12:54 PM, Ying Huang wrote:
On Mon, 2022-06-06 at 09:22 +0530, Aneesh Kumar K V wrote:
On 6/6/22 8:41 AM, Ying Huang wrote:
On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
With memory tiers support we can have memory on NUMA nodes
in the top tier from which we want to avoid promotion tracking NUMA
faults. Update node_is_toptier to work with memory tiers. To
avoid taking locks, a nodemask is maintained for all demotion
targets. All NUMA nodes are by default top tier nodes and as
we add new lower memory tiers NUMA nodes get added to the
demotion targets thereby moving them out of the top tier.
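
For illustration only, the lockless nodemask idea described above can be
modelled in userspace roughly as follows. This is a sketch, not the patch
itself: the 64-bit mask, the MODEL_MAX_NODES limit and all model_* names
are hypothetical stand-ins for the kernel nodemask helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_NODES 64	/* hypothetical limit for this model */

/* Hypothetical stand-in for a kernel nodemask: one bit per NUMA node. */
static uint64_t model_demotion_targets;

/* Record that a node has been placed in a lower memory tier. */
static void model_add_demotion_target(int node)
{
	assert(node >= 0 && node < MODEL_MAX_NODES);
	model_demotion_targets |= UINT64_C(1) << node;
}

/* Plain mask read, no lock: top tier unless the node is a demotion target. */
static bool model_node_is_toptier(int node)
{
	assert(node >= 0 && node < MODEL_MAX_NODES);
	return !(model_demotion_targets & (UINT64_C(1) << node));
}
```

With an empty mask every node reports as top tier, which matches the
default described above.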

Check the usage of node_is_toptier(),

- migrate_misplaced_page()
   node_is_toptier() is used to check whether migration is a promotion.
We can avoid using it and just compare the ranks of the nodes.

- change_pte_range() and change_huge_pmd()
   node_is_toptier() is used to avoid scanning fast memory (DRAM) pages
for promotion. So I think we should change the name to node_is_fast()
as follows,

static inline bool node_is_fast(int node)
{
	return NODE_DATA(node)->mt_rank >= MEMORY_RANK_DRAM;
}
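
For the migrate_misplaced_page() case, the rank comparison could look
roughly like the userspace sketch below. It assumes a higher rank means
faster memory, as node_is_fast() above implies; struct model_pgdat and
model_is_promotion() are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-node data; only the rank matters for this check. */
struct model_pgdat {
	int mt_rank;	/* assumed: higher rank means faster memory */
};

/* A migration is a promotion if the destination ranks above the source. */
static bool model_is_promotion(const struct model_pgdat *src,
			       const struct model_pgdat *dst)
{
	assert(src != NULL && dst != NULL);
	return dst->mt_rank > src->mt_rank;
}
```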


But that gives special meaning to MEMORY_RANK_DRAM. As detailed in other
patches, the absolute value of a rank doesn't carry any meaning. It is
only the relative value w.r.t. other memory tiers that decides whether a
tier is fast or not. Agreed, by default memory tiers get built with
MEMORY_RANK_DRAM. But userspace can change the rank value of 'memtier1'.
Hence, isn't determining whether a node consists of fast memory
essentially figuring out whether the node is in the topmost tier of the
memory hierarchy, and not just whether its memory tier rank value is
>= MEMORY_RANK_DRAM?

In a system with 3 tiers,

HBM 0
DRAM 1
PMEM 2

In your implementation, only HBM will be considered fast. But what we
need is to consider both HBM and DRAM fast. Because we use NUMA
balancing to promote PMEM pages to DRAM. It's unnecessary to scan HBM
and DRAM pages for that. And there're no requirements to promote DRAM
pages to HBM with NUMA balancing.

I can understand that the memory tiers are more dynamic now. For
requirements of NUMA balancing, we need the lowest memory tier (rank)
where there's at least one node with CPU. The nodes in it and the
higher tiers will be considered fast.
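
A rough userspace sketch of that rule, for illustration only: find the
lowest rank among tiers that contain a CPU node, then treat every tier at
or above that rank as fast. struct model_tier and both helpers are
hypothetical, and the sketch assumes non-negative ranks where a higher
rank means faster memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a memory tier; not a kernel structure. */
struct model_tier {
	int rank;	/* assumed: non-negative, higher means faster */
	bool has_cpu;	/* true if any node in the tier has CPUs */
};

/*
 * Return the lowest rank among tiers that contain a CPU node,
 * or -1 if no tier has CPUs.
 */
static int model_fast_rank_threshold(const struct model_tier *tiers, size_t n)
{
	int threshold = -1;

	for (size_t i = 0; i < n; i++) {
		if (!tiers[i].has_cpu)
			continue;
		if (threshold < 0 || tiers[i].rank < threshold)
			threshold = tiers[i].rank;
	}
	return threshold;
}

/* A tier is fast if it ranks at or above the threshold tier. */
static bool model_tier_is_fast(const struct model_tier *tier, int threshold)
{
	assert(tier != NULL);
	return threshold >= 0 && tier->rank >= threshold;
}
```

With the HBM/DRAM/PMEM example above and CPUs only on DRAM, both HBM and
DRAM come out fast and PMEM does not.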


Is this good (not tested)?
/*
 * Build the allowed promotion mask. Promotion is allowed
 * from a higher memory tier to a lower memory tier only if
 * the lower memory tier doesn't include compute. We want to
 * skip promotion from a memory tier if any node that is
 * part of that memory tier has CPUs. Once we detect such
 * a memory tier, we consider that tier as the top tier from
 * which promotion is not allowed.
 */
list_for_each_entry_reverse(memtier, &memory_tiers, list) {
	nodes_and(allowed, node_states[N_CPU], memtier->nodelist);
	if (nodes_empty(allowed))
		nodes_or(promotion_mask, promotion_mask, memtier->nodelist);
	else
		break;
}

and then

static inline bool node_is_toptier(int node)
{
	return !node_isset(node, promotion_mask);
}
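
To sanity-check the idea outside the kernel, the loop can be modelled in
userspace roughly as below. This is only a sketch under assumptions: the
tiers array is ordered fastest first (so walking it backwards visits the
slowest tier first), a 64-bit integer stands in for nodemask_t, and a
CPU-less tier contributes its whole node list to the mask, which is my
reading of the intent. All model_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t model_mask_t;	/* hypothetical: one bit per node */

struct model_memtier {
	model_mask_t nodelist;	/* nodes that belong to this tier */
};

/*
 * tiers[] is assumed to be ordered fastest first, so iterating from
 * the end visits the slowest tier first.
 */
static model_mask_t
model_build_promotion_mask(const struct model_memtier *tiers, size_t n,
			   model_mask_t cpu_nodes)
{
	model_mask_t promotion_mask = 0;

	for (size_t i = n; i-- > 0; ) {
		if (tiers[i].nodelist & cpu_nodes)
			break;	/* first tier with compute: stop here */
		promotion_mask |= tiers[i].nodelist;
	}
	return promotion_mask;
}

/* A node is top tier if promotion from it is not allowed. */
static bool model_node_is_toptier(int node, model_mask_t promotion_mask)
{
	assert(node >= 0 && node < 64);
	return !(promotion_mask & (UINT64_C(1) << node));
}
```

For the three-tier example (HBM on node 0, DRAM with CPUs on node 1,
PMEM on node 2), only node 2 ends up in the promotion mask, so HBM and
DRAM are both treated as top tier.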