The current kernel has basic memory tiering support: inactive
pages on a higher tier NUMA node can be migrated (demoted) to a lower
tier NUMA node to make room for new allocations on the higher tier
NUMA node. Frequently accessed pages on a lower tier NUMA node can be
migrated (promoted) to a higher tier NUMA node to improve
performance.
A tiering relationship between NUMA nodes, in the form of a demotion
path, is created during kernel initialization and updated when a NUMA
node is hot-added or hot-removed. The current implementation puts all
nodes with CPUs into the top tier, and then builds the tiering
hierarchy tier-by-tier by establishing the per-node demotion targets
based on the distances between nodes.
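For example, on a hypothetical two-socket system where nodes 0 and 1
are DRAM nodes with CPUs and nodes 2 and 3 are memory-only PMEM nodes
attached to sockets 0 and 1 respectively, nodes 0 and 1 form the top
tier and each demotes to its nearest PMEM node, i.e. node 0 -> node 2
and node 1 -> node 3.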
The current memory tiering interface needs to be improved to address
several important use cases:
* The current tiering initialization code always places each
  memory-only NUMA node in a lower tier. But a memory-only NUMA node
  may host a high-performance memory device (e.g. a DRAM device
  attached via CXL.mem or a DRAM-backed memory-only node on a virtual
  machine) and should be put into the top tier.
Tiering Hierarchy Initialization
================================
By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
A device driver can remove its memory nodes from the top tier; for
example, a dax driver can remove PMEM nodes from it.
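As a minimal sketch of such an opt-out, assuming N_TOPTIER_MEMORY is
added as a new entry in enum node_states and manipulated with the
existing node_state()/node_clear_state() helpers from
<linux/nodemask.h> (the dax_kmem_drop_from_top_tier() hook below is
hypothetical, not actual driver code):

#include <linux/nodemask.h>

/*
 * Hypothetical driver-side opt-out: take a PMEM-backed node out of
 * the top tier so it is treated as a demotion target rather than a
 * promotion target.  N_TOPTIER_MEMORY is the node state proposed
 * above.
 */
static void dax_kmem_drop_from_top_tier(int nid)
{
        if (node_state(nid, N_TOPTIER_MEMORY))
                node_clear_state(nid, N_TOPTIER_MEMORY);
}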
The kernel builds the memory tiering hierarchy and per-node demotion
order tier-by-tier starting from N_TOPTIER_MEMORY. For a node N, the
nodes at the shortest distance in the next lower tier are assigned to
node_demotion[N].preferred and all the nodes in the next lower tier
are assigned to node_demotion[N].allowed.
node_demotion[N].preferred can be empty if no preferred demotion node
is available for node N.
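A simplified sketch of the per-node assignment described above (the
struct fields follow the node_demotion[N].preferred/allowed names used
in the text; set_demotion_targets() is an illustrative helper for one
tier boundary, not the actual kernel implementation, which walks the
tiers starting from N_TOPTIER_MEMORY):

#include <linux/kernel.h>
#include <linux/nodemask.h>
#include <linux/topology.h>

struct demotion_nodes {
        nodemask_t preferred;   /* best-distance nodes in next lower tier */
        nodemask_t allowed;     /* all nodes in next lower tier */
};

static struct demotion_nodes node_demotion[MAX_NUMNODES];

/*
 * Fill in node_demotion[nid] given the set of nodes making up the
 * next lower tier: all of them become "allowed" targets, and those at
 * the shortest NUMA distance from nid become "preferred" targets.
 */
static void set_demotion_targets(int nid, const nodemask_t *lower_tier)
{
        int target, best = INT_MAX;

        node_demotion[nid].allowed = *lower_tier;
        nodes_clear(node_demotion[nid].preferred);

        for_each_node_mask(target, *lower_tier)
                best = min(best, node_distance(nid, target));

        for_each_node_mask(target, *lower_tier)
                if (node_distance(nid, target) == best)
                        node_set(target, node_demotion[nid].preferred);
}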
The memory tiering hierarchy is rebuilt upon hot-add or hot-remove of
a memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
node.
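One plausible way to wire this up is sketched below, assuming the
rebuild is driven from a memory hotplug notifier. The function names
and priority value are illustrative, establish_demotion_targets()
stands in for the rebuild routine described above, and no CPU hotplug
callback is registered, matching the behaviour described above.

#include <linux/init.h>
#include <linux/memory.h>
#include <linux/notifier.h>

/* Stands in for the tier/demotion-order rebuild described above. */
static void establish_demotion_targets(void)
{
        /* ... rebuild node_demotion[] tier-by-tier ... */
}

/*
 * Rebuild the demotion order only on memory hotplug events; CPU-only
 * hotplug does not trigger a rebuild because no cpuhp callback is
 * registered.
 */
static int __meminit demotion_hotplug_callback(struct notifier_block *self,
                                               unsigned long action, void *arg)
{
        switch (action) {
        case MEM_ONLINE:
        case MEM_OFFLINE:
                establish_demotion_targets();
                break;
        }
        return NOTIFY_OK;
}

static int __init demotion_hotplug_init(void)
{
        hotplug_memory_notifier(demotion_hotplug_callback, 100);
        return 0;
}
late_initcall(demotion_hotplug_init);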