[RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving

From: Gregory Price
Date: Wed Oct 11 2023 - 16:44:13 EST


v2: change memtier mutex to semaphore
add source-node relative weighting
add remaining mempolicy integration code

= v2 Notes

Developed in collaboration with the original authors to deconflict
similar efforts to extend mempolicy to take weights directly.

== Mutex to Semaphore change:

This patch set extends the memory tiering subsystem with externally
visible information (weights), so additional controls are needed to
ensure values are not changed (and tiers are not added, removed, or
modified) while calculations that use them are in flight.

Since it is expected that many threads will be accessing this data
during allocations, a mutex is not appropriate.

Since write-updates (weight changes, hotplug events) are rare events,
a simple rw semaphore is sufficient.
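
A minimal sketch of the locking pattern, assuming a single rw
semaphore guards the tier topology and weights (names and details
here are illustrative, not the patch code):

#include <linux/rwsem.h>

/* Illustrative only: one rw semaphore for tier topology and weights. */
static DECLARE_RWSEM(memory_tier_sem);

static int example_weight;  /* stand-in for the real per-tier weights */

/* Hot path: many concurrent readers during page allocation. */
static int memtier_example_read_weight(void)
{
        int w;

        down_read(&memory_tier_sem);
        w = example_weight;
        up_read(&memory_tier_sem);
        return w;
}

/* Cold path: weight changes and hotplug events take the write side. */
static void memtier_example_set_weight(int new_weight)
{
        down_write(&memory_tier_sem);
        example_weight = new_weight;
        up_write(&memory_tier_sem);
}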

== Source-node relative weighting:

Tiers can now be weighted differently based on the node requesting
the weight. For example, CPU nodes 0 and 1 may have different weights
for the same CXL memory tier, because the number of NUMA hops between
them and that tier differs (or because of any other physical
topological difference resulting in different effective latency or
bandwidth values).

Set weights for the DDR (tier4) and CXL (tier22) tiers, using the syntax:
echo source_node:weight > /path/to/interleave_weight

# Set tier4 weight from node 0 to 85
echo 0:85 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
# Set tier4 weight from node 1 to 65
echo 1:65 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
# Set tier22 weight from node 0 to 15
echo 0:15 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight
# Set tier22 weight from node 1 to 10
echo 1:10 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight
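
Internally this implies one weight per (tier, source node) pair
rather than a single scalar per tier. A minimal sketch of such a
layout, with hypothetical names (not the patch's actual structures):

#include <linux/nodemask.h>

/* Illustrative only: a tier's interleave weights, one per source node. */
struct example_tier_weights {
        unsigned char weight[MAX_NUMNODES];  /* indexed by requesting node */
};

/* Effective weight of this tier as seen from @src_node. */
static inline unsigned char
example_tier_weight_from(const struct example_tier_weights *tw, int src_node)
{
        return tw->weight[src_node];
}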

== Mempolicy integration

Two new functions have been added to memory-tiers.c
* memtier_get_node_weight
- Get the effective weight for a given node
* memtier_get_total_weight
- Get the "total effective weight" for a given nodemask.

These functions are used by the following functions in mempolicy:
* interleave_nodes
* offset_il_node
* alloc_pages_bulk_array_interleave

The weight values are used to determine how many pages should be
allocated per-node as interleave rounds occur.
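
As a rough illustration of how the weights drive the interleave
round-robin (hypothetical helper with simplified signatures; the
real code fetches weights via memtier_get_node_weight() under the
memtier semaphore):

#include <linux/nodemask.h>

/*
 * Illustrative only: weighted round-robin over a policy nodemask.
 * A node is returned 'weight' times before advancing, so a node
 * weighted 85 satisfies 85 allocations per round versus 15 for a
 * node weighted 15.
 */
static unsigned int example_next_weighted_node(const nodemask_t *mask,
                                               unsigned int *cur_node,
                                               unsigned int *cur_count,
                                               const unsigned char *weights)
{
        unsigned int node = *cur_node;

        if (++(*cur_count) >= weights[node]) {
                *cur_count = 0;                         /* share exhausted */
                *cur_node = next_node_in(node, *mask);  /* wraps around */
        }
        return node;
}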

To avoid holding the memtier semaphore for long periods (e.g. across
the calls that actually allocate pages), a small race window is
tolerated during bulk allocation, between calculating the total
weight of a nodemask and fetching each individual node weight. This
is handled by detecting the over/under-allocation conditions and
compensating accordingly.
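
Sketched with hypothetical names (example_alloc_on_node() is a
stand-in for the real bulk allocator, and the memtier_* signatures
are simplified), the pattern looks roughly like this:

/*
 * Illustrative only: total weight and per-node weights are read in
 * separate short critical sections, so a weight update can slip in
 * between. The caller compensates by checking the final page count:
 * a shortfall is topped up, and over-allocation is avoided by
 * clamping each node's share to the pages still outstanding.
 */
static unsigned long example_bulk_interleave(nodemask_t *mask,
                                             unsigned long nr_pages)
{
        unsigned long total = memtier_get_total_weight(mask);
        unsigned long allocated = 0;
        int node;

        for_each_node_mask(node, *mask) {
                /* Per-node share; may race with a concurrent weight update. */
                unsigned long share =
                        nr_pages * memtier_get_node_weight(node) / total;

                share = min(share, nr_pages - allocated);  /* avoid overshoot */
                allocated += example_alloc_on_node(node, share);
        }

        /* Under-allocation from the race: top up from the first node. */
        if (allocated < nr_pages)
                allocated += example_alloc_on_node(first_node(*mask),
                                                   nr_pages - allocated);
        return allocated;
}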

~Gregory

=== original RFC ====

From: Ravi Shankar <ravis.opensrc@xxxxxxxxxx>

Hello,

The current interleave policy operates by interleaving page requests
among nodes defined in the memory policy. To accommodate the
introduction of memory tiers for various memory types (e.g., DDR, CXL,
HBM, PMEM), a mechanism is needed for interleaving page requests
across these memory types or tiers.

This can be achieved by implementing an interleaving method that
considers tier weights. A weight can be assigned to each memory type
(tier) within the system, and it determines the proportion of pages
allocated from that tier's nodes, out of the nodes specified in the
memory policy. For example, with weights of 85 and 15, 85 of every
100 interleaved pages come from nodes of the first tier and 15 from
nodes of the second.

Hasan Al Maruf had put forth a proposal for interleaving between two
tiers, namely the top tier and the low tier. However, this patch was
not adopted due to constraints on the number of available tiers.

https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@xxxxxxxxxxx/T/

New proposed changes:

1. Introduce a sysfs entry to allow setting the interleave weight for each
memory tier (a sketch of such an attribute follows this list).
2. Give each tier a default weight of 1, indicating a standard 1:1
proportion.
3. Distribute each tier's weight uniformly across all of its nodes.
4. Modify the existing interleaving algorithm to support multi-tier
interleaving based on tier weights.
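
For illustration, a sysfs attribute of roughly this shape could back
the proposed entry (struct example_tier and to_example_tier() are
hypothetical stand-ins for the real memory_tier plumbing):

#include <linux/device.h>
#include <linux/kstrtox.h>
#include <linux/sysfs.h>

struct example_tier {
        struct device dev;
        int interleave_weight;
};
#define to_example_tier(d) container_of(d, struct example_tier, dev)

static ssize_t interleave_weight_show(struct device *dev,
                                      struct device_attribute *attr,
                                      char *buf)
{
        return sysfs_emit(buf, "%d\n", to_example_tier(dev)->interleave_weight);
}

static ssize_t interleave_weight_store(struct device *dev,
                                       struct device_attribute *attr,
                                       const char *buf, size_t count)
{
        int weight, ret;

        ret = kstrtoint(buf, 10, &weight);
        if (ret)
                return ret;
        to_example_tier(dev)->interleave_weight = weight;
        return count;
}
static DEVICE_ATTR_RW(interleave_weight);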

This is in line with Huang, Ying's presentation at LPC 2022, slide 16 in
https://lpc.events/event/16/contributions/1209/attachments/1042/1995/\
Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf

We observed a significant increase (165%) in bandwidth utilization
with the newly proposed multi-tier interleaving compared to the
traditional 1:1 interleaving approach between DDR and CXL tier nodes,
where 85% of the bandwidth is allocated to the DDR tier and 15% to
the CXL tier, using the MLC -w2 option.

Usage Example:

1. Set weights for the DDR (tier4) and CXL (tier22) tiers.
echo 85 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
echo 15 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight

2. Interleave between DDR (tier4, node-0) and CXL (tier22, node-1) using numactl:
numactl -i0,1 mlc --loaded_latency W2

Gregory Price (3):
mm/memory-tiers: change mutex to rw semaphore
mm/memory-tiers: Introduce sysfs for tier interleave weights
mm/mempolicy: modify interleave mempolicy to use memtier weights

include/linux/memory-tiers.h | 16 ++++
include/linux/mempolicy.h | 3 +
mm/memory-tiers.c | 179 +++++++++++++++++++++++++++++++----
mm/mempolicy.c | 148 +++++++++++++++++++++++------
4 files changed, 297 insertions(+), 49 deletions(-)

--
2.39.1