Re: [PATCH v2 0/5] mm: demotion: Introduce new node state N_DEMOTION_TARGETS

From: Jagdish Gediya
Date: Thu Apr 14 2022 - 06:17:21 EST


On Wed, Apr 13, 2022 at 02:44:34PM -0700, Andrew Morton wrote:
> On Wed, 13 Apr 2022 14:52:01 +0530 Jagdish Gediya <jvgediya@xxxxxxxxxxxxx> wrote:
>
> > Current implementation to find the demotion targets works
> > based on node state N_MEMORY, however some systems may have
> > dram only memory numa node which are N_MEMORY but not the
> > right choices as demotion targets.
>
> Why are they not the right choice? Please describe this fully so we
> can understand the motivation and end-user benefit of the proposed
> change. And please more fully describe the end-user benefits of this
> change.

Some systems (e.g. PowerVM) can have DRAM (fast memory) only NUMA
nodes, which are N_MEMORY, alongside slow memory (persistent memory)
only NUMA nodes, which are also N_MEMORY. Because the current demotion
target finding algorithm works on N_MEMORY and best distance, it will
choose a DRAM-only NUMA node as the demotion target instead of a
persistent memory node on such systems. If the DRAM-only NUMA node
fills up with demoted pages, then at some point new allocations can
start falling back to persistent memory. The result is that cold pages
sit in fast memory (due to demotion) while new pages land in slow
memory. This is why persistent memory nodes should be used for
demotion and DRAM nodes should be avoided as demotion targets, so that
they remain available for new allocations.

The current implementation works fine on systems where memory-only
NUMA nodes can only be persistent/slow memory, but it is not suitable
for systems like the ones I have mentioned above.

Introducing the new node state N_DEMOTION_TARGETS provides a way to
handle demotion on such systems without affecting the existing
behavior.
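
To make the problem concrete, here is a rough userspace model in
Python. It is only a sketch, not kernel code: the 3-node topology, the
distance values and the helper name nearest_nodes() are all made up
for illustration. It shows that picking the nearest node out of
N_MEMORY lands on the DRAM-only node, while restricting the candidates
to a demotion target set lands on the persistent memory node.

```python
# Userspace model only; not kernel code. The topology is hypothetical.

def nearest_nodes(distance, src, candidates):
    """Return the candidates at minimum distance from src, sorted by node id."""
    others = [n for n in candidates if n != src]
    if not others:
        return []
    best = min(distance[src][n] for n in others)
    return sorted(n for n in others if distance[src][n] == best)

# Hypothetical system: node 0 = CPU + DRAM, node 1 = DRAM only,
# node 2 = persistent memory only.
DISTANCE = [
    [10, 20, 40],
    [20, 10, 40],
    [40, 40, 10],
]

N_MEMORY = {0, 1, 2}        # every node that has memory
N_DEMOTION_TARGETS = {2}    # only the slow memory node

by_n_memory = nearest_nodes(DISTANCE, 0, N_MEMORY)
by_demotion_targets = nearest_nodes(DISTANCE, 0, N_DEMOTION_TARGETS)

print(by_n_memory)          # [1] -> the DRAM-only node is picked
print(by_demotion_targets)  # [2] -> the persistent memory node is picked
```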

> > This patch series introduces the new node state
> > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > is used to hold the list of nodes which can be used as demotion
> > targets, support is also added to set the demotion target
> > list from user space so that default behavior can be overridden.
>
> Permanently extending the kernel ABI is a fairly big deal. Please
> fully explain the end-user value, usage scenarios, etc.
>
> What would go wrong if we simply omitted this interface?

I am going to modify this interface according to review feedback in
the next version, but let me explain why it is needed with examples.

Depending on the topology and the memory tiers available in the
system, users may not want to use all the demotion targets that the
kernel configures by default, e.g.,

1. To reduce cross socket traffic
2. To use only slowest memory as demotion targets when there are
multiple slow memory only nodes available

The current patch series handles option 2 above but not option 1, so
the next version will add that support and might use a different
implementation to handle such scenarios.

Example 1
---------

With the NUMA topology below, where nodes 0 & 1 are CPU + DRAM nodes,
nodes 2 & 3 are equally slow memory-only nodes, and node 4 is the
slowest memory-only node,

available: 5 nodes (0-4)
node 0 cpus: 0 1
node 0 size: n MB
node 0 free: n MB
node 1 cpus: 2 3
node 1 size: n MB
node 1 free: n MB
node 2 cpus:
node 2 size: n MB
node 2 free: n MB
node 3 cpus:
node 3 size: n MB
node 3 free: n MB
node 4 cpus:
node 4 size: n MB
node 4 free: n MB
node distances:
node   0   1   2   3   4
  0:  10  20  40  40  80
  1:  20  10  40  40  80
  2:  40  40  10  40  80
  3:  40  40  40  10  80
  4:  80  80  80  80  10

By default, this patch series prepares the demotion list below,

node    demotion_target
 0      3, 2
 1      3, 2
 2      4
 3      4
 4      X

However, a user may want to use nodes 2 & 3 only for new allocations
and only node 4 for demotion.
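
The default list and the overridden one for Example 1 can be
reproduced with a small userspace model in Python. This is a sketch
under assumptions, not the code in mm/migrate.c: the function name
build_demotion_list() is hypothetical, the pass-by-pass walk starting
from the CPU nodes is a simplification, the default target set is
assumed to be the memory-only nodes, and the order of targets within
a node's list is not modelled (they are just sorted by node id).

```python
# Userspace model only; not the kernel implementation.

def build_demotion_list(distance, cpu_nodes, demotion_targets):
    """Map each node to its nearest not-yet-used demotion targets.

    Starts from the CPU nodes; the targets found in one pass become
    the source nodes of the next pass.
    """
    targets = {node: [] for node in range(len(distance))}
    this_pass = set(cpu_nodes)
    used = set()
    while this_pass:
        used |= this_pass
        next_pass = set()
        for src in sorted(this_pass):
            candidates = [n for n in demotion_targets if n not in used]
            if not candidates:
                continue
            best = min(distance[src][n] for n in candidates)
            targets[src] = sorted(
                n for n in candidates if distance[src][n] == best
            )
            next_pass.update(targets[src])
        this_pass = next_pass
    return targets

# Distance matrix from Example 1.
DISTANCE = [
    [10, 20, 40, 40, 80],
    [20, 10, 40, 40, 80],
    [40, 40, 10, 40, 80],
    [40, 40, 40, 10, 80],
    [80, 80, 80, 80, 10],
]
CPU_NODES = {0, 1}

# Assumed default: all memory-only nodes are demotion targets.
default_list = build_demotion_list(DISTANCE, CPU_NODES, {2, 3, 4})
# User override: only the slowest node is a demotion target.
override_list = build_demotion_list(DISTANCE, CPU_NODES, {4})

print(default_list)   # {0: [2, 3], 1: [2, 3], 2: [4], 3: [4], 4: []}
print(override_list)  # {0: [4], 1: [4], 2: [], 3: [], 4: []}
```

With the override, nodes 2 & 3 are never demotion targets, so in this
model they only receive new allocations, and nodes 0 & 1 demote
straight to node 4.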

Example 2
---------

With the NUMA topology below, where nodes 0 & 2 are CPU + DRAM nodes
and node 1 is a slow memory node near node 0,

available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: n MB
node 0 free: n MB
node 1 cpus:
node 1 size: n MB
node 1 free: n MB
node 2 cpus: 2 3
node 2 size: n MB
node 2 free: n MB
node distances:
node   0   1   2
  0:  10  40  20
  1:  40  10  80
  2:  20  80  10

By default, this patch series prepares the demotion list below,

node    demotion_target
 0      1
 1      X
 2      1

However, a user may want to avoid node 1 as a demotion target for
node 2 to reduce cross-socket traffic.
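
One possible way to express that preference is a per-source allow
list, sketched below as a userspace model in Python. This is purely
hypothetical: the current series has no such interface, the helper
name apply_allowed() and the shape of the allow list are made up, and
the next version may handle this case differently.

```python
# Userspace model only; the per-source allow list is a hypothetical
# interface, not something this patch series provides.

# Distance matrix and default demotion list from Example 2.
DISTANCE = [
    [10, 40, 20],
    [40, 10, 80],
    [20, 80, 10],
]
DEFAULT_TARGETS = {0: [1], 1: [], 2: [1]}

def apply_allowed(default_targets, allowed):
    """Filter each node's targets by its allow list.

    Nodes that do not appear in `allowed` keep their default targets.
    """
    return {
        src: [t for t in tgts if src not in allowed or t in allowed[src]]
        for src, tgts in default_targets.items()
    }

# Node 2 reaches node 1 at distance 80, versus 40 from node 0.
print(DISTANCE[2][1], DISTANCE[0][1])  # 80 40

# Forbid all demotion from node 2; nodes 0 and 1 are left untouched.
restricted = apply_allowed(DEFAULT_TARGETS, {2: set()})
print(restricted)  # {0: [1], 1: [], 2: []}
```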

> > node state N_DEMOTION_TARGETS is also set from the dax kmem
> > driver, certain type of memory which registers through dax kmem
> > (e.g. HBM) may not be the right choices for demotion so in future
> > they should be distinguished based on certain attributes and dax
> > kmem driver should avoid setting them as N_DEMOTION_TARGETS,
> > however current implementation also doesn't distinguish any
> > such memory and it considers all N_MEMORY as demotion targets
> > so this patch series doesn't modify the current behavior.
> >
> > Current code which sets migration targets is modified in
> > this patch series to avoid some of the limitations on the demotion
> > target sharing and to use N_DEMOTION_TARGETS only nodes while
> > finding demotion targets.
> >
> > Changelog
> > ----------
> >
> > v2:
> > In v1, only 1st patch of this patch series was sent, which was
> > implemented to avoid some of the limitations on the demotion
> > target sharing, however for certain numa topology, the demotion
> > targets found by that patch was not most optimal, so 1st patch
> > in this series is modified according to suggestions from Huang
> > and Baolin. Different examples of demotion list comparasion
> > between existing implementation and changed implementation can
> > be found in the commit message of 1st patch.
> >
> > Jagdish Gediya (5):
> > mm: demotion: Set demotion list differently
> > mm: demotion: Add new node state N_DEMOTION_TARGETS
> > mm: demotion: Add support to set targets from userspace
> > device-dax/kmem: Set node state as N_DEMOTION_TARGETS
> > mm: demotion: Build demotion list based on N_DEMOTION_TARGETS
> >
> > .../ABI/testing/sysfs-kernel-mm-numa | 12 ++++
>
> This description is rather brief. Some additional user-facing material
> under Documentation/ would help. Describe the format for writing to the
> file, what is seen when reading from it, provide a bit of help to the
> user so they can understand how to use it, what effects they might see,
> etc.

Sure, will do in the next version.

> > drivers/base/node.c | 4 ++
> > drivers/dax/kmem.c | 2 +
> > include/linux/nodemask.h | 1 +
> > mm/migrate.c | 67 +++++++++++++++----
> > 5 files changed, 72 insertions(+), 14 deletions(-)
>