Re: [RFC PATCH V1 09/13] mm: Add heuristic to calculate target node

From: Raghavendra K T
Date: Mon Mar 24 2025 - 10:55:47 EST




On 3/24/2025 4:35 PM, Hillf Danton wrote:
On Sun, 23 Mar 2025 23:44:02 +0530 Raghavendra K T wrote
On 3/21/2025 4:23 PM, Hillf Danton wrote:
On Wed, 19 Mar 2025 19:30:24 +0000 Raghavendra K T wrote
One of the key challenges in PTE A bit based scanning is to find the right
target node to promote to.

Here is a simple heuristic-based approach:
While scanning the pages of any mm, we also scan the toptier pages that
belong to that mm. This gives us insight into the distribution of pages
across the toptier nodes and their recent accesses.

The current logic walks all the toptier nodes and picks the one with the
highest access count.
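As a rough userspace sketch of the walk described above (the struct, the node limit, and the function name here are illustrative stand-ins, not the patch's actual data structures):

```c
#define MAX_NUMNODES 8 /* illustrative limit, not the kernel's config */

/* Hypothetical per-mm stats gathered during PTE A-bit scanning. */
struct mm_scan_stats {
	unsigned long accesses[MAX_NUMNODES]; /* recent accesses per node */
	int is_toptier[MAX_NUMNODES];         /* 1 if node is in the top tier */
};

/*
 * Walk all toptier nodes and pick the one with the highest recent
 * access count; return -1 if no toptier node was observed.
 */
static int pick_promotion_target(const struct mm_scan_stats *s, int nr_nodes)
{
	int best = -1;
	unsigned long best_accesses = 0;
	int nid;

	for (nid = 0; nid < nr_nodes; nid++) {
		if (!s->is_toptier[nid])
			continue;
		if (best < 0 || s->accesses[nid] > best_accesses) {
			best = nid;
			best_accesses = s->accesses[nid];
		}
	}
	return best;
}
```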

My $.02 on selecting the promotion target node, given a simple multi-tier system.

Tk /* top Tierk (k > 0) has K (K > 0) nodes */
...
Tj /* Tierj (j > 0) has J (J > 0) nodes */
...
T0 /* bottom Tier0 has O (O > 0) nodes */

Unless the config comes from user space (a sysfs window, for example, should be opened):

1, adopt the data flow pattern of L3 cache <--> DRAM <--> SSD, to only
select Tj+1 when promoting pages in Tj.
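Point 1, restricting promotion from Tj to Tj+1, can be modeled as the reverse of the demotion path; a hypothetical sketch (per-tier rather than the kernel's per-node demotion targets):

```c
#define NO_TARGET (-1)

/*
 * demote_target[t] gives the tier that tier t demotes to (NO_TARGET
 * for the bottom tier). The promotion target of a tier is then the
 * reverse mapping: the tier that demotes into it.
 * Purely illustrative; the kernel keeps per-node demotion targets.
 */
static int promote_target_tier(const int *demote_target, int nr_tiers, int tier)
{
	int t;

	for (t = 0; t < nr_tiers; t++)
		if (demote_target[t] == tier)
			return t; /* tier t demotes into 'tier', so promote there */
	return NO_TARGET; /* top tier: nowhere to promote */
}
```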


Hello Hillf,
Thanks for giving this some thought. This looks like a good idea in
general. It could mostly be implemented with the reverse of the preferred
demotion target?

Thinking out loud: can there be exception cases, similar to non-temporal
copy operations where we don't want to pollute the cache?
I mean, cases where we don't want to hop via a middle tier node?

Given page cache, direct IO and coherent DMA have their roles to play.


Agree.

2, select the node in Tj+1 that has the most free pages for promotion
by default.
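The "most free pages" default in point 2 could look like the following (the inputs and helper name are illustrative; a real implementation would read per-node free-page statistics and respect watermarks):

```c
/*
 * Among the candidate nodes of the next-higher tier, choose the one
 * with the most free pages; return -1 if there is no candidate.
 * Illustrative only, not kernel code.
 */
static int pick_most_free_node(const unsigned long *free_pages,
			       const int *in_next_tier, int nr_nodes)
{
	int best = -1;
	int nid;

	for (nid = 0; nid < nr_nodes; nid++) {
		if (!in_next_tier[nid])
			continue;
		if (best < 0 || free_pages[nid] > free_pages[best])
			best = nid;
	}
	return best;
}
```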

Not sure if this is always productive.

Trying to cure all pains with ONE pill wastes minutes I think.


Very much true.

To achieve reliable high-order pages, for example, the page allocator
cannot work well in combination with kswapd and kcompactd without clear
boundaries drawn between the three parties.

For example:
node 0-1 toptier (100GB)
node 2 slowtier

Suppose a workload (occupying 80GB in total) is running on the CPUs of
node 1, where 40GB is already on node 1 and the remaining 40GB is on node 2.

Now, is it preferred to consolidate the workload on node 1 when the
slowtier data becomes hot?

Yes and no (say, a couple of seconds later memory pressure rises on node 0).

In the case of yes, I would like to turn on autonuma in the toptier
instead, without bothering to select the target node. You see, a line is
drawn between autonuma and slowtier promotion now.

Yes, the goal has been slow-tier promotion without much overhead to the
system, plus working cooperatively with NUMAB1 for top-tier balancing
(e.g., by providing hints about hot VMAs).