Re: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM

From: Aneesh Kumar K V
Date: Mon Jun 06 2022 - 06:23:39 EST


On 6/6/22 3:41 PM, Bharata B Rao wrote:
> On 6/3/2022 2:34 PM, Aneesh Kumar K V wrote:
>> On 6/2/22 12:06 PM, Bharata B Rao wrote:
>>> On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
>>>> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>>>>> I was experimenting with this patchset and found this behaviour.
>>>>> Here's what I did:
>>>>>
>>>>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>>>>> driver by default.
>>>>>
>>>>> Use it as RAM by binding it to dax kmem driver. It now appears as
>>>>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>>>>> where DRAM already exists)
>>>>
>>>> That should have placed it in memtier2.
>>>>
>>>>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>>>>> that expected to happen automatically when a node with dax kmem
>>>>> device comes up?
>>>>
>>>> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?
>>>
>>> When we reach node_set_memory_tier_rank(), node1 (that has the pmem device)
>>> is already part of memtier1 whose nodelist shows 0-1.
>>
>> can you find out which code path added node1 to memtier1?
>
> node_set_memory_tier_rank+0x63/0x80
> migrate_on_reclaim_callback+0x40/0x4d
> blocking_notifier_call_chain+0x68/0x90
> memory_notify+0x1b/0x20
> online_pages+0x257/0x2f0
> memory_subsys_online+0x99/0x150
> device_online+0x65/0x90
> online_memory_block+0x1b/0x20
> walk_memory_blocks+0x85/0xc0
> ? generic_online_page+0x40/0x40
> add_memory_resource+0x1fa/0x2d0
> add_memory_driver_managed+0x80/0xc0
> dev_dax_kmem_probe+0x1af/0x250
> dax_bus_probe+0x6e/0xa0
>
> After this the explicit call to node_set_memory_tier_rank(numa_node, MEMORY_RANK_PMEM)
> from dev_dax_kmem_probe() finds that the memtier is already set.
>
>> Do you have regular memory also appearing on node1?
>
> No, regular memory is on Node0.


Thanks for the stack trace. I was getting the KVM setup on my laptop to test this. We should move node_set_mem_tier() earlier. You had automatic onlining enabled on memory hotplug

/* online pages if requested */
if (mhp_default_online_type != MMOP_OFFLINE)
	walk_memory_blocks(start, size, NULL, online_memory_block);


which caused the memory to be onlined before we could call node_set_mem_tier(). That is a bug on my side. I will send you a change after testing.
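
To make the ordering problem concrete, here is a minimal userspace model of it. This is only a sketch, not kernel code and not the actual change: every name in it (model_set_tier(), model_online_notifier(), model_add_memory(), the two probe_* helpers and the tier enum) is hypothetical. It assumes, as the stack trace suggests, that the first caller to assign a tier to a node wins and that the online notifier assigns the default DRAM tier.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
#define MAX_NODES 4

enum tier { TIER_UNSET = 0, TIER_DRAM = 1, TIER_PMEM = 2 };

enum tier node_tier[MAX_NODES];

/*
 * Assumption: the first assignment wins. A later call finds the tier
 * already set and leaves it alone. Returns true if this call set it.
 */
bool model_set_tier(int node, enum tier t)
{
	assert(node >= 0 && node < MAX_NODES);
	if (node_tier[node] != TIER_UNSET)
		return false;
	node_tier[node] = t;
	return true;
}

/* Stands in for the memory-online notifier, which picks the default tier. */
void model_online_notifier(int node)
{
	model_set_tier(node, TIER_DRAM);
}

/* Stands in for the hotplug add path: onlines right away if requested. */
void model_add_memory(int node, bool auto_online)
{
	if (auto_online)
		model_online_notifier(node);
}

/* Ordering from the report: add (and maybe online) first, set tier after. */
enum tier probe_tier_after_add(int node, bool auto_online)
{
	node_tier[node] = TIER_UNSET;
	model_add_memory(node, auto_online);
	model_set_tier(node, TIER_PMEM);
	return node_tier[node];
}

/* Ordering with the tier set early, before the memory can be onlined. */
enum tier probe_tier_before_add(int node, bool auto_online)
{
	node_tier[node] = TIER_UNSET;
	model_set_tier(node, TIER_PMEM);
	model_add_memory(node, auto_online);
	return node_tier[node];
}
```

In this model, probe_tier_after_add(1, true) leaves node 1 in TIER_DRAM, which matches the reported behaviour, while probe_tier_before_add(1, true) leaves it in TIER_PMEM. With auto_online false, both orderings end up in TIER_PMEM, which would explain why the problem only shows up when automatic onlining is enabled.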

-aneesh