Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes

From: Honggyu Kim
Date: Tue Mar 04 2025 - 07:54:03 EST


Hi Joshua,

On 3/4/2025 6:56 AM, Joshua Hahn wrote:
On Thu, 27 Feb 2025 12:20:03 +0900 Honggyu Kim <honggyu.kim@xxxxxx> wrote:

Hi Honggyu, thank you for taking time to review my patch, as always!

My pleasure!

I thought I had sent this, but it seems like it was left in my draft
without being sent.

I will follow Gregory's advice and we will drop the patch from this series,
and send the first patch only (with Yunjeong's changes). Thanks again!

It'd be great if you could add her with the following.
Co-developed-by: Yunjeong Mun <yunjeong.mun@xxxxxx>



On 2/27/2025 11:32 AM, Honggyu Kim wrote:
Hi Joshua,

On 2/27/2025 6:35 AM, Joshua Hahn wrote:
We should never try to allocate memory from a memoryless node. Creating a
sysfs knob to control its weighted interleave weight does not make sense,
and can be unsafe.

Only create weighted interleave weight knobs for nodes with memory.

Signed-off-by: Joshua Hahn <joshua.hahnjy@xxxxxxxxx>
---
  mm/mempolicy.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4cc04ff8f12c..50cbb7c047fa 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -3721,7 +3721,7 @@ static int add_weighted_interleave_group(struct
kobject *root_kobj)
          return err;
      }
-    for_each_node_state(nid, N_POSSIBLE) {

Actually, we're aware of this issue and currently trying to fix this.
In our system, we've attached 4ch of CXL memory for each socket as
follows.

        node0             node1
      +-------+   UPI   +-------+
      | CPU 0 |-+-----+-| CPU 1 |
      +-------+         +-------+
      | DRAM0 |         | DRAM1 |
      +---+---+         +---+---+
          |                 |
      +---+---+         +---+---+
      | CXL 0 |         | CXL 4 |
      +---+---+         +---+---+
      | CXL 1 |         | CXL 5 |
      +---+---+         +---+---+
      | CXL 2 |         | CXL 6 |
      +---+---+         +---+---+
      | CXL 3 |         | CXL 7 |
      +---+---+         +---+---+
        node2             node3

The 4ch of CXL memory are detected as a single NUMA node in each socket,
but it shows as follows with the current N_POSSIBLE loop.

$ ls /sys/kernel/mm/mempolicy/weighted_interleave/
node0 node1 node2 node3 node4 node5
node6 node7 node8 node9 node10 node11

FYI, we used to set node2 and node3 only for weights for CXL memory here
and ignored node{4-11}. That sounds silly but it worked.


I see. For my education, would you mind explaining how the numbering works
here? I am not very familiar with this setup, and not sure how you would
figure out what node is which, just by looking at the numbering.

Regarding the numbering, I'm not 100% sure, but I guess there could be a
logical NUMA node that combines 4ch of CXL memory and 4 nodes for CXL memory so in total 5 nodes per socket.

I don't have much knowledge on this but maybe this is related to PXM
(Proximity Domain).


+    for_each_node_state(nid, N_MEMORY) {

Thinking it again, we can leave it as a separate patch but add our patch
on top of it.

That sounds good to me.

The only concern I have is having only N_MEMORY patch hides weight
setting knobs for CXL memory and it makes there is no way to set weight
values to CXL memory in my system.

You can use weighted interleave auto-tuning : -)

Not possible because using N_MEMORY doesn't provide "node" knobs for CXL
memory at all as follows.

$ ls /sys/kernel/mm/mempolicy/weighted_interleave/
node0 node1

We need node2 and node3 for CXL memory here.

In all seriousness, this makes sense. It seems pretty problematic that
the knobs aren't created for the CXL channels,

Yeah, it's even worse than the current status.

and I'm not sure that hiding> it is the correct approach here (it was not my intent, either).

It isn't your problem but we shouldn't hide those nodes until it is
correctly fixed with hot plugging event handler.


IMHO, this and our patch is better to be submitted together.

That sounds good. We can hold off on this patch then, and just consider
the first patch of this series. Thank you for letting me know!

The N_POSSIBLE and N_MEMORY stuffs should had been fixed earlier than
this work. I will take a few days if we can submit it together.


Thank you for always reviewing my patches. Have a great day!
Joshua

Thanks for your work and have a great day you too!

Kind regards,
Honggyu


Sent using hkml (https://github.com/sjp38/hackermail)