Re: [PATCH] driver/base/node.c: Fix softlockups during the initialization of large systems with interleaved memory blocks

From: Donet Tom
Date: Tue Mar 11 2025 - 11:00:37 EST



On 3/11/25 2:59 PM, David Hildenbrand wrote:
On 11.03.25 09:56, Donet Tom wrote:

On 3/10/25 6:22 PM, Greg Kroah-Hartman wrote:
On Mon, Mar 10, 2025 at 06:53:05AM -0500, Donet Tom wrote:
On large systems with more than 64TB of DRAM, if the memory blocks
are interleaved, node initialization (node_dev_init()) could take
a long time since it iterates over each memory block. If the memory
block belongs to the current iterating node, the first pfn_to_nid
will provide the correct value. Otherwise, it will iterate over all
PFNs and check the nid. On non-preemptive kernels, this can result
in a watchdog softlockup warning. Even though CONFIG_PREEMPT_LAZY
is enabled in kernels now [1], we may still need to fix older
stable kernels to avoid encountering these kernel warnings during
boot.

This patch adds a cond_resched() call in node_dev_init() to avoid
this warning.

node_dev_init()
      register_one_node
          register_memory_blocks_under_node
              walk_memory_blocks()
                  register_mem_block_under_node_early
                      get_nid_for_pfn
                          early_pfn_to_nid

In my system node4 has a memory block ranging from memory30351
to memory38524, and memory128433. The memory blocks between
memory38524 and memory128433 do not belong to this node.

In  walk_memory_blocks() we iterate over all memblocks starting
from memory38524 to memory128433.
In register_mem_block_under_node_early(), up to memory38524, the
first pfn correctly returns the corresponding nid and the function
returns from there. But after memory38524 and until memory128433,
the loop iterates through each pfn and checks the nid. Since the nid
does not match the required nid, the loop continues. This causes
the soft lockups.

[1]: https://lore.kernel.org/linuxppc-dev/20241116192306.88217-1-sshegde@xxxxxxxxxxxxx/
Fixes: 2848a28b0a60 ("drivers/base/node: consolidate node device subsystem initialization in node_dev_init()")
Signed-off-by: Donet Tom <donettom@xxxxxxxxxxxxx>
---
   drivers/base/node.c | 1 +
   1 file changed, 1 insertion(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0ea653fa3433..107eb508e28e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -975,5 +975,6 @@ void __init node_dev_init(void)
           ret = register_one_node(i);
           if (ret)
               panic("%s() failed to add node: %d\n", __func__, ret);
+        cond_resched();
That's a horrible hack, sorry, but no, we can't sprinkle this around in
random locations, especially as this is actually fixed by using a
different scheduler model as you say.

Why not just make the code faster so as to avoid the long time this
takes?


Thanks Greg

I was checking the code to see how to make it faster in order to
avoid the long time it takes.

In below code path

register_one_node()
      register_memory_blocks_under_node()
          walk_memory_blocks()
              register_mem_block_under_node_early()

walk_memory_blocks() is iterating over all memblocks, and
register_mem_block_under_node_early() is iterating over the PFNs
to find the page_nid

If the page_nid and the requested nid are the same, we will register
the memblock under the node and return.

But if get_nid_for_pfn() returns a different nid (This means the
memblock is part of different nid), then the loop will iterate
over all PFNs of the memblock and check if the page_nid returned by
get_nid_for_pfn() and the requested nid are the same.

IIUC, since all PFNs of a memblock return the same page_nid, we
should stop the loop if the page_nid returned does not match the
requested nid.

Nodes can easily partially span "memory blocks". So your patch would break these setups?


Does this mean one memory block can be part of two or
more nodes ? Some PFNs belong to one node, and the remaining
PFNs belong to another node?"

In that case, the current implementation adds the memory block to
only one node. In register_mem_block_under_node_early(), if the
first PFN returns the correct expected nid, the memory block will
be registered under that node. It does not iterate over the other
PFNs. Is this because of the assumption that one memory block
cannot be part of multiple nodes?

If one memory block cannot be part of multiple nodes, then we can
break if get_nid_for_pfn() returns the wrong nid, right?



But I agree that iterating all pages is rather nasty. I wonder if we could just walk all memblocks in the range?

early_pfn_to_nid()->__early_pfn_to_nid() would lookup the memblock ... for each PFN. Testing a range instead could be better.

Something like "early_pfn_range_spans_nid()" could be useful for that.

Do you mean we should do it section by section within a memory block?

Thanks
Donet