Re: [PATCH 2/2] base/node: Use curr_node_memblock_intersect_memory_block to Get Memory Block NID if CONFIG_DEFERRED_STRUCT_PAGE_INIT is Set
From: Mike Rapoport
Date: Fri Apr 11 2025 - 07:04:24 EST
On Fri, Apr 11, 2025 at 12:27:28AM +0530, Donet Tom wrote:
>
> On 4/10/25 1:37 PM, Mike Rapoport wrote:
> > On Wed, Apr 09, 2025 at 10:57:57AM +0530, Donet Tom wrote:
> > > In the current implementation, when CONFIG_DEFERRED_STRUCT_PAGE_INIT is
> > > set, we iterate over all PFNs in the memory block and use
> > > early_pfn_to_nid to find the NID until a match is found.
> > >
> > > This patch we are using curr_node_memblock_intersect_memory_block() to
> > > check if the current node's memblock intersects with the memory block
> > > passed when CONFIG_DEFERRED_STRUCT_PAGE_INIT is set. If an intersection
> > > is found, the memory block is added to the current node.
> > >
> > > If CONFIG_DEFERRED_STRUCT_PAGE_INIT is not set, the existing mechanism
> > > for finding the NID will continue to be used.
> > I don't think we really need different mechanisms for different settings of
> > CONFIG_DEFERRED_STRUCT_PAGE_INIT.
> >
> > node_dev_init() runs after all struct pages are already initialized and can
> > always use pfn_to_nid().
>
>
> In the current implementation, if CONFIG_DEFERRED_STRUCT_PAGE_INIT
> is enabled, we perform a binary search in the memblock region to
> determine the pfn's nid. Otherwise, we use pfn_to_nid() to obtain
> the pfn's nid.
>
> Your point is that we could unify this logic and always use
> pfn_to_nid() to determine the pfn's nid, regardless of whether
> CONFIG_DEFERRED_STRUCT_PAGE_INIT is set. Is that
> correct?
Yes, struct pages should be ready by the time node_dev_init() is called
even when CONFIG_DEFERRED_STRUCT_PAGE_INIT is set.
> >
> > kernel_init_freeable() ->
> > page_alloc_init_late(); /* completes initialization of deferred pages */
> > ...
> > do_basic_setup() ->
> > driver_init() ->
> > node_dev_init();
> >
> > The next step could be refactoring register_mem_block_under_node_early() to
> > loop over memblock regions rather than over pfns.
> So it the current implementation
>
> node_dev_init()
> register_one_node
> register_memory_blocks_under_node
> walk_memory_blocks()
> register_mem_block_under_node_early
> get_nid_for_pfn
>
> We get each node's start and end PFN from the pg_data. Using these
> values, we determine the memory block's start and end within the
> current node. To identify the node to which these memory block
> belongs,we iterate over each PFN in the range.
>
> The problem I am facing is,
>
> In my system node4 has a memory block ranging from memory30351
> to memory38524, and memory128433. The memory blocks between
> memory38524 and memory128433 do not belong to this node.
>
> In walk_memory_blocks() we iterate over all memory blocks starting
> from memory38524 to memory128433.
> In register_mem_block_under_node_early(), up to memory38524, the
> first pfn correctly returns the corresponding nid and the function
> returns from there. But after memory38524 and until memory128433,
> the loop iterates through each pfn and checks the nid. Since the nid
> does not match the required nid, the loop continues. This causes
> the soft lockups.
>
> This issue occurs only when CONFIG_DEFERRED_STRUCT_PAGE_INIT
> is enabled, as a binary search is used to determine the PFN's nid. When
> this configuration is disabled, pfn_to_nid is faster, and the issue does
> not seen.( Faster because nid is getting from page)
>
> To speed up the code when CONFIG_DEFERRED_STRUCT_PAGE_INIT
> is enabled, I added this function that iterates over all memblock regions
> for each memory block to determine its nid.
>
> "Loop over memblock regions instead of iterating over PFNs" -
> My question is - in register_one_node, do you mean that we should iterate
> over all memblock regions, identify the regions belonging to the current
> node, and then retrieve the corresponding memory blocks to register them
> under that node?
I looked more closely at register_mem_block_under_node_early() and
iteration over memblock regions won't make sense there.
It might make sense to use for_each_mem_range() as top level loop in
node_dev_init(), but that's a separate topic.
> Thanks
> Donet
>
--
Sincerely yours,
Mike.