Re: [PATCH 2/3] mm/memory_hotplug: Reset node's state when empty during offline

From: David Hildenbrand
Date: Thu Apr 28 2022 - 08:30:45 EST


On 07.03.22 16:07, Oscar Salvador wrote:
> All possible nodes are now pre-allocated at boot time by free_area_init()->
> free_area_init_node(), and those which are to be hot-plugged are initialized
> later on by hotadd_init_pgdat()->free_area_init_core_hotplug() when they
> become online.
>
> free_area_init_core_hotplug() calls pgdat_init_internals() and
> zone_init_internals() to initialize some internal data structures
> and zeroes a few pgdat fields.
>
> But we do already call pgdat_init_internals() and zone_init_internals()
> for all possible nodes back in free_area_init_core(), and pgdat fields
> are already zeroed because the pre-allocation memsets with 0s the
> structure, meaning we do not need to repeat the process when
> the node becomes online.
>
> So initialize it only once when booting, and make sure to reset
> the fields we care about to 0 when the node goes empty.
> The only thing we need to check for is to allocate per_cpu_nodestats
> struct the very first time this node goes online.
>
> node_reset_state() is the function in charge of resetting pgdat's fields,
> and it is called when offline_pages() detects that the node has no
> memory left.
>
> Signed-off-by: Oscar Salvador <osalvador@xxxxxxx>
> ---
> include/linux/memory_hotplug.h | 2 +-
> mm/memory_hotplug.c | 58 +++++++++++++++++++++-------------
> mm/page_alloc.c | 49 +++++-----------------------
> 3 files changed, 45 insertions(+), 64 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 76bf2de86def..fcf4c9a023cc 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -319,7 +319,7 @@ extern void set_zone_contiguous(struct zone *zone);
> extern void clear_zone_contiguous(struct zone *zone);
>
> #ifdef CONFIG_MEMORY_HOTPLUG
> -extern void __ref free_area_init_core_hotplug(struct pglist_data *pgdat);
> +extern bool pgdat_has_boot_nodestats(pg_data_t *pgdat);
> extern int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
> extern int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
> extern int add_memory_resource(int nid, struct resource *resource,
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index ddc62f8b591f..07cece9e22e4 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1164,18 +1164,18 @@ static void reset_node_present_pages(pg_data_t *pgdat)
> /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
> static pg_data_t __ref *hotadd_init_pgdat(int nid)
> {
> - struct pglist_data *pgdat;
> + struct pglist_data *pgdat = NODE_DATA(nid);
>
> /*
> - * NODE_DATA is preallocated (free_area_init) but its internal
> - * state is not allocated completely. Add missing pieces.
> - * Completely offline nodes stay around and they just need
> - * reintialization.
> + * NODE_DATA is preallocated (free_area_init), the only thing missing
> + * is to allocate its per_cpu_nodestats struct and to build node's
> + * zonelists. The allocation of per_cpu_nodestats only needs to be done
> + * the very first time this node is brought up, as we reset its state
> + * when all node's memory goes offline.
> */
> - pgdat = NODE_DATA(nid);
> -
> - /* init node's zones as empty zones, we don't have any present pages.*/
> - free_area_init_core_hotplug(pgdat);
> + if (pgdat_has_boot_nodestats(pgdat))
> + pgdat->per_cpu_nodestats = alloc_percpu_gfp(struct per_cpu_nodestat,
> + __GFP_ZERO);
>
> /*
> * The node we allocated has no zone fallback lists. For avoiding
> @@ -1183,15 +1183,6 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid)
> */
> build_all_zonelists(pgdat);
>
> - /*
> - * When memory is hot-added, all the memory is in offline state. So
> - * clear all zones' present_pages because they will be updated in
> - * online_pages() and offline_pages().
> - * TODO: should be in free_area_init_core_hotplug?
> - */
> - reset_node_managed_pages(pgdat);
> - reset_node_present_pages(pgdat);
> -
> return pgdat;
> }
>
> @@ -1799,6 +1790,30 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
> node_clear_state(node, N_MEMORY);
> }
>
> +static void node_reset_state(int node)
> +{
> + pg_data_t *pgdat = NODE_DATA(node);
> + int cpu;
> +
> + kswapd_stop(node);
> + kcompactd_stop(node);
> +
> + reset_node_managed_pages(pgdat);
> + reset_node_present_pages(pgdat);
> +
> + pgdat->nr_zones = 0;
> + pgdat->kswapd_order = 0;
> + pgdat->kswapd_highest_zoneidx = 0;
> + pgdat->node_start_pfn = 0;


I'm confused why we have to mess with
* present pages
* managed pages
* node_start_pfn

here at all.

1) If there were any present pages left, calling node_reset_state()
would be a BUG.
2) If there were any managed pages left, calling node_reset_state()
would be a BUG.
3) node_start_pfn will be properly updated by
remove_pfn_range_from_zone()->update_pgdat_span()
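
Points 1) and 2) can be sketched in user space as follows. This is only
a toy model, not kernel code: the toy_* structs and helpers are
hypothetical names made up for illustration, and the assert() merely
stands in for the "would be a BUG" condition. The idea is that, by the
time the node is considered empty, both counters are already zero in
every zone, so resetting them changes nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_ZONES 4	/* hypothetical; unrelated to the kernel's MAX_NR_ZONES */

/* Toy stand-ins for struct zone / pg_data_t, reduced to the two counters. */
struct toy_zone {
	unsigned long present_pages;
	unsigned long managed_pages;
};

struct toy_pgdat {
	struct toy_zone zones[TOY_MAX_ZONES];
};

/* Sum of present pages over all zones of the node. */
static unsigned long toy_sum_present(const struct toy_pgdat *p)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < TOY_MAX_ZONES; i++)
		sum += p->zones[i].present_pages;
	return sum;
}

/* Sum of managed pages over all zones of the node. */
static unsigned long toy_sum_managed(const struct toy_pgdat *p)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < TOY_MAX_ZONES; i++)
		sum += p->zones[i].managed_pages;
	return sum;
}

/* True if zeroing the counters would not change anything. */
static bool toy_reset_would_be_noop(const struct toy_pgdat *p)
{
	return toy_sum_present(p) == 0 && toy_sum_managed(p) == 0;
}

/* Stand-in for the "would be a BUG" check on an empty node. */
static void toy_assert_node_empty(const struct toy_pgdat *p)
{
	assert(toy_reset_would_be_noop(p));
}
```

In this model, calling toy_assert_node_empty() on a zero-initialized
struct toy_pgdat passes, while a single leftover managed page in any
zone trips the assertion.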


To make it clearer, I *think* touching node_start_pfn is very wrong.

What if the node still has ZONE_DEVICE memory? It doesn't count towards
present pages but only towards spanned pages, and we'd be messing with
the start of the spanned range.

remove_pfn_range_from_zone()->update_pgdat_span() should be the only
place that modifies the spanned range when offlining.
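
The ZONE_DEVICE concern can be illustrated with a small user-space
sketch. Again, this is a toy model under stated assumptions and not
kernel code: struct toy_node and the toy_* helpers are hypothetical,
only the field names mirror pg_data_t, and the PFN values are made up.
It shows a node with zero present pages whose span still covers a
device range, and what zeroing the start PFN does to that span.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for pg_data_t, reduced to the span and the present count. */
struct toy_node {
	unsigned long node_start_pfn;		/* first PFN spanned by the node */
	unsigned long node_spanned_pages;	/* span size, device memory included */
	unsigned long node_present_pages;	/* device memory is not counted here */
};

/* "Empty" as seen by the offline path in this model: no present pages. */
static bool toy_node_is_empty(const struct toy_node *n)
{
	return n->node_present_pages == 0;
}

/* Does [node_start_pfn, node_start_pfn + node_spanned_pages) cover pfn? */
static bool toy_node_spans_pfn(const struct toy_node *n, unsigned long pfn)
{
	return pfn >= n->node_start_pfn &&
	       pfn < n->node_start_pfn + n->node_spanned_pages;
}

/* What the quoted hunk does to the span start, unconditionally. */
static void toy_reset_start_pfn(struct toy_node *n)
{
	n->node_start_pfn = 0;
}

/*
 * Returns true if a device PFN that was covered before the reset is
 * still covered afterwards. All values below are made up.
 */
static bool toy_device_pfn_survives_reset(void)
{
	struct toy_node n = {
		.node_start_pfn = 0x100000,
		.node_spanned_pages = 0x8000,
		.node_present_pages = 0,
	};
	unsigned long device_pfn = 0x104000;

	assert(toy_node_is_empty(&n));
	assert(toy_node_spans_pfn(&n, device_pfn));
	toy_reset_start_pfn(&n);
	return toy_node_spans_pfn(&n, device_pfn);
}
```

In this model toy_device_pfn_survives_reset() returns false: the node
counts as empty, yet after the reset its span no longer covers the
device PFN it covered before.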

--
Thanks,

David / dhildenb