Re: [PATCH] mm/memory hotplug/unplug: Optimize zone contiguous check when changing pfn range
From: David Hildenbrand (Arm)
Date: Mon Mar 23 2026 - 07:03:41 EST
On 3/19/26 10:56, Yuan Liu wrote:
> When move_pfn_range_to_zone() or remove_pfn_range_from_zone() is invoked, it
> updates zone->contiguous by checking the new zone's pfn range from the
> beginning to the end, regardless of the previous state of the old zone. When
> the zone's pfn range is large, the cost of traversing the pfn range to
> update zone->contiguous can be significant.
>
> Add a new pages_with_online_memmap member to struct zone: the number of
> pages within the zone that have an online memmap. It includes present pages
> and memory holes that have a memmap. When spanned_pages ==
> pages_with_online_memmap, pfn_to_page() can be performed without further
> checks on any pfn within the zone span.
>
> The following memory hotplug test cases for a VM [1], run in the
> environment [2], show that this optimization significantly reduces
> memory hotplug time [3].
>
> +----------------+------+---------------+--------------+----------------+
> | | Size | Time (before) | Time (after) | Time Reduction |
> | +------+---------------+--------------+----------------+
> | Plug Memory | 256G | 10s | 3s | 70% |
> | +------+---------------+--------------+----------------+
> | | 512G | 36s | 7s | 81% |
> +----------------+------+---------------+--------------+----------------+
>
> +----------------+------+---------------+--------------+----------------+
> | | Size | Time (before) | Time (after) | Time Reduction |
> | +------+---------------+--------------+----------------+
> | Unplug Memory | 256G | 11s | 4s | 64% |
> | +------+---------------+--------------+----------------+
> | | 512G | 36s | 9s | 75% |
> +----------------+------+---------------+--------------+----------------+
>
> [1] Qemu commands to hotplug 256G/512G memory for a VM:
> object_add memory-backend-ram,id=hotmem0,size=256G/512G,share=on
> device_add virtio-mem-pci,id=vmem1,memdev=hotmem0,bus=port1
> qom-set vmem1 requested-size 256G/512G (Plug Memory)
> qom-set vmem1 requested-size 0G (Unplug Memory)
>
> [2] Hardware : Intel Icelake server
> Guest Kernel : v7.0-rc4
> Qemu : v9.0.0
>
> Launch VM :
> qemu-system-x86_64 -accel kvm -cpu host \
> -drive file=./Centos10_cloud.qcow2,format=qcow2,if=virtio \
> -drive file=./seed.img,format=raw,if=virtio \
> -smp 3,cores=3,threads=1,sockets=1,maxcpus=3 \
> -m 2G,slots=10,maxmem=2052472M \
> -device pcie-root-port,id=port1,bus=pcie.0,slot=1,multifunction=on \
> -device pcie-root-port,id=port2,bus=pcie.0,slot=2 \
> -nographic -machine q35 \
> -nic user,hostfwd=tcp::3000-:22
>
> Guest kernel auto-onlines newly added memory blocks:
> echo online > /sys/devices/system/memory/auto_online_blocks
>
> [3] The time from typing the QEMU commands in [1] until the output of
> 'grep MemTotal /proc/meminfo' on the guest shows that all hotplugged
> memory is recognized.
>
> Reported-by: Nanhai Zou <nanhai.zou@xxxxxxxxx>
> Reported-by: Chen Zhang <zhangchen.kidd@xxxxxx>
> Tested-by: Yuan Liu <yuan1.liu@xxxxxxxxx>
> Reviewed-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@xxxxxxxxx>
> Reviewed-by: Yu C Chen <yu.c.chen@xxxxxxxxx>
> Reviewed-by: Pan Deng <pan.deng@xxxxxxxxx>
> Reviewed-by: Nanhai Zou <nanhai.zou@xxxxxxxxx>
> Reviewed-by: Yuan Liu <yuan1.liu@xxxxxxxxx>
> Co-developed-by: Tianyou Li <tianyou.li@xxxxxxxxx>
> Signed-off-by: Tianyou Li <tianyou.li@xxxxxxxxx>
> Signed-off-by: Yuan Liu <yuan1.liu@xxxxxxxxx>
> ---
[...]
>
> +/**
> + * zone_is_contiguous - test whether a zone is contiguous
> + * @zone: the zone to test.
> + *
> + * In a contiguous zone, it is valid to call pfn_to_page() on any pfn in the
> + * spanned zone without requiring pfn_valid() or pfn_to_online_page() checks.
I think there is a small catch to it: users should protect against concurrent
memory offlining. I recall that, for compaction, there either was some protection
in place or the race window was effectively impossible to hit.
Maybe we should add here for completeness: "Note that missing synchronization with
memory offlining makes any PFN traversal prone to races."
> + *
> + * Returns: true if contiguous, otherwise false.
> + */
> +static inline bool zone_is_contiguous(const struct zone *zone)
> +{
> + return READ_ONCE(zone->spanned_pages) ==
> + READ_ONCE(zone->pages_with_online_memmap);
^ should be vertically aligned
> +}
> +
> static inline bool zone_is_initialized(const struct zone *zone)
> {
> return zone->initialized;
> diff --git a/mm/internal.h b/mm/internal.h
> index cb0af847d7d9..7c4c8ab68bde 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -793,21 +793,17 @@ extern struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
> static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> unsigned long end_pfn, struct zone *zone)
> {
> - if (zone->contiguous)
> + if (zone_is_contiguous(zone) && zone_spans_pfn(zone, start_pfn)) {
Do we really need the zone_spans_pfn() check? The caller must make sure that the
zone spans the PFN range before calling this function.
Compaction does that by walking only PFNs in the range.
The old "if (zone->contiguous)" check also expected a caller to handle that.
> + VM_BUG_ON(end_pfn > zone_end_pfn(zone));
No VM_BUG_ONs please. But I think we can also drop this.
> return pfn_to_page(start_pfn);
> + }
>
> return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
> }
>
> -void set_zone_contiguous(struct zone *zone);
> bool pfn_range_intersects_zones(int nid, unsigned long start_pfn,
> unsigned long nr_pages);
>
> -static inline void clear_zone_contiguous(struct zone *zone)
> -{
> - zone->contiguous = false;
> -}
> -
> extern int __isolate_free_page(struct page *page, unsigned int order);
> extern void __putback_isolated_page(struct page *page, unsigned int order,
> int mt);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index bc805029da51..2ba7a394a64b 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -492,11 +492,11 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> pfn = find_smallest_section_pfn(nid, zone, end_pfn,
> zone_end_pfn(zone));
> if (pfn) {
> - zone->spanned_pages = zone_end_pfn(zone) - pfn;
> + WRITE_ONCE(zone->spanned_pages, zone_end_pfn(zone) - pfn);
> zone->zone_start_pfn = pfn;
> } else {
> zone->zone_start_pfn = 0;
> - zone->spanned_pages = 0;
> + WRITE_ONCE(zone->spanned_pages, 0);
> }
> } else if (zone_end_pfn(zone) == end_pfn) {
> /*
> @@ -508,10 +508,10 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> pfn = find_biggest_section_pfn(nid, zone, zone->zone_start_pfn,
> start_pfn);
> if (pfn)
> - zone->spanned_pages = pfn - zone->zone_start_pfn + 1;
> + WRITE_ONCE(zone->spanned_pages, pfn - zone->zone_start_pfn + 1);
> else {
> zone->zone_start_pfn = 0;
> - zone->spanned_pages = 0;
> + WRITE_ONCE(zone->spanned_pages, 0);
> }
> }
> }
As the AI review points out, we should also make sure that
resize_zone_range() updates it with a WRITE_ONCE().
But I am starting to wonder whether, as a first step, we should leave
the zone->contiguous bool in place. Then we would have to worry less about
reorderings of reads/writes of spanned_pages vs. pages_with_online_memmap.
See below.
[...]
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index df34797691bd..96690e550024 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -946,6 +946,7 @@ static void __init memmap_init_zone_range(struct zone *zone,
> unsigned long zone_start_pfn = zone->zone_start_pfn;
> unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages;
> int nid = zone_to_nid(zone), zone_id = zone_idx(zone);
> + unsigned long zone_hole_start, zone_hole_end;
>
> start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
> end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);
> @@ -957,8 +958,19 @@ static void __init memmap_init_zone_range(struct zone *zone,
> zone_end_pfn, MEMINIT_EARLY, NULL, MIGRATE_MOVABLE,
> false);
>
> - if (*hole_pfn < start_pfn)
> + WRITE_ONCE(zone->pages_with_online_memmap,
> + READ_ONCE(zone->pages_with_online_memmap) +
> + (end_pfn - start_pfn));
> +
> + if (*hole_pfn < start_pfn) {
> init_unavailable_range(*hole_pfn, start_pfn, zone_id, nid);
> + zone_hole_start = clamp(*hole_pfn, zone_start_pfn, zone_end_pfn);
> + zone_hole_end = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
> + if (zone_hole_start < zone_hole_end)
> + WRITE_ONCE(zone->pages_with_online_memmap,
> + READ_ONCE(zone->pages_with_online_memmap) +
> + (zone_hole_end - zone_hole_start));
> + }
The range can have larger holes without a memmap, and I think we would be
missing pages handled by the other init_unavailable_range() call?
There is one question for Mike, though: couldn't it happen that the
init_unavailable_range() call in memmap_init() would initialize
the memmap outside of the node/zone span? If so, I wonder whether we
would want to adjust the node+zone space to include these ranges.
Later memory onlining could make these ranges suddenly fall into the
node/zone span.
So that requires some thought.
Maybe we should start with this (untested):