[PATCH v1 2/3] mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()

From: David Hildenbrand
Date: Fri Jun 07 2024 - 05:10:35 EST


We currently initialize the memmap such that PG_reserved is set and the
refcount of the page is 1. In virtio-mem code, we have to manually clear
that PG_reserved flag to make memory offlining with partially hotplugged
memory blocks possible: has_unmovable_pages() would otherwise bail out on
such pages.
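
For reference, the relevant part of hotplug memmap initialization
currently looks roughly like this (a simplified sketch of
memmap_init_range(), not the literal code):

	page = pfn_to_page(pfn);
	/* Sets the page refcount to 1 ("allocated"), among other things. */
	__init_single_page(page, pfn, zone, nid);
	if (context == MEMINIT_HOTPLUG)
		__SetPageReserved(page);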

We want to avoid PG_reserved where possible and move to typed pages
instead. We also want to further enlighten memory offlining code about
PG_offline: offline pages in an online memory section. One example is
handling managed page count adjustments in a cleaner way during memory
offlining.

So let's initialize the pages with PG_offline instead of PG_reserved.
generic_online_page()->__free_pages_core() will now clear that flag before
handing that memory to the buddy.
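
Sketch of the new flow for hotplugged !ZONE_DEVICE memory (see the
mm/mm_init.c and mm/page_alloc.c hunks below for the real code):

	/* memmap_init_range(), when initializing the memmap: */
	__SetPageOffline(page);
	...
	/* generic_online_page() -> __free_pages_core(), when onlining: */
	VM_WARN_ON_ONCE(PageReserved(page));
	__ClearPageOffline(page);
	set_page_count(page, 0);
	/* ... after which the pages are handed to the buddy ... */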

Note that the page refcount is still 1 and would forbid offlining of such
memory except when special care is taken during MEM_GOING_OFFLINE, as
currently only implemented by virtio-mem.
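
That special care is the memory notifier dance documented in the
page-flags.h comment below; a minimal sketch, loosely modeled on
virtio-mem (error handling and managed page count adjustments omitted;
pfn/nr_pages describe the fake-offline range):

	case MEM_GOING_OFFLINE:
		/*
		 * Drop the reference we hold on each fake-offline page;
		 * PageOffline() pages with a refcount of 0 are treated
		 * like free pages, allowing the block to get offlined.
		 */
		for (i = 0; i < nr_pages; i++)
			WARN_ON(!page_ref_dec_and_test(pfn_to_page(pfn + i)));
		break;
	case MEM_CANCEL_OFFLINE:
		/* Offlining was aborted: re-take our references. */
		for (i = 0; i < nr_pages; i++)
			page_ref_inc(pfn_to_page(pfn + i));
		break;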

With this change, we can now get non-PageReserved() pages in the XEN
balloon list. From what I can tell, that can already happen via
decrease_reservation(), so that should be fine.

HV-balloon should not really observe a change: partially onlined memory
blocks still cannot get surprise-offlined, because the refcount of these
PageOffline() pages is 1.

Update virtio-mem, HV-balloon and XEN-balloon code to be aware that
hotplugged pages are now PageOffline() instead of PageReserved() before
they are handed over to the buddy.

We'll leave the ZONE_DEVICE case alone for now.

Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
---
drivers/hv/hv_balloon.c | 5 ++---
drivers/virtio/virtio_mem.c | 18 ++++++++++++------
drivers/xen/balloon.c | 9 +++++++--
include/linux/page-flags.h | 12 +++++-------
mm/memory_hotplug.c | 16 ++++++++++------
mm/mm_init.c | 10 ++++++++--
mm/page_alloc.c | 32 +++++++++++++++++++++++---------
7 files changed, 67 insertions(+), 35 deletions(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index e000fa3b9f978..c1be38edd8361 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -693,9 +693,8 @@ static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg)
if (!PageOffline(pg))
__SetPageOffline(pg);
return;
- }
- if (PageOffline(pg))
- __ClearPageOffline(pg);
+ } else if (!PageOffline(pg))
+ return;

/* This frame is currently backed; online the page. */
generic_online_page(pg, 0);
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index a3857bacc8446..b90df29621c81 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1146,12 +1146,16 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
for (; nr_pages--; pfn++) {
struct page *page = pfn_to_page(pfn);

- __SetPageOffline(page);
- if (!onlined) {
+ if (!onlined)
+ /*
+ * Pages that have not been onlined yet were initialized
+ * to PageOffline(). Remember that we have to route them
+ * through generic_online_page().
+ */
SetPageDirty(page);
- /* FIXME: remove after cleanups */
- ClearPageReserved(page);
- }
+ else
+ __SetPageOffline(page);
+ VM_WARN_ON_ONCE(!PageOffline(page));
}
page_offline_end();
}
@@ -1166,9 +1170,11 @@ static void virtio_mem_clear_fake_offline(unsigned long pfn,
for (; nr_pages--; pfn++) {
struct page *page = pfn_to_page(pfn);

- __ClearPageOffline(page);
if (!onlined)
+ /* generic_online_page() will clear PageOffline(). */
ClearPageDirty(page);
+ else
+ __ClearPageOffline(page);
}
}

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index aaf2514fcfa46..528395133b4f8 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -146,7 +146,8 @@ static DECLARE_WAIT_QUEUE_HEAD(balloon_wq);
/* balloon_append: add the given page to the balloon. */
static void balloon_append(struct page *page)
{
- __SetPageOffline(page);
+ if (!PageOffline(page))
+ __SetPageOffline(page);

/* Lowmem is re-populated first, so highmem pages go at list tail. */
if (PageHighMem(page)) {
@@ -412,7 +413,11 @@ static enum bp_state increase_reservation(unsigned long nr_pages)

xenmem_reservation_va_mapping_update(1, &page, &frame_list[i]);

- /* Relinquish the page back to the allocator. */
+ /*
+ * Relinquish the page back to the allocator. Note that
+ * some pages, including ones added via xen_online_page(), might
+ * not be marked reserved; free_reserved_page() will handle that.
+ */
free_reserved_page(page);
}

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f04fea86324d9..e0362ce7fc109 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -30,16 +30,11 @@
* - Pages falling into physical memory gaps - not IORESOURCE_SYSRAM. Trying
* to read/write these pages might end badly. Don't touch!
* - The zero page(s)
- * - Pages not added to the page allocator when onlining a section because
- * they were excluded via the online_page_callback() or because they are
- * PG_hwpoison.
* - Pages allocated in the context of kexec/kdump (loaded kernel image,
* control pages, vmcoreinfo)
* - MMIO/DMA pages. Some architectures don't allow to ioremap pages that are
* not marked PG_reserved (as they might be in use by somebody else who does
* not respect the caching strategy).
- * - Pages part of an offline section (struct pages of offline sections should
- * not be trusted as they will be initialized when first onlined).
* - MCA pages on ia64
* - Pages holding CPU notes for POWER Firmware Assisted Dump
* - Device memory (e.g. PMEM, DAX, HMM)
@@ -1021,6 +1016,10 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy)
* The content of these pages is effectively stale. Such pages should not
* be touched (read/write/dump/save) except by their owner.
*
+ * When a memory block gets onlined, all pages are initialized with a
+ * refcount of 1 and PageOffline(). generic_online_page() will
+ * take care of clearing PageOffline().
+ *
* If a driver wants to allow to offline unmovable PageOffline() pages without
* putting them back to the buddy, it can do so via the memory notifier by
* decrementing the reference count in MEM_GOING_OFFLINE and incrementing the
@@ -1028,8 +1027,7 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy)
* pages (now with a reference count of zero) are treated like free pages,
* allowing the containing memory block to get offlined. A driver that
* relies on this feature is aware that re-onlining the memory block will
- * require to re-set the pages PageOffline() and not giving them to the
- * buddy via online_page_callback_t.
+ * require not giving them to the buddy via generic_online_page().
*
* There are drivers that mark a page PageOffline() and expect there won't be
* any further access to page content. PFN walkers that read content of random
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 27e3be75edcf7..0254059efcbe1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -734,7 +734,7 @@ static inline void section_taint_zone_device(unsigned long pfn)
/*
* Associate the pfn range with the given zone, initializing the memmaps
* and resizing the pgdat/zone data to span the added pages. After this
- * call, all affected pages are PG_reserved.
+ * call, all affected pages are PageOffline().
*
* All aligned pageblocks are initialized to the specified migratetype
* (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related
@@ -1100,8 +1100,12 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,

move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);

- for (i = 0; i < nr_pages; i++)
- SetPageVmemmapSelfHosted(pfn_to_page(pfn + i));
+ for (i = 0; i < nr_pages; i++) {
+ struct page *page = pfn_to_page(pfn + i);
+
+ __ClearPageOffline(page);
+ SetPageVmemmapSelfHosted(page);
+ }

/*
* It might be that the vmemmap_pages fully span sections. If that is
@@ -1959,9 +1963,9 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
* Don't allow to offline memory blocks that contain holes.
* Consequently, memory blocks with holes can never get onlined
* via the hotplug path - online_pages() - as hotplugged memory has
- * no holes. This way, we e.g., don't have to worry about marking
- * memory holes PG_reserved, don't need pfn_valid() checks, and can
- * avoid using walk_system_ram_range() later.
+ * no holes. This way, we don't have to worry about memory holes,
+ * don't need pfn_valid() checks, and can avoid using
+ * walk_system_ram_range() later.
*/
walk_system_ram_range(start_pfn, nr_pages, &system_ram_pages,
count_system_ram_pages_cb);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index feb5b6e8c8875..c066c1c474837 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -892,8 +892,14 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone

page = pfn_to_page(pfn);
__init_single_page(page, pfn, zone, nid);
- if (context == MEMINIT_HOTPLUG)
- __SetPageReserved(page);
+ if (context == MEMINIT_HOTPLUG) {
+#ifdef CONFIG_ZONE_DEVICE
+ if (zone == ZONE_DEVICE)
+ __SetPageReserved(page);
+ else
+#endif
+ __SetPageOffline(page);
+ }

/*
* Usually, we want to mark the pageblock MIGRATE_MOVABLE,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e0c8a8354be36..039bc52cc9091 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1225,18 +1225,23 @@ void __free_pages_core(struct page *page, unsigned int order,
* When initializing the memmap, __init_single_page() sets the refcount
* of all pages to 1 ("allocated"/"not free"). We have to set the
* refcount of all involved pages to 0.
+ *
+ * Note that hotplugged memory pages are initialized to PageOffline().
+ * Pages freed from memblock might be marked as reserved.
*/
- prefetchw(p);
- for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
- prefetchw(p + 1);
- __ClearPageReserved(p);
- set_page_count(p, 0);
- }
- __ClearPageReserved(p);
- set_page_count(p, 0);
-
if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG) &&
unlikely(context == MEMINIT_HOTPLUG)) {
+ prefetchw(p);
+ for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
+ prefetchw(p + 1);
+ VM_WARN_ON_ONCE(PageReserved(p));
+ __ClearPageOffline(p);
+ set_page_count(p, 0);
+ }
+ VM_WARN_ON_ONCE(PageReserved(p));
+ __ClearPageOffline(p);
+ set_page_count(p, 0);
+
/*
* Freeing the page with debug_pagealloc enabled will try to
* unmap it; some archs don't like double-unmappings, so
@@ -1245,6 +1250,15 @@ void __free_pages_core(struct page *page, unsigned int order,
debug_pagealloc_map_pages(page, nr_pages);
adjust_managed_page_count(page, nr_pages);
} else {
+ prefetchw(p);
+ for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
+ prefetchw(p + 1);
+ __ClearPageReserved(p);
+ set_page_count(p, 0);
+ }
+ __ClearPageReserved(p);
+ set_page_count(p, 0);
+
/* memblock adjusts totalram_pages() ahead of time. */
atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
}
--
2.45.1