Re: [PATCH v1 2/3] mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()

From: David Hildenbrand
Date: Mon Jun 10 2024 - 04:56:17 EST


On 10.06.24 06:23, Oscar Salvador wrote:
On Fri, Jun 07, 2024 at 11:09:37AM +0200, David Hildenbrand wrote:
We currently initialize the memmap such that PG_reserved is set and the
refcount of the page is 1. In virtio-mem code, we have to manually clear
that PG_reserved flag to make memory offlining with partially hotplugged
memory blocks possible: has_unmovable_pages() would otherwise bail out on
such pages.
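
(For context, a simplified sketch of the pre-patch virtio-mem logic referred
to above, loosely based on virtio_mem_set_fake_offline() in
drivers/virtio/virtio_mem.c; details trimmed:)

	static void virtio_mem_set_fake_offline(unsigned long pfn,
						unsigned long nr_pages, bool onlined)
	{
		page_offline_begin();
		for (; nr_pages--; pfn++) {
			struct page *page = pfn_to_page(pfn);

			/* Mark the page as logically offline. */
			__SetPageOffline(page);
			if (!onlined)
				/* The hotplugged memmap was initialized PG_reserved. */
				ClearPageReserved(page);
		}
		page_offline_end();
	}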

We want to avoid PG_reserved where possible and move to typed pages
instead. Furthermore, we want to enlighten memory offlining code about
PG_offline: offline pages in an online memory section. One example is
handling managed page count adjustments in a cleaner way during memory
offlining.

So let's initialize the pages with PG_offline instead of PG_reserved.
generic_online_page()->__free_pages_core() will now clear that flag before
handing that memory to the buddy.
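
(A simplified sketch of that change, not the verbatim patch, which threads a
meminit_context through __free_pages_core():)

	/* In __free_pages_core(), for memory that went through hotplug: */
	if (context == MEMINIT_HOTPLUG) {
		for (loop = 0; loop < nr_pages; loop++, p++) {
			VM_WARN_ON_ONCE(PageReserved(p));
			__ClearPageOffline(p);	/* was: __ClearPageReserved(p) */
			set_page_count(p, 0);
		}
	}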

Note that the page refcount is still 1 and would forbid offlining of such
memory except when special care is taken during GOING_OFFLINE, as
currently only implemented by virtio-mem.
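
(That special care essentially means dropping the last reference from the
MEM_GOING_OFFLINE notifier; simplified sketch based on
virtio_mem_fake_offline_going_offline() in drivers/virtio/virtio_mem.c:)

	/* Drop our reference to the fake-offline pages so the refcount
	 * hits 0 and memory offlining can proceed. */
	for (i = 0; i < nr_pages; i++) {
		page = pfn_to_page(pfn + i);
		if (WARN_ON(!page_ref_dec_and_test(page)))
			dump_page(page, "fake-offline page referenced");
	}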

With this change, we can now get non-PageReserved() pages in the XEN
balloon list. From what I can tell, that can already happen via
decrease_reservation(), so that should be fine.

HV-balloon should not really observe a change: partial online memory
blocks still cannot get surprise-offlined, because the refcount of these
PageOffline() pages is 1.

Update virtio-mem, HV-balloon and XEN-balloon code to be aware that
hotplugged pages are now PageOffline() instead of PageReserved() before
they are handed over to the buddy.

We'll leave the ZONE_DEVICE case alone for now.

Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 27e3be75edcf7..0254059efcbe1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -734,7 +734,7 @@ static inline void section_taint_zone_device(unsigned long pfn)
/*
* Associate the pfn range with the given zone, initializing the memmaps
* and resizing the pgdat/zone data to span the added pages. After this
- * call, all affected pages are PG_reserved.
+ * call, all affected pages are PageOffline().
*
* All aligned pageblocks are initialized to the specified migratetype
* (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related
@@ -1100,8 +1100,12 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
- for (i = 0; i < nr_pages; i++)
- SetPageVmemmapSelfHosted(pfn_to_page(pfn + i));
+ for (i = 0; i < nr_pages; i++) {
+ struct page *page = pfn_to_page(pfn + i);
+
+ __ClearPageOffline(page);
+ SetPageVmemmapSelfHosted(page);

So, refresh my memory here please.
AFAIR, those VmemmapSelfHosted pages were marked Reserved before, but now
memmap_init_range() will not mark them Reserved anymore.

Correct.

I do not think that is ok. I am worried about walkers getting this wrong.

We usually skip PageReserved pages in walkers because they are pages we cannot
deal with for those purposes, but with this change, we will leak
PageVmemmapSelfHosted pages, and I am not sure whether we are ready for that.

There are fortunately not that many left.

I'd even say marking them (vmemmap) reserved is more wrong than right: note that ordinary vmemmap pages after memory hotplug are not reserved! Only bootmem should be reserved.

Let's take a look at the relevant core-mm ones (arch stuff is mostly just for MMIO remapping):

fs/proc/task_mmu.c: if (PageReserved(page))
fs/proc/task_mmu.c: if (PageReserved(page))

-> If we find vmemmap pages mapped into user space, we have already
   messed up seriously

kernel/power/snapshot.c: if (PageReserved(page) ||
kernel/power/snapshot.c: if (PageReserved(page)

-> There should be no change (saveable_page() would still allow saving
them, highmem does not apply)

mm/hugetlb_vmemmap.c: if (!PageReserved(head))
mm/hugetlb_vmemmap.c: if (PageReserved(page))

-> Wants to identify bootmem, but we already properly exclude
   PageVmemmapSelfHosted() pages on the splitting path (see the
   sketch after this list)


mm/page_alloc.c: VM_WARN_ON_ONCE(PageReserved(p));
mm/page_alloc.c: if (PageReserved(page))

-> pfn_range_valid_contig() would scan them, just like ordinary vmemmap
   pages during hotplug. We'll simply fail to isolate/migrate them later,
   like any other unmovable allocation

mm/page_ext.c: BUG_ON(PageReserved(page));

-> free_page_ext handling, does not apply

mm/page_isolation.c: if (PageReserved(page))

-> has_unmovable_pages() should still detect them as unmovable (i.e.,
   neither movable nor LRU).

mm/page_owner.c: if (PageReserved(page))
mm/page_owner.c: if (PageReserved(page))

-> page_ext_get() will simply return NULL instead, and we'll similarly
   skip them

mm/sparse.c: if (!PageReserved(virt_to_page(ms->usage))) {

-> Detecting boot memory for ms->usage allocation, does not apply to
vmemmap.

virt/kvm/kvm_main.c: if (!PageReserved(page))
virt/kvm/kvm_main.c: return !PageReserved(page);

-> For MMIO remapping purposes, does not apply to vmemmap
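
To make the hugetlb_vmemmap case above concrete, here is a condensed sketch
of the bootmem detection (the helper name is made up, not verbatim kernel
code):

	/* Bootmem-allocated vmemmap pages are the PageReserved() ones,
	 * but self-hosted vmemmap pages (allocated from the hotplugged
	 * range itself) are explicitly excluded. */
	static bool is_bootmem_vmemmap(struct page *page)
	{
		return PageReserved(page) && !PageVmemmapSelfHosted(page);
	}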


Moreover, boot memmap pages are marked as PageReserved, which would now be
inconsistent with those added during hotplug operations.

Just like the vmemmap pages allocated dynamically during ordinary memory hotplug. Now, really only the bootmem ones are PageReserved.

All in all, I feel uneasy about this change.

I really don't want to mark these pages PageReserved here just for the sake of it.

Is there any PageReserved user that I am missing, or a reason why we should handle these vmemmap pages differently from the ones allocated during ordinary memory hotplug?

In the future, we might want to consider using a dedicated page type for them, so we can stop relying on a bit that doesn't allow us to reliably identify them. (We should then mark all vmemmap pages with that type.)
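
For illustration, a rough sketch of what such a dedicated type could look
like, modeled on the existing page types (PG_offline, PG_buddy, ...) in
include/linux/page-flags.h; the name and bit value below are invented:

	/* Hypothetical, not an existing kernel API: */
	#define PG_vmemmap	0x00001000

	/* The PAGE_TYPE_OPS() machinery would then generate PageVmemmap(),
	 * __SetPageVmemmap() and __ClearPageVmemmap(), giving walkers a
	 * reliable test instead of (ab)using PG_reserved. */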

Thanks!

--
Cheers,

David / dhildenb