Re: [PATCH v2 4/5] mm,memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap

From: David Hildenbrand
Date: Tue Jun 25 2019 - 04:49:25 EST


On 25.06.19 09:52, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
>
> This has some disadvantages:
> a) existing memory is consumed for that purpose
> (~2MB per 128MB memory section on x86_64)
> b) if the whole node is movable then we have off-node struct pages
> which have performance drawbacks.
>
> a) has turned out to be a problem for memory hotplug based ballooning
> because userspace might not react in time to online memory, while
> the memory consumed during physical hotadd is enough to
> push the system to OOM. 31bc3858ea3e ("memory-hotplug: add automatic onlining
> policy for the newly added memory") was added to work around that
> problem.
>
> I have also seen hot-add operations failing on powerpc due to the fact
> that we try to use order-8 pages. If the base page size is 64KB, this
> gives us 16MB, and if we run out of those, we simply fail.
> One could argue that we can fall back to base pages as we do on x86_64, but
> we can do better when CONFIG_SPARSEMEM_VMEMMAP is enabled.
>
> Vmemmap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully.
>
> Implementation-wise, we reuse the vmem_altmap infrastructure to override
> the default allocator used by __vmemmap_populate(). Once the memmap is
> allocated we need a way to mark altmap pfns used for the allocation.
> If an MHP_MEMMAP_{DEVICE,MEMBLOCK} flag was passed, we set up the layout of the
> altmap structure at the beginning of __add_pages(), and then we call
> mark_vmemmap_pages().
>
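For readers skimming the changelog: the per-flag altmap sizing boils down to
the sketch below. It is taken from the __add_pages() hunk further down (the
new mhp_init_altmap() helper), only with comments added; nothing here goes
beyond the patch itself.

	static void mhp_init_altmap(unsigned long pfn, unsigned long nr_pages,
				    unsigned long mhp_flags,
				    struct vmem_altmap *altmap)
	{
		if (mhp_flags & MHP_MEMMAP_DEVICE)
			/* memmap for the whole range is carved from its start */
			altmap->free = nr_pages;
		else
			/*
			 * MHP_MEMMAP_MEMBLOCK: each memory block hosts the
			 * memmap for its own sections (altmap is reset per block)
			 */
			altmap->free = PAGES_PER_SECTION * sections_per_block;
		altmap->base_pfn = pfn;
	}
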
> Depending on which flag is passed (MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK),
> mark_vmemmap_pages() gets called at a different stage.
> With MHP_MEMMAP_MEMBLOCK, we call it once we have populated the sections
> fitting in a single memblock, while with MHP_MEMMAP_DEVICE we wait until all
> sections have been populated.

So, only MHP_MEMMAP_DEVICE will be used. Would it make sense to only
implement one for now (after we decide which one to use), to make things
simpler?

Or do you have a real user in mind for the other?

>
> mark_vmemmap_pages() marks the pages as vmemmap and sets some metadata:
>
> The current layout of the vmemmap pages is:
>
> [Head->refcount] : Nr sections used by this altmap
> [Head->private] : Nr of vmemmap pages
> [Tail->freelist] : Pointer to the head page
>
> This is done to ease the computations we need in some places.
> E.g:
>
> Example 1)
> We hot-add 1GB on x86_64 (memory block 128MB) using
> MHP_MEMMAP_DEVICE:
>
> head->_refcount = 8 sections
> head->private = 4096 vmemmap pages
> tail->freelist = head
>
> Example 2)
> We hot-add 1GB on x86_64 using MHP_MEMMAP_MEMBLOCK:
>
> [at the beginning of each memblock]
> head->_refcount = 1 section
> head->private = 512 vmemmap pages
> tail->freelist = head
>
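For reference, the numbers in both examples follow from the x86_64 defaults,
assuming 4K base pages, 128M sections/memory blocks and a 64-byte struct page:
1G / 4K = 262144 struct pages * 64 bytes = 16M of memmap = 4096 vmemmap pages
spread over 8 sections; per 128M memory block that is 4096 / 8 = 512 pages,
i.e. the ~2MB per section mentioned at the top.
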
> We have the refcount because, when using MHP_MEMMAP_DEVICE, we need to know
> how long we have to defer the call to vmemmap_free().
> The first pages of the hot-added range are used to create
> the memmap mapping, so we cannot remove those first; otherwise we would blow up
> when accessing the other pages.
>
> Since sections are removed sequentially when we hot-remove a memory range,
> we wait until we hit the last section, and then we free the whole range
> backwards via vmemmap_free().
> We know that it is the last section because in every pass we
> decrease head->_refcount, and when it reaches 0, we have reached the last section.
>
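For MHP_MEMMAP_DEVICE the deferral itself is small; this is a condensed view of
the mm/sparse.c hunk below, with comments added (vmemmap_dec_and_test() is just
a page_ref_dec_and_test() on the stashed head page):

	static void deferred_vmemmap_free(unsigned long start, unsigned long end)
	{
		/* First removed section of the range: remember the head page. */
		if (!freeing_vmemmap_range) {
			freeing_vmemmap_range = true;
			head_vmemmap_page = (struct page *)start;
		}

		/*
		 * Every removed section drops head->_refcount; only the last
		 * one reaches zero and frees the whole range backwards, so the
		 * pages hosting the memmap itself are torn down last.
		 */
		if (vmemmap_dec_and_test())
			free_deferred_vmemmap_range(start, end);
	}
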
> We also have to be careful about those pages during online and offline
> operations. They are simply skipped: online keeps them reserved (and thus
> unusable for any other purpose) and offline ignores them, so they do not
> block the offline operation.
>
> In the offline operation we only have to check for one particularity.
> Depending on how large the hot-added range was, when using MHP_MEMMAP_DEVICE
> it can be that one or more memory blocks are filled with only vmemmap pages.
> We just need to check for this case and skip 1) isolating and 2) migrating,
> because those pages do not need to be migrated anywhere; they are self-hosted.
>
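The offline-side check described above is this small piece of the
__offline_pages() hunk below; when the whole memory block consists of vmemmap
pages, "skip" makes it bypass both isolation and migration and only do the
accounting:

	if (PageVmemmap(pfn_to_page(start_pfn))) {
		nr_vmemmap_pages = get_nr_vmemmap_pages(pfn_to_page(start_pfn));
		nr_vmemmap_pages = min(nr_vmemmap_pages, nr_pages);
		/* block contains only vmemmap pages: nothing to isolate/migrate */
		if (nr_vmemmap_pages == nr_pages)
			skip = true;
	}
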
> Signed-off-by: Oscar Salvador <osalvador@xxxxxxx>
> ---
> arch/arm64/mm/mmu.c | 5 +-
> arch/powerpc/mm/init_64.c | 7 +++
> arch/s390/mm/init.c | 6 ++
> arch/x86/mm/init_64.c | 10 +++
> drivers/acpi/acpi_memhotplug.c | 2 +-
> drivers/base/memory.c | 2 +-
> include/linux/memory_hotplug.h | 6 ++
> include/linux/memremap.h | 2 +-
> mm/compaction.c | 7 +++
> mm/memory_hotplug.c | 138 +++++++++++++++++++++++++++++++++++------
> mm/page_alloc.c | 22 ++++++-
> mm/page_isolation.c | 14 ++++-
> mm/sparse.c | 93 +++++++++++++++++++++++++++
> 13 files changed, 289 insertions(+), 25 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 93ed0df4df79..d4b5661fa6b6 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -765,7 +765,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> if (pmd_none(READ_ONCE(*pmdp))) {
> void *p = NULL;
>
> - p = vmemmap_alloc_block_buf(PMD_SIZE, node);
> + if (altmap)
> + p = altmap_alloc_block_buf(PMD_SIZE, altmap);
> + else
> + p = vmemmap_alloc_block_buf(PMD_SIZE, node);
> if (!p)
> return -ENOMEM;
>
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index a4e17a979e45..ff9d2c245321 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -289,6 +289,13 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
>
> if (base_pfn >= alt_start && base_pfn < alt_end) {
> vmem_altmap_free(altmap, nr_pages);
> + } else if (PageVmemmap(page)) {
> + /*
> + * runtime vmemmap pages are residing inside the memory
> + * section so they do not have to be freed anywhere.
> + */
> + while (PageVmemmap(page))
> + __ClearPageVmemmap(page++);
> } else if (PageReserved(page)) {
> /* allocated from bootmem */
> if (page_size < PAGE_SIZE) {
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index ffb81fe95c77..c045411552a3 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -226,6 +226,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
> unsigned long size_pages = PFN_DOWN(size);
> int rc;
>
> + /*
> + * Physical memory is only added later, during memory onlining, so we
> + * cannot use the added range at this stage, unfortunately.
> + */
> + restrictions->flags = 0;
> +
> if (WARN_ON_ONCE(restrictions->altmap))
> return -EINVAL;
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 688fb0687e55..00d17b666337 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -874,6 +874,16 @@ static void __meminit free_pagetable(struct page *page, int order)
> unsigned long magic;
> unsigned int nr_pages = 1 << order;
>
> + /*
> + * Runtime vmemmap pages are residing inside the memory section so
> + * they do not have to be freed anywhere.
> + */
> + if (PageVmemmap(page)) {
> + while (nr_pages--)
> + __ClearPageVmemmap(page++);
> + return;
> + }
> +
> /* bootmem page has reserved flag */
> if (PageReserved(page)) {
> __ClearPageReserved(page);
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index 860f84e82dd0..3257edb98d90 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -218,7 +218,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
> if (node < 0)
> node = memory_add_physaddr_to_nid(info->start_addr);
>
> - result = __add_memory(node, info->start_addr, info->length, 0);
> + result = __add_memory(node, info->start_addr, info->length, MHP_MEMMAP_DEVICE);
>
> /*
> * If the memory block has been used by the kernel, add_memory()
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index ad9834b8b7f7..e0ac9a3b66f8 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -32,7 +32,7 @@ static DEFINE_MUTEX(mem_sysfs_mutex);
>
> #define to_memory_block(dev) container_of(dev, struct memory_block, dev)
>
> -static int sections_per_block;
> +int sections_per_block;
>
> static inline int base_memory_block_id(int section_nr)
> {
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 6fdbce9d04f9..e28e226c9a20 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -375,4 +375,10 @@ extern bool allow_online_pfn_range(int nid, unsigned long pfn, unsigned long nr_
> int online_type);
> extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
> unsigned long nr_pages);
> +
> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> +extern void mark_vmemmap_pages(struct vmem_altmap *self);
> +#else
> +static inline void mark_vmemmap_pages(struct vmem_altmap *self) {}
> +#endif
> #endif /* __LINUX_MEMORY_HOTPLUG_H */
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 1732dea030b2..6de37e168f57 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -16,7 +16,7 @@ struct device;
> * @alloc: track pages consumed, private to vmemmap_populate()
> */
> struct vmem_altmap {
> - const unsigned long base_pfn;
> + unsigned long base_pfn;
> const unsigned long reserve;
> unsigned long free;
> unsigned long align;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 9e1b9acb116b..40697f74b8b4 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -855,6 +855,13 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> nr_scanned++;
>
> page = pfn_to_page(low_pfn);
> + /*
> + * Vmemmap pages do not need to be isolated.
> + */
> + if (PageVmemmap(page)) {
> + low_pfn += get_nr_vmemmap_pages(page) - 1;
> + continue;
> + }
>
> /*
> * Check if the pageblock has already been marked skipped.
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e4e3baa6eaa7..b5106cb75795 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -42,6 +42,8 @@
> #include "internal.h"
> #include "shuffle.h"
>
> +extern int sections_per_block;
> +
> /*
> * online_page_callback contains pointer to current page onlining function.
> * Initially it is generic_online_page(). If it is required it could be
> @@ -279,6 +281,24 @@ static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
> return 0;
> }
>
> +static void mhp_reset_altmap(unsigned long next_pfn,
> + struct vmem_altmap *altmap)
> +{
> + altmap->base_pfn = next_pfn;
> + altmap->alloc = 0;
> +}
> +
> +static void mhp_init_altmap(unsigned long pfn, unsigned long nr_pages,
> + unsigned long mhp_flags,
> + struct vmem_altmap *altmap)
> +{
> + if (mhp_flags & MHP_MEMMAP_DEVICE)
> + altmap->free = nr_pages;
> + else
> + altmap->free = PAGES_PER_SECTION * sections_per_block;
> + altmap->base_pfn = pfn;
> +}
> +
> /*
> * Reasonably generic function for adding memory. It is
> * expected that archs that support memory hotplug will
> @@ -290,8 +310,17 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
> {
> unsigned long i;
> int start_sec, end_sec, err;
> - struct vmem_altmap *altmap = restrictions->altmap;
> + struct vmem_altmap *altmap;
> + struct vmem_altmap __memblk_altmap = {};
> + unsigned long mhp_flags = restrictions->flags;
> + unsigned long sections_added;
> +
> + if (mhp_flags & MHP_VMEMMAP_FLAGS) {
> + mhp_init_altmap(pfn, nr_pages, mhp_flags, &__memblk_altmap);
> + restrictions->altmap = &__memblk_altmap;
> + }
>
> + altmap = restrictions->altmap;
> if (altmap) {
> /*
> * Validate altmap is within bounds of the total request
> @@ -308,9 +337,10 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
> if (err)
> return err;
>
> + sections_added = 1;
> start_sec = pfn_to_section_nr(pfn);
> end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
> - for (i = start_sec; i <= end_sec; i++) {
> + for (i = start_sec; i <= end_sec; i++, sections_added++) {
> unsigned long pfns;
>
> pfns = min(nr_pages, PAGES_PER_SECTION
> @@ -320,9 +350,19 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
> break;
> pfn += pfns;
> nr_pages -= pfns;
> +
> + if (mhp_flags & MHP_MEMMAP_MEMBLOCK &&
> + !(sections_added % sections_per_block)) {
> + mark_vmemmap_pages(altmap);
> + mhp_reset_altmap(pfn, altmap);
> + }
> cond_resched();
> }
> vmemmap_populate_print_last();
> +
> + if (mhp_flags & MHP_MEMMAP_DEVICE)
> + mark_vmemmap_pages(altmap);
> +
> return err;
> }
>
> @@ -642,6 +682,14 @@ static int online_pages_blocks(unsigned long start, unsigned long nr_pages)
> while (start < end) {
> order = min(MAX_ORDER - 1,
> get_order(PFN_PHYS(end) - PFN_PHYS(start)));
> + /*
> + * Check if the pfn is aligned to its order.
> + * If not, we decrement the order until it is,
> + * otherwise __free_one_page will bug us.
> + */
> + while (start & ((1 << order) - 1))
> + order--;
> +
> (*online_page_callback)(pfn_to_page(start), order);
>
> onlined_pages += (1UL << order);
> @@ -654,13 +702,30 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
> void *arg)
> {
> unsigned long onlined_pages = *(unsigned long *)arg;
> + unsigned long pfn = start_pfn;
> + unsigned long nr_vmemmap_pages = 0;
>
> - if (PageReserved(pfn_to_page(start_pfn)))
> - onlined_pages += online_pages_blocks(start_pfn, nr_pages);
> + if (PageVmemmap(pfn_to_page(pfn))) {
> + /*
> + * Do not send vmemmap pages to the page allocator.
> + */
> + nr_vmemmap_pages = get_nr_vmemmap_pages(pfn_to_page(start_pfn));
> + nr_vmemmap_pages = min(nr_vmemmap_pages, nr_pages);
> + pfn += nr_vmemmap_pages;
> + if (nr_vmemmap_pages == nr_pages)
> + /*
> + * If the entire range contains only vmemmap pages,
> + * there are no pages left for the page allocator.
> + */
> + goto skip_online;
> + }
>
> + if (PageReserved(pfn_to_page(pfn)))
> + onlined_pages += online_pages_blocks(pfn, nr_pages - nr_vmemmap_pages);
> +skip_online:
> online_mem_sections(start_pfn, start_pfn + nr_pages);
>
> - *(unsigned long *)arg = onlined_pages;
> + *(unsigned long *)arg = onlined_pages + nr_vmemmap_pages;
> return 0;
> }
>
> @@ -1051,6 +1116,23 @@ static int online_memory_block(struct memory_block *mem, void *arg)
> return device_online(&mem->dev);
> }
>
> +static bool mhp_check_correct_flags(unsigned long flags)
> +{
> + if (flags & MHP_VMEMMAP_FLAGS) {
> + if (!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) {
> + WARN(1, "Vmemmap capability can only be used on"
> + "CONFIG_SPARSEMEM_VMEMMAP. Ignoring flags.\n");
> + return false;
> + }
> + if ((flags & MHP_VMEMMAP_FLAGS) == MHP_VMEMMAP_FLAGS) {
> + WARN(1, "Both MHP_MEMMAP_DEVICE and MHP_MEMMAP_MEMBLOCK"
> + "were passed. Ignoring flags.\n");
> + return false;
> + }
> + }
> + return true;
> +}
> +
> /*
> * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> * and online/offline operations (triggered e.g. by sysfs).
> @@ -1086,6 +1168,9 @@ int __ref add_memory_resource(int nid, struct resource *res, unsigned long flags
> goto error;
> new_node = ret;
>
> + if (mhp_check_correct_flags(flags))
> + restrictions.flags = flags;
> +
> /* call arch's memory hotadd */
> ret = arch_add_memory(nid, start, size, &restrictions);
> if (ret < 0)
> @@ -1518,12 +1603,14 @@ static int __ref __offline_pages(unsigned long start_pfn,
> {
> unsigned long pfn, nr_pages;
> unsigned long offlined_pages = 0;
> + unsigned long nr_vmemmap_pages = 0;
> int ret, node, nr_isolate_pageblock;
> unsigned long flags;
> unsigned long valid_start, valid_end;
> struct zone *zone;
> struct memory_notify arg;
> char *reason;
> + bool skip = false;
>
> mem_hotplug_begin();
>
> @@ -1540,15 +1627,24 @@ static int __ref __offline_pages(unsigned long start_pfn,
> node = zone_to_nid(zone);
> nr_pages = end_pfn - start_pfn;
>
> - /* set above range as isolated */
> - ret = start_isolate_page_range(start_pfn, end_pfn,
> - MIGRATE_MOVABLE,
> - SKIP_HWPOISON | REPORT_FAILURE);
> - if (ret < 0) {
> - reason = "failure to isolate range";
> - goto failed_removal;
> + if (PageVmemmap(pfn_to_page(start_pfn))) {
> + nr_vmemmap_pages = get_nr_vmemmap_pages(pfn_to_page(start_pfn));
> + nr_vmemmap_pages = min(nr_vmemmap_pages, nr_pages);
> + if (nr_vmemmap_pages == nr_pages)
> + skip = true;
> + }
> +
> + if (!skip) {
> + /* set above range as isolated */
> + ret = start_isolate_page_range(start_pfn, end_pfn,
> + MIGRATE_MOVABLE,
> + SKIP_HWPOISON | REPORT_FAILURE);
> + if (ret < 0) {
> + reason = "failure to isolate range";
> + goto failed_removal;
> + }
> + nr_isolate_pageblock = ret;
> }
> - nr_isolate_pageblock = ret;
>
> arg.start_pfn = start_pfn;
> arg.nr_pages = nr_pages;
> @@ -1561,6 +1657,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
> goto failed_removal_isolated;
> }
>
> + if (skip)
> + goto skip_migration;
> +
> do {
> for (pfn = start_pfn; pfn;) {
> if (signal_pending(current)) {
> @@ -1601,7 +1700,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
> We cannot do rollback at this point. */
> walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> &offlined_pages, offline_isolated_pages_cb);
> - pr_info("Offlined Pages %ld\n", offlined_pages);
> +
> +skip_migration:
> + pr_info("Offlined Pages %ld\n", offlined_pages + nr_vmemmap_pages);
> /*
> * Onlining will reset pagetype flags and makes migrate type
> * MOVABLE, so just need to decrease the number of isolated
> @@ -1612,11 +1713,12 @@ static int __ref __offline_pages(unsigned long start_pfn,
> spin_unlock_irqrestore(&zone->lock, flags);
>
> /* removal success */
> - adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
> - zone->present_pages -= offlined_pages;
> + if (offlined_pages)
> + adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
> + zone->present_pages -= offlined_pages + nr_vmemmap_pages;
>
> pgdat_resize_lock(zone->zone_pgdat, &flags);
> - zone->zone_pgdat->node_present_pages -= offlined_pages;
> + zone->zone_pgdat->node_present_pages -= offlined_pages + nr_vmemmap_pages;
> pgdat_resize_unlock(zone->zone_pgdat, &flags);
>
> init_per_zone_wmark_min();
> @@ -1645,7 +1747,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
> memory_notify(MEM_CANCEL_OFFLINE, &arg);
> failed_removal:
> pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
> - (unsigned long long) start_pfn << PAGE_SHIFT,
> + (unsigned long long) (start_pfn - nr_vmemmap_pages) << PAGE_SHIFT,
> ((unsigned long long) end_pfn << PAGE_SHIFT) - 1,
> reason);
> /* pushback to free area */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5b3266d63521..7a73a06c5730 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1282,9 +1282,14 @@ static void free_one_page(struct zone *zone,
> static void __meminit __init_single_page(struct page *page, unsigned long pfn,
> unsigned long zone, int nid)
> {
> - mm_zero_struct_page(page);
> + if (!__PageVmemmap(page)) {
> + /*
> + * Vmemmap pages need to preserve their state.
> + */
> + mm_zero_struct_page(page);
> + init_page_count(page);
> + }
> set_page_links(page, zone, nid, pfn);
> - init_page_count(page);
> page_mapcount_reset(page);
> page_cpupid_reset_last(page);
> page_kasan_tag_reset(page);
> @@ -8143,6 +8148,14 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>
> page = pfn_to_page(check);
>
> + /*
> + * Vmemmap pages do not need to be moved around.
> + */
> + if (PageVmemmap(page)) {
> + iter += get_nr_vmemmap_pages(page) - 1;
> + continue;
> + }
> +
> if (PageReserved(page))
> goto unmovable;
>
> @@ -8510,6 +8523,11 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
> continue;
> }
> page = pfn_to_page(pfn);
> +
> + if (PageVmemmap(page)) {
> + pfn += get_nr_vmemmap_pages(page);
> + continue;
> + }
> /*
> * The HWPoisoned page may be not in buddy system, and
> * page_count() is not 0.
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index e3638a5bafff..128c47a27925 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -146,7 +146,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
> static inline struct page *
> __first_valid_page(unsigned long pfn, unsigned long nr_pages)
> {
> - int i;
> + unsigned long i;
>
> for (i = 0; i < nr_pages; i++) {
> struct page *page;
> @@ -154,6 +154,10 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
> page = pfn_to_online_page(pfn + i);
> if (!page)
> continue;
> + if (PageVmemmap(page)) {
> + i += get_nr_vmemmap_pages(page) - 1;
> + continue;
> + }
> return page;
> }
> return NULL;
> @@ -268,6 +272,14 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
> continue;
> }
> page = pfn_to_page(pfn);
> + /*
> + * Vmemmap pages are not isolated. Skip them.
> + */
> + if (PageVmemmap(page)) {
> + pfn += get_nr_vmemmap_pages(page);
> + continue;
> + }
> +
> if (PageBuddy(page))
> /*
> * If the page is on a free list, it has to be on
> diff --git a/mm/sparse.c b/mm/sparse.c
> index b77ca21a27a4..04b395fb4463 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -635,6 +635,94 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
> #endif
>
> #ifdef CONFIG_SPARSEMEM_VMEMMAP
> +void mark_vmemmap_pages(struct vmem_altmap *self)
> +{
> + unsigned long pfn = self->base_pfn + self->reserve;
> + unsigned long nr_pages = self->alloc;
> + unsigned long nr_sects = self->free / PAGES_PER_SECTION;
> + unsigned long i;
> + struct page *head;
> +
> + if (!nr_pages)
> + return;
> +
> + pr_debug("%s: marking %px - %px as Vmemmap (%ld pages)\n",
> + __func__,
> + pfn_to_page(pfn),
> + pfn_to_page(pfn + nr_pages - 1),
> + nr_pages);
> +
> + /*
> + * All allocations for memory hotplug are the same size, so align
> + * should be 0.
> + */
> + WARN_ON(self->align);
> +
> + /*
> + * Layout of vmemmap pages:
> + * [Head->refcount] : Nr sections used by this altmap
> + * [Head->private] : Nr of vmemmap pages
> + * [Tail->freelist] : Pointer to the head page
> + */
> +
> + /*
> + * Head, first vmemmap page
> + */
> + head = pfn_to_page(pfn);
> + for (i = 0; i < nr_pages; i++, pfn++) {
> + struct page *page = pfn_to_page(pfn);
> +
> + mm_zero_struct_page(page);
> + __SetPageVmemmap(page);
> + page->freelist = head;
> + init_page_count(page);
> + }
> + set_page_count(head, (int)nr_sects);
> + set_page_private(head, nr_pages);
> +}
> +/*
> + * If the range we are trying to remove was hot-added with vmemmap pages
> + * using MHP_MEMMAP_DEVICE, we need to keep track of it to know how long
> + * we have to defer the freeing.
> + * Since sections are removed sequentially in __remove_pages()->
> + * __remove_section(), we just wait until we hit the last section.
> + * Once that happens, we can trigger free_deferred_vmemmap_range() to actually
> + * free the whole memory range.
> + */
> +static struct page *head_vmemmap_page = NULL;
> +static bool freeing_vmemmap_range = false;
> +
> +static inline bool vmemmap_dec_and_test(void)
> +{
> + return page_ref_dec_and_test(head_vmemmap_page);
> +}
> +
> +static void free_deferred_vmemmap_range(unsigned long start,
> + unsigned long end)
> +{
> + unsigned long nr_pages = end - start;
> + unsigned long first_section = (unsigned long)head_vmemmap_page;
> +
> + while (start >= first_section) {
> + vmemmap_free(start, end, NULL);
> + end = start;
> + start -= nr_pages;
> + }
> + head_vmemmap_page = NULL;
> + freeing_vmemmap_range = false;
> +}
> +
> +static void deferred_vmemmap_free(unsigned long start, unsigned long end)
> +{
> + if (!freeing_vmemmap_range) {
> + freeing_vmemmap_range = true;
> + head_vmemmap_page = (struct page *)start;
> + }
> +
> + if (vmemmap_dec_and_test())
> + free_deferred_vmemmap_range(start, end);
> +}
> +
> static struct page *populate_section_memmap(unsigned long pfn,
> unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
> {
> @@ -647,6 +735,11 @@ static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
> unsigned long start = (unsigned long) pfn_to_page(pfn);
> unsigned long end = start + nr_pages * sizeof(struct page);
>
> + if (PageVmemmap((struct page *)start) || freeing_vmemmap_range) {
> + deferred_vmemmap_free(start, end);
> + return;
> + }
> +
> vmemmap_free(start, end, altmap);
> }
> static void free_map_bootmem(struct page *memmap)
>


--

Thanks,

David / dhildenb