Re: [RFC PATCH 3/4] mm, memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap

From: Dave Hansen
Date: Fri Nov 16 2018 - 17:41:30 EST


On 11/16/18 2:12 AM, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, kmalloc is used for those
> allocations.

Did you literally mean kmalloc? I thought we had a bunch of ways of
allocating memmaps, but I didn't think kmalloc() was actually used.

Like vmemmap_alloc_block(), for instance, uses alloc_pages_node().

So, can the ZONE_DEVICE altmaps move over to this infrastructure?
Doesn't this effectively duplicate that code?

...
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index 7a9886f98b0c..03f014abd4eb 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -278,6 +278,8 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
> continue;
>
> page = pfn_to_page(addr >> PAGE_SHIFT);
> + if (PageVmemmap(page))
> + continue;
> section_base = pfn_to_page(vmemmap_section_start(start));
> nr_pages = 1 << page_order;

Reading this, I'm wondering if PageVmemmap() could be named better.
>From this is reads like "skip PageVmemmap() pages if freeing vmemmap",
which does not make much sense.

This probably at _least_ needs a comment to explain why the pages are
being skipped.

> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index 4139affd6157..bc1523bcb09d 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -231,6 +231,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
> unsigned long size_pages = PFN_DOWN(size);
> int rc;
>
> + /*
> + * Physical memory is added only later during the memory online so we
> + * cannot use the added range at this stage unfortunatelly.

unfortunately ^

> + */
> + restrictions->flags &= ~MHP_MEMMAP_FROM_RANGE;

Could you also add to the comment about this being specific to s390?

> rc = vmem_add_mapping(start, size);
> if (rc)
> return rc;
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index fd06bcbd9535..d5234ca5c483 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -815,6 +815,13 @@ static void __meminit free_pagetable(struct page *page, int order)
> unsigned long magic;
> unsigned int nr_pages = 1 << order;
>
> + /*
> + * runtime vmemmap pages are residing inside the memory section so
> + * they do not have to be freed anywhere.
> + */
> + if (PageVmemmap(page))
> + return;

Thanks for the comment on this one, this one is right on.

> @@ -16,13 +18,18 @@ struct device;
> * @free: free pages set aside in the mapping for memmap storage
> * @align: pages reserved to meet allocation alignments
> * @alloc: track pages consumed, private to vmemmap_populate()
> + * @flush_alloc_pfns: callback to be called on the allocated range after it
> + * @nr_sects: nr of sects filled with memmap allocations
> + * is mapped to the vmemmap - see mark_vmemmap_pages
> */

I think you split up the "@flush_alloc_pfns" comment accidentally.

> struct vmem_altmap {
> - const unsigned long base_pfn;
> + unsigned long base_pfn;
> const unsigned long reserve;
> unsigned long free;
> unsigned long align;
> unsigned long alloc;
> + int nr_sects;
> + void (*flush_alloc_pfns)(struct vmem_altmap *self);
> };
>
> /*
> @@ -133,8 +140,62 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
> struct dev_pagemap *pgmap);
>
> -unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
> +static inline unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
> +{
> + /* number of pfns from base where pfn_to_page() is valid */
> + return altmap->reserve + altmap->free;
> +}
> void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
> +
> +static inline void mark_vmemmap_pages(struct vmem_altmap *self)
> +{
> + unsigned long pfn = self->base_pfn + self->reserve;
> + unsigned long nr_pages = self->alloc;
> + unsigned long align = PAGES_PER_SECTION * sizeof(struct page);
> + struct page *head;
> + unsigned long i;
> +
> + pr_debug("%s: marking %px - %px as Vmemmap\n", __func__,
> + pfn_to_page(pfn),
> + pfn_to_page(pfn + nr_pages - 1));
> + /*
> + * We keep track of the sections using this altmap by means
> + * of a refcount, so we know how much do we have to defer
> + * the call to vmemmap_free for this memory range.
> + * The refcount is kept in the first vmemmap page.
> + * For example:
> + * We add 10GB: (ffffea0004000000 - ffffea000427ffc0)
> + * ffffea0004000000 will have a refcount of 80.
> + */

The example is good, but it took me a minute to realize that 80 is
because 10GB is roughly 80 sections.

> + head = (struct page *)ALIGN_DOWN((unsigned long)pfn_to_page(pfn), align);

Is this ALIGN_DOWN() OK? It seems like it might be aligning 'pfn' down
into the reserved are that lies before it.

> + head = (struct page *)((unsigned long)head - (align * self->nr_sects));
> + page_ref_inc(head);
> +
> + /*
> + * We have used a/another section only with memmap allocations.
> + * We need to keep track of it in order to get the first head page
> + * to increase its refcount.
> + * This makes it easier to compute.
> + */
> + if (!((page_ref_count(head) * PAGES_PER_SECTION) % align))
> + self->nr_sects++;
> +
> + /*
> + * All allocations for the memory hotplug are the same sized so align
> + * should be 0
> + */
> + WARN_ON(self->align);
> + for (i = 0; i < nr_pages; i++, pfn++) {
> + struct page *page = pfn_to_page(pfn);
> + __SetPageVmemmap(page);
> + init_page_count(page);
> + }

Looks like some tabs vs. space problems.

> + self->alloc = 0;
> + self->base_pfn += nr_pages + self->reserve;
> + self->free -= nr_pages;
> +}
> #else
> static inline void *devm_memremap_pages(struct device *dev,
> struct dev_pagemap *pgmap)
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 50ce1bddaf56..e79054fcc96e 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -437,6 +437,24 @@ static __always_inline int __PageMovable(struct page *page)
> PAGE_MAPPING_MOVABLE;
> }
>
> +#define VMEMMAP_PAGE ~PAGE_MAPPING_FLAGS
> +static __always_inline int PageVmemmap(struct page *page)
> +{
> + return PageReserved(page) && (unsigned long)page->mapping == VMEMMAP_PAGE;
> +}
> +
> +static __always_inline void __ClearPageVmemmap(struct page *page)
> +{
> + ClearPageReserved(page);
> + page->mapping = NULL;
> +}
> +
> +static __always_inline void __SetPageVmemmap(struct page *page)
> +{
> + SetPageReserved(page);
> + page->mapping = (void *)VMEMMAP_PAGE;
> +}

Just curious, but why are these __ versions? I thought we used that for
non-atomic bit operations, but this uses the atomic SetPageReserved().

> diff --git a/mm/compaction.c b/mm/compaction.c
> index 7c607479de4a..c94a480e01b5 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -768,6 +768,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>
> page = pfn_to_page(low_pfn);
>
> + if (PageVmemmap(page))
> + goto isolate_fail;

Comments, please.

...
> +static int __online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
> +{
> + if (PageReserved(pfn_to_page(start_pfn)))
> + return online_pages_blocks(start_pfn, nr_pages);
> +
> + return 0;
> +}

Why is it important that 'start_pfn' is PageReserved()?

> static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
> - void *arg)
> + void *arg)
> {
> unsigned long onlined_pages = *(unsigned long *)arg;
> + unsigned long pfn = start_pfn;
> + unsigned long end_pfn = start_pfn + nr_pages;
> + bool vmemmap_page = false;
>
> - if (PageReserved(pfn_to_page(start_pfn)))
> - onlined_pages = online_pages_blocks(start_pfn, nr_pages);
> + for (; pfn < end_pfn; pfn++) {
> + struct page *p = pfn_to_page(pfn);
> +
> + /*
> + * Let us check if we got vmemmap pages.
> + */
> + if (PageVmemmap(p)) {
> + vmemmap_page = true;
> + break;
> + }
> + }

OK, so we did the hot-add, and allocated some of the memmap[] inside the
area being hot-added. Now, we're onlining the page. We search every
page in the *entire* area being onlined to try to find a PageVmemmap()?
That seems a bit inefficient, especially for sections where we don't
have a PageVmemmap().

> + if (!vmemmap_page) {
> + /*
> + * We can send the whole range to __online_pages_range,
> + * as we know for sure that there are not vmemmap pages.
> + */
> + onlined_pages += __online_pages_range(start_pfn, nr_pages);
> + } else {
> + /*
> + * We need to strip the vmemmap pages here,
> + * as we do not want to send them to the buddy allocator.
> + */
> + unsigned long sections = nr_pages / PAGES_PER_SECTION;
> + unsigned long sect_nr = 0;
> +
> + for (; sect_nr < sections; sect_nr++) {
> + unsigned pfn_end_section;
> + unsigned long memmap_pages = 0;
> +
> + pfn = start_pfn + (PAGES_PER_SECTION * sect_nr);
> + pfn_end_section = pfn + PAGES_PER_SECTION;
> +
> + while (pfn < pfn_end_section) {
> + struct page *p = pfn_to_page(pfn);
> +
> + if (PageVmemmap(p))
> + memmap_pages++;
> + pfn++;
> + }

... and another lienar search through the entire area being added.

> + pfn = start_pfn + (PAGES_PER_SECTION * sect_nr);
> + if (!memmap_pages) {
> + onlined_pages += __online_pages_range(pfn, PAGES_PER_SECTION);

If I read this right, this if() and the first block are unneeded. The
second block is funcationally identical if memmap_pages==0. Seems like
we can simplify the code. Also, is this _really_ under 80 columns?
Seems kinda long.

> + if (PageVmemmap(page))
> + continue;

FWIW, all these random-looking PageVmemmap() calls are a little
worrying. What happens when we get one wrong? Seems like we're kinda
bringing back all the PageReserved() checks we used to have scattered
all over.

> @@ -8138,6 +8146,16 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
> continue;
> }
> page = pfn_to_page(pfn);
> +
> + /*
> + * vmemmap pages are residing inside the memory section so
> + * they do not have to be freed anywhere.
> + */
> + if (PageVmemmap(page)) {
> + pfn++;
> + continue;
> + }


> +static struct page *current_vmemmap_page = NULL;
> +static bool vmemmap_dec_and_test(void)
> +{
> + bool ret = false;
> +
> + if (page_ref_dec_and_test(current_vmemmap_page))
> + ret = true;
> + return ret;
> +}

That's a bit of an obfuscated way to do:

return page_ref_dec_and_test(current_vmemmap_page));

:)

But, I also see a global variable, and this immediately makes me think
about locking and who "owns" this. Comments would help.

> +static void free_vmemmap_range(unsigned long limit, unsigned long start, unsigned long end)
> +{
> + unsigned long range_start;
> + unsigned long range_end;
> + unsigned long align = sizeof(struct page) * PAGES_PER_SECTION;
> +
> + range_end = end;
> + range_start = end - align;
> + while (range_start >= limit) {
> + vmemmap_free(range_start, range_end, NULL);
> + range_end = range_start;
> + range_start -= align;
> + }
> +}

This loop looks like it's working from the end of the range back to the
beginning. I guess that it works, but it's a bit unconventional to go
backwards. Was there a reason?

Overall, there's a lot of complexity here. This certainly doesn't make
the memory hotplug code simpler.