Re: [PATCH v4 3/3] x86/vmemmap: Handle unpopulated sub-pmd ranges

From: Dave Hansen
Date: Fri Mar 05 2021 - 12:39:00 EST


On 3/1/21 12:32 AM, Oscar Salvador wrote:
> When the size of a struct page is not multiple of 2MB, sections do
> not span a PMD anymore and so when populating them some parts of the
> PMD will remain unused.

Multiples of 2MB are 2MB, 4MB, 6MB, etc...

I think you meant here that 2MB must be a multiple of the 'struct page'
size. I don't think there are any non-power-of-2 factors of 2MB, so I
think it's probably simpler and more accurate to say:

When sizeof(struct page) is not a power of 2...

I also don't think I realized that 2MB of 'struct page'
(2M/sizeof(struct page)=32k) fit perfectly into a 128MB section
(32k/64=128M).

> Because of this, PMDs will be left behind when depopulating sections
> since remove_pmd_table() thinks that those unused parts are still in
> use.
>
> Fix this by marking the unused parts with PAGE_UNUSED, so memchr_inv()
> will do the right thing and will let us free the PMD when the last user
> of it is gone.
>
> This patch is based on a similar patch by David Hildenbrand:
>
> https://lore.kernel.org/linux-mm/20200722094558.9828-9-david@xxxxxxxxxx/
> https://lore.kernel.org/linux-mm/20200722094558.9828-10-david@xxxxxxxxxx/
>
> Signed-off-by: Oscar Salvador <osalvador@xxxxxxx>
> Reviewed-by: David Hildenbrand <david@xxxxxxxxxx>
> ---
> arch/x86/mm/init_64.c | 106 ++++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 98 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 9ecb3c488ac8..7e8de63f02b3 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -871,7 +871,93 @@ int arch_add_memory(int nid, u64 start, u64 size,
> return add_pages(nid, start_pfn, nr_pages, params);
> }
>
> -#define PAGE_INUSE 0xFD
> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> +#define PAGE_UNUSED 0xFD
> +/*
> + * The unused vmemmap range, which was not yet memset(PAGE_UNUSED) ranges
> + * from unused_pmd_start to next PMD_SIZE boundary.
> + */
> +static unsigned long unused_pmd_start __meminitdata;

This whole 'unused_pmd_start' thing was unmentioned in the changelog.

I also kept reading this and thinking it was a 'pmd_t *', not a 'struct
page *'. The naming is rather unfortunate that way.

So, is this here so that the memset()s can be avoided? It's just an
optimization to say: "This is unused, but instead of marking it with
PAGE_UNUSED (which would be slow) I keep a pointer to it"?

> +static void __meminit vmemmap_flush_unused_pmd(void)
> +{
> + if (!unused_pmd_start)
> + return;
> + /*
> + * Clears (unused_pmd_start, PMD_END]
> + */
> + memset((void *)unused_pmd_start, PAGE_UNUSED,
> + ALIGN(unused_pmd_start, PMD_SIZE) - unused_pmd_start);
> + unused_pmd_start = 0;
> +}

Oh, and this means: "stop using the unused_pmd_start optimization. Just
memset the thing".

> +/* Returns true if the PMD is completely unused and thus it can be freed */
> +static bool __meminit vmemmap_unuse_sub_pmd(unsigned long addr, unsigned long end)
> +{
> + unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);
> +
> + vmemmap_flush_unused_pmd();

That probably needs a comment like:

Flush the unused range cache to ensure that the memchr_inv()
will work for the whole range.

> + memset((void *)addr, PAGE_UNUSED, end - addr);
> +
> + return !memchr_inv((void *)start, PAGE_UNUSED, PMD_SIZE);
> +}

Also, logically, it would make a lot of sense if you can move the actual
PMD freeing logic in here. That way, the caller is just saying, "unuse
this PMD region", and then this takes care of the rest. As it stands,
it's a bit weird that the caller takes care of the freeing.

> +static void __meminit __vmemmap_use_sub_pmd(unsigned long start)
> +{
> + /*
> + * As we expect to add in the same granularity as we remove, it's
> + * sufficient to mark only some piece used to block the memmap page from
> + * getting removed when removing some other adjacent memmap (just in
> + * case the first memmap never gets initialized e.g., because the memory
> + * block never gets onlined).
> + */
> + memset((void *)start, 0, sizeof(struct page));
> +}
> +
> +static void __meminit vmemmap_use_sub_pmd(unsigned long start, unsigned long end)
> +{
> + /*
> + * We only optimize if the new used range directly follows the
> + * previously unused range (esp., when populating consecutive sections).
> + */
> + if (unused_pmd_start == start) {
> + if (likely(IS_ALIGNED(end, PMD_SIZE)))
> + unused_pmd_start = 0;
> + else
> + unused_pmd_start = end;
> + return;
> + }
> +
> + vmemmap_flush_unused_pmd();
> + __vmemmap_use_sub_pmd(start);
> +}
> +
> +static void __meminit vmemmap_use_new_sub_pmd(unsigned long start, unsigned long end)
> +{
> + vmemmap_flush_unused_pmd();
> +
> + /*
> + * Could be our memmap page is filled with PAGE_UNUSED already from a
> + * previous remove.
> + */
> + __vmemmap_use_sub_pmd(start);
> +
> + /*
> + * Mark the unused parts of the new memmap range
> + */
> + if (!IS_ALIGNED(start, PMD_SIZE))
> + memset((void *)start, PAGE_UNUSED,
> + start - ALIGN_DOWN(start, PMD_SIZE));
> + /*
> + * We want to avoid memset(PAGE_UNUSED) when populating the vmemmap of
> + * consecutive sections. Remember for the last added PMD the last
> + * unused range in the populated PMD.
> + */
> + if (!IS_ALIGNED(end, PMD_SIZE))
> + unused_pmd_start = end;
> +}
> +#endif
>
> static void __meminit free_pagetable(struct page *page, int order)
> {
> @@ -1006,7 +1092,6 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
> unsigned long next, pages = 0;
> pte_t *pte_base;
> pmd_t *pmd;
> - void *page_addr;
>
> pmd = pmd_start + pmd_index(addr);
> for (; addr < end; addr = next, pmd++) {
> @@ -1027,12 +1112,11 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
> spin_unlock(&init_mm.page_table_lock);
> pages++;
> } else {
> - /* If here, we are freeing vmemmap pages. */
> - memset((void *)addr, PAGE_INUSE, next - addr);
> -
> - page_addr = page_address(pmd_page(*pmd));
> - if (!memchr_inv(page_addr, PAGE_INUSE,
> - PMD_SIZE)) {
> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> + /*
> + * Free the PMD if the whole range is unused.
> + */
> + if (vmemmap_unuse_sub_pmd(addr, next)) {
> free_hugepage_table(pmd_page(*pmd),
> altmap);
>
> @@ -1040,6 +1124,7 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
> pmd_clear(pmd);
> spin_unlock(&init_mm.page_table_lock);
> }
> +#endif
> }

This doesn't look like the world's longest if() statement, but it might
be nice to use the IS_ENABLED() syntax instead of an #ifdef. I suspect
the compiler could even make quick work of the static functions that
never get called as a result.

> continue;
> @@ -1492,11 +1577,16 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
>
> addr_end = addr + PMD_SIZE;
> p_end = p + PMD_SIZE;
> +
> + if (!IS_ALIGNED(addr, PMD_SIZE) ||
> + !IS_ALIGNED(next, PMD_SIZE))
> + vmemmap_use_new_sub_pmd(addr, next);
> continue;
> } else if (altmap)
> return -ENOMEM; /* no fallback */
> } else if (pmd_large(*pmd)) {
> vmemmap_verify((pte_t *)pmd, node, addr, next);
> + vmemmap_use_sub_pmd(addr, next);
> continue;
> }
> if (vmemmap_populate_basepages(addr, next, node, NULL))
>

This overall looks like a good thing to do. The implementation is even
pretty nice and simple. But, it took me an awfully long time to figure
out what was going on.

I wonder if you could take one more pass at these and especially see if
you can better explain the use of 'unused_pmd_start'.