Re: [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range

From: David Hildenbrand
Date: Mon Jan 25 2021 - 06:18:02 EST


On 25.01.21 11:56, Oscar Salvador wrote:
> On Mon, Jan 25, 2021 at 11:39:55AM +0100, Oscar Salvador wrote:
>>> Interesting, so we automatically support differing sizeof(struct
>>> page). I guess it will be problematic in case sizeof(struct page) !=
>>> 64, because then we might not have multiples of 2MB for the memmap of a
>>> memory block.
>>>
>>> IIRC, it could happen that if we add two consecutive memory blocks,
>>> the second one might reuse parts of the vmemmap residing on the first
>>> memory block. If you remove the first one, you might be in trouble.
>>>
>>> E.g., on x86-64
>>> vmemmap_populate()->vmemmap_populate_hugepages()->vmemmap_alloc_block_buf():
>>> - Populate a huge page
>>>
>>> vmemmap_free()->remove_pagetable()...->remove_pmd_table():
>>> - memchr_inv() will leave the hugepage populated.
>>>
>>> Do we want to fence that off, maybe in mhp_supports_memmap_on_memory()?
>>> Or do we somehow want to fix that? We should never populate partial huge
>>> pages from an altmap ...
>>>
>>> But maybe I am missing something.
>>
>> No, you are not missing anything.
>>
>> I think that remove_pmd_table() should be fixed.
>> vmemmap_populate_hugepages() allocates a PMD_SIZE chunk and marks the PMD as
>> PAGE_KERNEL_LARGE, so when remove_pmd_table() sees that 1) the PMD
>> is large and 2) the chunk is not aligned, the memset() should be writing
>> PAGE_INUSE over the PMD chunk.
>>
>> I do not really see a reason why this should not happen.
>> Actually, we will be leaving pagetables behind as we never get to free
>> the PMD chunk when sizeof(struct page) is not a multiple of 2MB.
>>
>> I plan to fix remove_pmd_table(), but for now let us not allow using this
>> feature if sizeof(struct page) is not 64.
>> Let us keep it simple for the time being, shall we?
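
Right. Just to spell out what I'd expect such a fence to look like: a
completely untested sketch against mhp_supports_memmap_on_memory() from this
series (the memmap_on_memory toggle and the exact helper names are only
assumed here):

static bool mhp_supports_memmap_on_memory(unsigned long size)
{
        unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
        unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);

        /*
         * Only place the memmap on the hot-added range if it completely
         * fills huge pages; otherwise remove_pmd_table() would have to
         * free partially used PMDs that are backed by altmap memory.
         */
        return memmap_on_memory &&
               size == memory_block_size_bytes() &&
               IS_ALIGNED(vmemmap_size, PMD_SIZE);
}

With sizeof(struct page) == 64 and a 128 MiB block, vmemmap_size is exactly
2 MiB and the check passes; with 56 bytes it is 1835008 and the feature is
simply refused.
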
>
> My first intention was:
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index b5a3fa4033d3..0c9756a2eb24 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1044,32 +1044,14 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
>                          continue;
> 
>                  if (pmd_large(*pmd)) {
> -                        if (IS_ALIGNED(addr, PMD_SIZE) &&
> -                            IS_ALIGNED(next, PMD_SIZE)) {
> -                                if (!direct)
> -                                        free_hugepage_table(pmd_page(*pmd),
> -                                                            altmap);
> -
> -                                spin_lock(&init_mm.page_table_lock);
> -                                pmd_clear(pmd);
> -                                spin_unlock(&init_mm.page_table_lock);
> -                                pages++;
> -                        } else {
> -                                /* If here, we are freeing vmemmap pages. */
> -                                memset((void *)addr, PAGE_INUSE, next - addr);
> -
> -                                page_addr = page_address(pmd_page(*pmd));
> -                                if (!memchr_inv(page_addr, PAGE_INUSE,
> -                                                PMD_SIZE)) {
> -                                        free_hugepage_table(pmd_page(*pmd),
> -                                                            altmap);
> -
> -                                        spin_lock(&init_mm.page_table_lock);
> -                                        pmd_clear(pmd);
> -                                        spin_unlock(&init_mm.page_table_lock);
> -                                }
> -                        }
> +                        if (!direct)
> +                                free_hugepage_table(pmd_page(*pmd),
> +                                                    altmap);
> +
> +                        spin_lock(&init_mm.page_table_lock);
> +                        pmd_clear(pmd);
> +                        spin_unlock(&init_mm.page_table_lock);
> +                        pages++;
>                          continue;
>                  }
>
> I did not try it out yet and this might explode badly, but AFAICS, a PMD-sized
> chunk is always allocated even when the section does not span 2MB.
> E.g.: sizeof(struct page) = 56.
>
> PAGES_PER_SECTION * 64 = 2097152
> PAGES_PER_SECTION * 56 = 1835008
>
> Even in the latter case, vmemmap_populate_hugepages() will allocate a PMD-sized
> chunk to satisfy the unaligned range.
> So, unless I am missing something, it should not be a problem to free that 2MB in
> remove_pmd_table() when 1) the PMD is large and 2) the range is not aligned.

Assume you have two consecutive memory blocks with sizeof(struct page) == 56.
The first one allocates a PMD (2097152 bytes) but only consumes 1835008 of them;
the second one reuses the remaining part of that PMD and allocates another PMD
for the rest of its 1835008-byte memmap, again only using part of it.

Ripping out a memory block, along with the PMD in its vmemmap, would
remove parts of the vmemmap of another memory block.
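
To spell out the numbers, here is a small userspace sketch (not kernel code,
just the arithmetic; it assumes 128 MiB memory blocks, i.e. one 32768-page
section per block):

#include <stdio.h>

#define PMD_SIZE                (2UL * 1024 * 1024)
#define PAGES_PER_SECTION       32768UL /* one 128 MiB block == one section */
#define STRUCT_PAGE_SIZE        56UL    /* e.g. !CONFIG_MEMCG && !CONFIG_SLUB */

int main(void)
{
        unsigned long memmap = PAGES_PER_SECTION * STRUCT_PAGE_SIZE; /* 1835008 */
        unsigned long block1_end = memmap;              /* ends inside PMD #0 */
        unsigned long block2_end = 2 * memmap;          /* ends inside PMD #1 */

        printf("block1 memmap: [0, %lu) -> ends in PMD #%lu\n",
               block1_end, (block1_end - 1) / PMD_SIZE);
        printf("block2 memmap: [%lu, %lu) -> starts in PMD #%lu, ends in PMD #%lu\n",
               block1_end, block2_end,
               block1_end / PMD_SIZE, (block2_end - 1) / PMD_SIZE);
        /*
         * Freeing PMD #0 when removing block1 would also rip out the first
         * 262144 bytes (2097152 - 1835008) of block2's memmap.
         */
        return 0;
}
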

You might want to take a look at:

commit 2c114df071935762ffa88144cdab03d84beaa702
Author: David Hildenbrand <david@xxxxxxxxxx>
Date: Wed Jul 22 11:45:58 2020 +0200

s390/vmemmap: avoid memset(PAGE_UNUSED) when adding consecutive sections

Let's avoid memset(PAGE_UNUSED) when adding consecutive sections,
whereby the vmemmap of a single section does not span full PMDs.

Cc: Vasily Gorbik <gor@xxxxxxxxxxxxx>
Cc: Christian Borntraeger <borntraeger@xxxxxxxxxx>
Cc: Gerald Schaefer <gerald.schaefer@xxxxxxxxxx>
Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
Message-Id: <20200722094558.9828-10-david@xxxxxxxxxx>
Signed-off-by: Heiko Carstens <hca@xxxxxxxxxxxxx>

commit cd5781d63eaf6dbf89532d8c7c214786b767ee16
Author: David Hildenbrand <david@xxxxxxxxxx>
Date: Wed Jul 22 11:45:57 2020 +0200

s390/vmemmap: remember unused sub-pmd ranges

With a memmap size of 56 bytes or 72 bytes per page, the memmap for a
256 MB section won't span full PMDs. As we populate single sections and
depopulate single sections, the depopulation step would not be able to
free all vmemmap pmds anymore.

Do it similarly to x86, marking the unused memmap ranges in a special way
(pad it with 0xFD).

This allows us to add/remove sections, cleaning up all allocated
vmemmap pages even if the memmap size is not multiple of 16 bytes per page.

A 56 byte memmap can, for example, be created with !CONFIG_MEMCG and
!CONFIG_SLUB.

Cc: Vasily Gorbik <gor@xxxxxxxxxxxxx>
Cc: Christian Borntraeger <borntraeger@xxxxxxxxxx>
Cc: Gerald Schaefer <gerald.schaefer@xxxxxxxxxx>
Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
Message-Id: <20200722094558.9828-9-david@xxxxxxxxxx>
Signed-off-by: Heiko Carstens <hca@xxxxxxxxxxxxx>
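
The core idea in both commits, heavily simplified (this is not the actual s390
code, and the helper names below are made up for illustration): mark the
not-yet-used sub-PMD part of the vmemmap with a magic byte, and only free the
2 MiB page once the whole PMD range carries that marker.

#define PAGE_UNUSED     0xFD

/* Mark a not-yet-used sub-PMD range of the vmemmap. */
static void vmemmap_mark_unused(unsigned long start, unsigned long end)
{
        memset((void *)start, PAGE_UNUSED, end - start);
}

/* Free the vmemmap huge page only once the whole PMD range is unused. */
static bool vmemmap_pmd_is_unused(unsigned long addr)
{
        unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);

        return !memchr_inv((void *)start, PAGE_UNUSED, PMD_SIZE);
}

That works because those memmap pages are ordinary memory we fully control;
as noted below, it does not translate to memmap pages that live in the altmap
space of the memory block being removed.
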


But again, the root issue I see is that we can play such games with anonymous pages, but not with
memory residing in the altmap space.

--
Thanks,

David / dhildenb