Re: [PATCH v2 0/5] Allocate memmap from hotadded memory

From: David Hildenbrand
Date: Wed Jun 26 2019 - 04:11:26 EST


On 26.06.19 10:03, Oscar Salvador wrote:
> On Tue, Jun 25, 2019 at 10:25:48AM +0200, David Hildenbrand wrote:
>>> [Coverletter]
>>>
>>> This is another step to make memory hotplug more usable. The primary
>>> goal of this patchset is to reduce memory overhead of the hot-added
>>> memory (at least for SPARSEMEM_VMEMMAP memory model). The current way we use
>>> to populate memmap (struct page array) has two main drawbacks:
>
> First off, thanks for looking into this :-)

Thanks for working on this ;)

>
>>
>> Mental note: How will it be handled if a caller specifies "Allocate
>> memmap from hotadded memory", but we are running under SPARSEMEM where
>> we can't do this.
>
> In add_memory_resource(), we have a call to mhp_check_correct_flags(), which is
> in charge of checking if the flags passed are compliant with our configuration
> among other things.
> It also checks if both flags were passed (_MEMBLOCK|_DEVICE).
>
> If a) any of the flags were specified and we are not on CONFIG_SPARSEMEM_VMEMMAP,
> b) the flags are colliding with each other or c) the flags just do not make sense,
> we print out a warning and drop the flags to 0, so we just ignore them.
>
> I just realized that I can adjust the check even more (something for the next
> version).
>
> But to answer your question, flags are ignored under !CONFIG_SPARSEMEM_VMEMMAP.

So it is indeed a hint only.

>
>>
>>>
>>> a) it consumes an additional memory until the hotadded memory itself is
>>> onlined and
>>> b) memmap might end up on a different numa node which is especially true
>>> for movable_node configuration.
>>>
>>> a) it is a problem especially for memory hotplug based memory "ballooning"
>>> solutions when the delay between physical memory hotplug and the
>>> onlining can lead to OOM and that led to introduction of hacks like auto
>>> onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
>>> policy for the newly added memory")).
>>>
>>> b) can have performance drawbacks.
>>>
>>> Another minor case is that I have seen hot-add operations failing on archs
>>> because they were running out of order-x pages.
>>> E.g On powerpc, in certain configurations, we use order-8 pages,
>>> and given 64KB base pagesize, that is 16MB.
>>> If we run out of those, we just fail the operation and we cannot add
>>> more memory.
>>
>> At least for SPARSEMEM, we fallback to vmalloc() to work around this
>> issue. I haven't looked into the populate_section_memmap() internals
>> yet. Can you point me at the code that performs this allocation?
>
> Yes, on SPARSEMEM we first try to allocate the pages physical configuous, and
> then fallback to vmalloc.
> This is because on CONFIG_SPARSEMEM memory model, the translations pfn_to_page/
> page_to_pfn do not expect the memory to be contiguous.
>
> But that is not the case on CONFIG_SPARSEMEM_VMEMMAP.
> We do expect the memory to be physical contigous there, that is why a simply
> pfn_to_page/page_to_pfn is a matter of adding/substracting vmemmap/pfn.

Yeas, I explored that last week but didn't figure out where the actual
vmmap population code resided - thanks :)

>
> Powerpc code is at:
>
> https://elixir.bootlin.com/linux/v5.2-rc6/source/arch/powerpc/mm/init_64.c#L175
>
>
>
>> So, assuming we add_memory(1GB, MHP_MEMMAP_DEVICE) and then
>> remove_memory(128MB) of the added memory, this will work?
>
> No, MHP_MEMMAP_DEVICE is meant to be used when hot-adding and hot-removing work
> in the same granularity.
> This is because all memmap pages will be stored at the beginning of the memory
> range.
> Allowing hot-removing in a different granularity on MHP_MEMMAP_DEVICE would imply
> a lot of extra work.
> For example, we would have to parse the vmemmap-head of the hot-removed range,
> and punch a hole in there to clear the vmemmap pages, and then be very carefull
> when deleting those pagetables.
>
> So I followed Michal's advice, and I decided to let the caller specify if he
> either wants to allocate per memory block or per hot-added range(device).
> Where per memory block, allows us to do:
>
> add_memory(1GB, MHP_MEMMAP_MEMBLOCK)
> remove_memory(128MB)

Back then, I already mentioned that we might have some users that
remove_memory() they never added in a granularity it wasn't added. My
concerns back then were never fully sorted out.

arch/powerpc/platforms/powernv/memtrace.c

- Will remove memory in memory block size chunks it never added
- What if that memory resides on a DIMM added via MHP_MEMMAP_DEVICE?

Will it at least bail out? Or simply break?

IOW: I am not yet 100% convinced that MHP_MEMMAP_DEVICE is save to be
introduced.

--

Thanks,

David / dhildenb