Re: [PATCH v5 1/5] mm,memory_hotplug: Allocate memmap from the added memory range

From: Oscar Salvador
Date: Wed Mar 24 2021 - 06:13:49 EST


On Tue, Mar 23, 2021 at 11:11:58AM +0100, Michal Hocko wrote:
> [Sorry for a long overdue review. I didn't have time to follow previous
> versions so I am sorry if some of my concerns have been discussed
> already]

No worries, let's go ;-)

> I was playing with movable_node and performance implications back in
> 2017 (unfortunately I do not have specific numbers anymore) and the
> setup was a bit extreme - a single node (0) with normal zones and all
> other nodes with movable memory only. So not only struct pages but any
> other kernel metadata were on a remote node. I remember I could see
> clear performance drop scaling with the distance from node 0 somewhere
> betweem 5-10% on kbuild bound on a movable node.

I see. Yes, it is a rather extreme case, but I think it clearly shows the impact
of having metadata structures on a non-local node.


> In fact beginning of the memory block should be sufficient as sections
> cannot be hotremoved without the rest of the memory block.

Sorry, I meant memory block here.

> > struct pages which back the allocated space then just need to be treated
> > carefully.
> >
> > Implementation wise we will reuse vmem_altmap infrastructure to override
> > the default allocator used by __populate_section_memmap.
> > Part of the implementation also relies on memory_block structure gaining
> > a new field which specifies the number of vmemmap_pages at the beginning.
>
> Here you are talking about memory block rather than section.

Yes, see above, it should have been memory block in both cases.

> > Hot-remove:
> >
> > We need to be careful when removing memory, as adding and
> > removing memory needs to be done with the same granularity.
> > To check that this assumption is not violated, we check the
> > memory range we want to remove and if a) any memory block has
> > vmemmap pages and b) the range spans more than a single memory
> > block, we scream out loud and refuse to proceed.
>
> Is this a real problem? If each memory block has its own vmemmap then we
> should be just fine, no?

Not entirely.

Assume this:

- memory_block_size = 128MB
- add_memory(256MB) : no uses altmap because size != memory_block_size
- add_memory(128MB) : uses altmap

Now, when trying to remove the memory, we should construct the altmap to let
remove_pmd_table->free_hugepage_table() know that it needs to call vmem_altmap_free()
instead of free_pagetable() for those sections that were populated using altmap.

But that becomes trickier to handle if user does remove_memory(384MB) at once.

The only reasonable way I can think of is something like:

/*
* Try to diferentiate which ranges used altmap when populating vmemmap,
* and construct the altmap for those
*/
loop(size / section_size)
if (range_used altmap)
arch_remove_memory(nid, start, size, altmap);
else
arch_remove_memory(nid, start, size, NULL);

But I do not think this is any better than make this scenario completely a NO-NO,
because in the end, this is asking for trouble.
And yes, normal qemu/barematal users does not have the hability to play these
kind of tricks, as baremetal has HW limitations and qemu creates a device for
every range you hot-add (so you are tied to that device when removing memory
as well), but other users e.g: virtio-mem can do that.

> I would appreciate some more description of the patch itself. The above
> outlines a highlevel problems and design. The patch is quite large and
> it acts on several layers - physical hotplug, {on,off}lining and sysfs
> layer.

Ok, will try to come up with something more complete wrt changelog.

> Let me capture my thinking:
> - from the top level
> - sysfs interfaces - memory block is extended to contain the number of
> vmemmap pages reserved from the beginning of the block for all
> memory sections belonging to the block.
yes

> - add_memory_resource is the entry point to reserve the vmemmap space
> for the block. This is an opt-in feature (MHP_MEMMAP_ON_MEMORY) and
> there is no current user at this stage.
yes

> - vmem_altmap is instructed to use the reserved vmemmap space as the
> backing storage for the vmemmap struct pages. Via arch_add_memory->
> __populate_section_memmap.
yes

> - online_pages for some reason needs to know about the reserved vmemmap
> space. Why? It already knows the intial pfn to online. Why cannot
> caller simply alter both start pfn and nr_pages to online everything
> after the vmemmap space? This is somehow conflating the mem block
> concept deeper into onlining.
> - the same applies to offlining.

Because some counters need not only the buddy_nr_pages, but the complete
range.

So, let us see what online_pages() do (offline_pages() as well but slightly
different in some areas)

- move_pfn_range_to_zone():
1) Resize node and zone spanned pages
* If we were only to pass the nr_pages without the vmemmap pages,
node/zone's spanned pages would be wrong as vmemmap pages would not
be accounted in there.

2) Inits struct pages by memmap_init_range() and sets its migratetype
* If we were only to pass the nr_pages without the vmemmap pages,
vmemmap pages would be totally unitialized.
We also set its migratetype to MIGRATE_UNMOVABLE.
Previous versions initialize vmemmap pages in another place but
there was a consensus to do it here.

So on, this case, we have:

if (nr_vmemmap_pages)
move_pfn_range_to_zone(zone, pfn, nr_vmemmap_pages, NULL,
MIGRATE_UNMOVABLE);
move_pfn_range_to_zone(zone, buddy_start_pfn, buddy_nr_pages, NULL
MIGRATE_ISOLATE);

- Increment zone->present_pages
* We need to account buddy_pages + vmemmap_pages here

- zone->zone_pgdat->node_present_pages
* Same as above

- online_pages_range() (onlines the pages, __and__ the sections)
* Here do not only need the (buddy_pages, end_pages), but (vmemmap_pages, end_pages)
as well, because on one hand we do:

online_pages_range()
{
for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
(*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);

online_mem_sections(start_pfn, end_pfn);
}

For the call to online_mem_sections, we need to whole range (including the vmemmap
pages), otherwise, if a whole section only contains vmemmap pages, the section
might be left marked as offline, and that is troublesome.


As I said, the same applies to offline_pages(), but with slightly tweaks here and
there because it handles somewhat different things.

I kind of understand to be reluctant to use vmemmap_pages terminology here, but
unfortunately we need to know about it.
We could rename nr_vmemmap_pages to offset_buddy_pages or something like that.


> - finally hotremove - which is the most tricky part. try_remove_memory
> learns about vmemmap reserved space and provides it to __remove_pages
> and eventually all the way down to remove_pagetable via altmap
> Now a question and something I have stumbled over few years back when
> looking into this. Let's say you have multi section memblock so the
> first section of the block backs vmemmaps for all other sections.
> What happens when you drop the first worth of section before tearing
> down all other vmemmaps?

I guess you refer to the case were:

- memory_block_size: 1GB (8 sections)
[memory_block] : first 4096 pfns are for vmemmap

Nothing happens, but I see where your comment is comming from.

Back in 2017, in your prototype, there were two different things:

- try_remove_memory (I dunno how it was called back then) still worked
with pages, not pfns
- arch_memory_memory() either did not have the altmap stuff, or we were
not setting it properly, but I remember that in your prototype
you were handling vmemmap pages in free_hugepage_table()->free_pagetable()
being carefull to not free them.

Back then, when removing the first vmemmap backing further sections, when
then dereferencing those sections in free_pagetable(), we would crash because
the mapping was not there anymore.
This cannot longer happen.

> [...]
> > @@ -185,10 +185,11 @@ memory_block_action(unsigned long start_section_nr, unsigned long action,
> >
> > switch (action) {
> > case MEM_ONLINE:
> > - ret = online_pages(start_pfn, nr_pages, online_type, nid);
> > + ret = online_pages(start_pfn, nr_pages, nr_vmemmap_pages,
> > + online_type, nid);
> > break;
>
> I would just offset start_pfn and nr_pages.

As stated above, this is not possible.


> > @@ -603,7 +606,7 @@ static int add_memory_block(unsigned long base_section_nr)
> > if (section_count == 0)
> > return 0;
> > return init_memory_block(memory_block_id(base_section_nr),
> > - MEM_ONLINE);
> > + MEM_ONLINE, 0);
>
> This would deserve a comment.
> /* Early init code to create memory blocks for all the memory.
> * Backed by bootmem struct pages so no vmemmap reserved space.
> */

Ok, will add it.

> > {
> > const unsigned long end_pfn = start_pfn + nr_pages;
> > - unsigned long pfn;
> > + unsigned long pfn = buddy_start_pfn;
> > +
> > + /*
> > + * When using memmap_on_memory, the range might be unaligned as the
> > + * first pfns are used for vmemmap pages. Align it in case we need to.
> > + */
> > + VM_BUG_ON(!IS_ALIGNED(pfn, pageblock_nr_pages));
>
> No this is not something VM_BUG_ON should be used for. This is perfectly
> recoverable situation. Besides that this is a wrong layer to care. All
> the fixup should happen up in the call chain.

This should not happen anymore as mhp_support_memmap_on_memory() does not let
to use MHP_MEMMAP_ON_MEMORY if range is not pageblock_nr_pages.

So this can go.

> > int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
> > - int online_type, int nid)
> > + unsigned long nr_vmemmap_pages, int online_type, int nid)
> > {
> > - unsigned long flags;
> > + unsigned long flags, buddy_start_pfn, buddy_nr_pages;
> > struct zone *zone;
> > int need_zonelists_rebuild = 0;
> > int ret;
>
> As already mentioned I believe this would be much easier to follow if
> the given pfn really denotes a first pfn to online rather than learn the
> code about vmemmap space which is not really interesting from the
> onlining POV. Struct pages are already create. All we need is to online
> them for using.
> Just have a look at pfn vs. buddy_start_pfn usage. Why should
> zone_for_pfn_range, node_states_check_changes_online, memory_notify ase
> the former rather than later? As mentioned above online_pages_range is
> just more complex by doing that.
>
> Sure there are some consistency checks which are more convenient with
> the actual pfn start but I believe those shouldn't be a reason for
> obfuscating the code and mixing layers.

I think I explained this above, but let me repeat just in case.
Take into account that boot vmemmap_pages are also accounted to:
- zone's spanned_pages/present_pages
- node's spanned_pages/present_pages

And those pages are also initialized somehow, so we need to initialize the hotplug
vmemmap pages as well, and account them.

As I said, we can use a different terminology and name it different, but we need to
- properly account them
- properly initialize them

And I __guess__ we could do it somewhere off the {online,offline_pages()) land,
but I see that trickier and not worh it.

> > + IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) &&
> > + size == memory_block_size_bytes() &&
> > + IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
> > + IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
>
> This is likely more complex than necessary. Is it ever possible that
> remaining_size won't be aligned properly when vmemmap_size is PMD_SIZE
> aligned?

Yes, on arm64 with large pages depending on HUGETLB support this can lead to
one condition be true while the other not.

> > @@ -1563,10 +1639,11 @@ static int count_system_ram_pages_cb(unsigned long start_pfn,
> > return 0;
> > }
> >
> > -int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> > +int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> > + unsigned long nr_vmemmap_pages)
>
> same concern as online pages. Nobody should really care about vmemmap
> reserved space. Maybe the accounting (count_system_ram_pages_cb) will
> need some compensation but I have to say I got lost in this accounting
> wrt to memory hotplugged memory. Where do we account hotadded memory to
> system_ram_pages?

Quick summary of account:

- online_pages()->move_pfn_range_to_zone(): Accounts for node/zone's spanned pages
- online_pages()->zone->present_pages += nr_pages;
- zone->zone_pgdat->node_present_pages += nr_pages;
- online_pages()->online_pages_range()->generic_online_page()->totalram_pages_add():
Accounts for totalram_pages
- online_pages()->adjust_managed_page_count(): Accounts for zone->managed_pages

So, as you can see, we have a mix with of spanned_pages,present_pages and
managed_pages in both {offline,online}_pages().
Vmemmap pages need to be properly accounted to spanned_pages,present_pages,
as we account bootmem vmemmap pages, but they do not be accounted in
managed_pages.

> This made me scratch my head. I do not think this works for size
> spanning multiple memory blocks. Maybe we do not allow something like
> that happening. The logic seems inside out to me. I believe you want to
> either pull arch_remove_memory into the walk_memory_blocks callback and
> handle each memory block this way.

Here, what we fence off is the scenario I mentioned at the beginning, where
someone may call try_remove_memory() with memory_blocks both containing and
not vmemmap_pages.

So, if the user opt-in the feature, he needs to work with the same granularity
in the add and remove operations.

Well, that was a good feedback indeed, and a large one, I hope to have clarified
some of the questions raised.

--
Oscar Salvador
SUSE L3