Re: [PATCH v9 4/8] mm,memory_hotplug: Allocate memmap from the added memory range
From: Oscar Salvador
Date: Wed Apr 21 2021 - 04:15:58 EST
On Tue, Apr 20, 2021 at 12:56:03PM +0200, Michal Hocko wrote:
> On Fri 16-04-21 13:24:07, Oscar Salvador wrote:
> > Physical memory hotadd has to allocate a memmap (struct page array) for
> > the newly added memory section. Currently, alloc_pages_node() is used
> > for those allocations.
> >
> > This has some disadvantages:
> > a) an existing memory is consumed for that purpose
> > (eg: ~2MB per 128MB memory section on x86_64)
>
> I would extend this slightly. This can even lead to extreme cases where
> system goes OOM because the physically hotplugged memory depletes the
> available memory before it is onlined.
Ok.
> > Vmemap page tables can map arbitrary memory.
> > That means that we can simply use the beginning of each memory section and
> > map struct pages there.
>
> Again this can be confusing because this is not what is really happening
> in practice because we are going to have a multisection memory block
> where all sections will be backed by a common reserved space rather than
> per section sparse space. I would go with
>
> "
> Vmemap page tables can map arbitrary memory. That means that we can
> reserve a part of the physically hotadded memory to back vmemmap page
> tables. This implementation uses the beggining of the hotplugged memory
> for that purpose.
> "
Yeah, I thought I fixed that, it should have been "That means that we can simply
use the beginning of each memory block...", but I am ok with your rewording.
> There is quite a large leap from __populate_section_memmap to the
> memory_block that deserves explaining to not lose all the subtle things
> discussed in the past. I think it should be made clear why all the fuzz.
> I would structure it as follows:
> "
> There are some non-obiously things to consider though. Vmemmap
> pages are allocated/freed during the memory hotplug events
> (add_memory_resource, try_remove_memory) when the memory is
> added/removed. This means that the reserved physical range is not online
> yet it is used. The most obvious side effect is that pfn_to_online_page
> returns NULL for those pfns. The current design expects that this
> should be OK as the hotplugged memory is considered a garbage until it
> is onlined. For example hibernation wouldn't save the content of those
> vmmemmaps into the image so it wouldn't be restored on resume but this
> should be OK as there no real content to recover anyway while metadata
> is reachable from other data structures (e.g. vmemmap page tables).
>
> The reserved space is therefore (de)initialized during the {on,off}line
> events (mhp_{de}init_memmap_on_memory). That is done by extracting page
> allocator independent initialization from the regular onlining path.
> The primary reason to handle the reserved space outside of {on,off}line_pages
> is to make each initialization specific to the purpose rather than
> special case them in a single function.
Ok, that definitely adds a valuable information.
> > diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> > index f209925a5d4e..2e2b2f654f0a 100644
> > --- a/drivers/base/memory.c
> > +++ b/drivers/base/memory.c
> > @@ -173,16 +173,72 @@ static int memory_block_online(struct memory_block *mem)
> > {
> > unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
> > unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
> > + unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
> > + struct zone *zone;
> > + int ret;
> > +
> > + zone = zone_for_pfn_range(mem->online_type, mem->nid, start_pfn, nr_pages);
> > +
> > + /*
> > + * Although vmemmap pages have a different lifecycle than the pages
> > + * they describe (they remain until the memory is unplugged), doing
> > + * their initialization and accounting at memory onlining/offlining
> > + * stage simplifies things a lot.
>
> "simplify things a lot" is not really helpful to people reading the
> code. It would be much better to state reasons here. I would go with
> * stage helps to keep accounting easier to follow - e.g.
> * vmemmaps belong to the same zone as the onlined memory.
Ok
> > static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
> > {
> > const unsigned long end_pfn = start_pfn + nr_pages;
> > - unsigned long pfn;
> > + unsigned long pfn = start_pfn;
> > +
> > + while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
> > + (*online_page_callback)(pfn_to_page(pfn), pageblock_order);
> > + pfn += pageblock_nr_pages;
> > + }
>
> I believe we do not need to check for nr_pages as the actual operation
> will never run out of range in practice but the code is more subtle than
If you mean that IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES) can go, that is not
right.
Of course, with your changes below it would not be necesary.
> necessary. Using two different iteration styles is also hurting the code
> readability. I would go with the following
> for (pfn = start_pfn; pfn < end_pfn; ) {
> unsigned long order = min(MAX_ORDER - 1UL, __ffs(pfn));
>
> while (start + (1UL << order) > end_pfn)
> order--;
> (*online_page_callback)(pfn_to_page(pfn), pageblock_order);
> pfn += 1 << order;
> }
>
> which is what __free_pages_memory does already.
this is kinda what I used to have in the early versions, but it was agreed
with David to split it in two loops to make it explicit.
I can go back to that if it is preferred.
> > + if (memmap_on_memory) {
> > + nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
> > + get_nr_vmemmap_pages_cb);
> > + if (nr_vmemmap_pages) {
> > + if (size != memory_block_size_bytes()) {
> > + pr_warn("Refuse to remove %#llx - %#llx,"
> > + "wrong granularity\n",
> > + start, start + size);
> > + return -EINVAL;
> > + }
> > +
> > + /*
> > + * Let remove_pmd_table->free_hugepage_table do the
> > + * right thing if we used vmem_altmap when hot-adding
> > + * the range.
> > + */
> > + mhp_altmap.alloc = nr_vmemmap_pages;
> > + altmap = &mhp_altmap;
> > + }
> > + }
> > +
> > /* remove memmap entry */
> > firmware_map_remove(start, start + size, "System RAM");
>
> I have to say I still dislike this and I would just wrap it inside out
> and do the operation from within walk_memory_blocks but I will not
> insist.
I have to confess I forgot about the details of that dicussion, as we were
quite focused on decoupling vmemmap pages from {online,offline} interface.
Would you mind elaborating a bit more?
--
Oscar Salvador
SUSE L3