Re: [PATCH v5 1/5] mm,memory_hotplug: Allocate memmap from the added memory range

From: David Hildenbrand
Date: Wed Mar 24 2021 - 15:18:59 EST


On 24.03.21 17:04, Michal Hocko wrote:
On Wed 24-03-21 15:52:38, David Hildenbrand wrote:
On 24.03.21 15:42, Michal Hocko wrote:
On Wed 24-03-21 13:03:29, Michal Hocko wrote:
On Wed 24-03-21 11:12:59, Oscar Salvador wrote:
[...]
I kind of understand the reluctance to use vmemmap_pages terminology here, but
unfortunately we need to know about it.
We could rename nr_vmemmap_pages to offset_buddy_pages or something like that.

I am not convinced. It seems you are just trying to graft the new
functionality in. But I still believe that {on,off}lining shouldn't care
about where their vmemmaps come from at all. It should be the
responsibility of the code which reserves that space to compensate for
accounting. Otherwise we will end up with hard-to-maintain code
because expectations would be spread across way too many places. Not to
mention different pfns that the code should care about.

The below is a quick hack on top of this patch to illustrate my
thinking. I have dug out all the vmemmap pieces out of the
{on,off}lining and hooked all the accounting when the space is reserved.
This just compiles without any deeper look so there are likely some
minor problems but I haven't really encountered any major problems or
hacks to introduce into the code. The separation seems to be possible.
The diffstat also looks promising. Am I missing something fundamental in
this?


From a quick glimpse, this touches on two things discussed in the past:

1. If the underlying memory block is offline, all sections are offline. Zone
shrinking code will happily skip over the vmemmap pages and you can end up
with out-of-zone pages assigned to the zone. Can happen in corner cases.

You are right. But do we really care? Those pages should be of no
interest to anybody iterating through zones/nodes anyway.

Well, we were just discussing getting zone/node links + span right for all pages (including for special reserved pages), because it already resulted in BUGs. So I am not convinced that we *don't* have to care.

However, I agree that most code that cares about node/zone spans shouldn't care - e.g., never call set_pfnblock_flags_mask() on such blocks.

But I guess there are corner cases where we would end up with zone_is_empty() == true; not sure what the effect of that would be ... at least the node cannot vanish, as we disallow offlining it while we have a memory block linked to it.


Another thing that comes to my mind is that our zone shrinking code currently searches in PAGES_PER_SUBSECTION (2 MiB IIRC) increments. If our vmemmap pages cover less than that, we could accidentally shrink the !vmemmap part too much, as we would mis-detect the type of a PAGES_PER_SUBSECTION block.

IIRC, this would apply to memory block sizes < 128 MiB. Not relevant on x86 and arm64. Could be relevant for ppc64, if we'd ever want to support memmap_on_memory there. Or if we'd ever reduce the section size on some arch below 128 MiB. At least we would have to fence it somehow.
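
To make the 128 MiB figure concrete, here is a small user-space sketch of the arithmetic. It assumes 4 KiB base pages and a 64-byte struct page (both are assumptions for illustration and differ between archs/configs), and the helper names are hypothetical, not kernel functions:

```c
#include <stddef.h>

/* Assumptions for illustration only: 4 KiB base pages, 64-byte struct page. */
#define TOY_PAGE_BYTES        4096UL
#define TOY_STRUCT_PAGE_BYTES 64UL
/* Subsection size as recalled in the mail above: 2 MiB. */
#define TOY_SUBSECTION_BYTES  (2UL << 20)

/* Bytes of memmap needed to describe a memory block of 'block_bytes'. */
static unsigned long vmemmap_bytes(unsigned long block_bytes)
{
	return block_bytes / TOY_PAGE_BYTES * TOY_STRUCT_PAGE_BYTES;
}

/*
 * Nonzero if the memmap is smaller than one subsection, i.e. the first
 * subsection of the block would mix vmemmap and !vmemmap pages.
 */
static int vmemmap_smaller_than_subsection(unsigned long block_bytes)
{
	return vmemmap_bytes(block_bytes) < TOY_SUBSECTION_BYTES;
}
```

Under these assumptions a 128 MiB block needs exactly 2 MiB of memmap (one full subsection), while a 64 MiB block needs only 1 MiB and would therefore share a subsection with !vmemmap pages.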



There is no way to know that the memmap of these pages was initialized and
is of value.

2. You heavily fragment the zone layout, although you might otherwise end up with
consecutive zones (e.g., when onlining all hotplugged memory as movable).

What would be the consequences?

IIRC, set_zone_contiguous() will leave zone->contiguous = false.

This, in turn, will force pageblock_pfn_to_page() onto the slow path, making page isolation a bit slower.

Not a deal breaker, but obviously something where Oscar's original patch can do better.
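
As a rough illustration of why the layout ends up non-contiguous, here is a toy user-space model (not kernel code; all types and names are hypothetical). It treats the zone span as an array of pageblocks and marks the zone contiguous only if every pageblock in the span belongs to it, which is the gist of what set_zone_contiguous() checks:

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: who owns a given pageblock within the zone span. */
enum toy_owner {
	TOY_NOT_IN_ZONE,	/* e.g. vmemmap pages left out of the zone */
	TOY_IN_ZONE,		/* pageblock onlined to the zone */
};

/*
 * Return true only if every pageblock in [start, end) belongs to the
 * zone. A single foreign pageblock inside the span is enough to make
 * the zone non-contiguous.
 */
static bool toy_zone_is_contiguous(const enum toy_owner *blocks,
				   size_t start, size_t end)
{
	for (size_t i = start; i < end; i++) {
		if (blocks[i] != TOY_IN_ZONE)
			return false;
	}
	return true;
}
```

In this model, two adjacent memory blocks that each start with pageblocks not assigned to the zone give a span like { in, in, not, in, in }, so the check fails even though all usable memory went to the same zone.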


I still have to think again about other issues (I remember most issues we discussed back then were related to having the vmemmap only within the same memory block). I think 2) might be tolerable, although unfortunate. Regarding 1), we'll have to dive into more details.

--
Thanks,

David / dhildenb