Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug

From: David Hildenbrand
Date: Wed Feb 19 2025 - 04:09:03 EST

In reply to: Gregory Price: "Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug"
Next in thread: Gregory Price: "Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug"

What's mildly confusing is for pages used for altmap to be accounted for
as if it's an allocation in vmstat - but for that capacity to be chopped
out of the memory-block (it "makes sense" it's just subtly misleading).

Would the following make it better or worse?

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 4765f2928725c..17a4432427051 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -237,9 +237,12 @@ static int memory_block_online(struct memory_block *mem)
 	 * Account once onlining succeeded. If the zone was unpopulated, it is
 	 * now already properly populated.
 	 */
-	if (nr_vmemmap_pages)
+	if (nr_vmemmap_pages) {
 		adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
 					  nr_vmemmap_pages);
+		adjust_managed_page_count(pfn_to_page(start_pfn),
+					  nr_vmemmap_pages);
+	}
 	mem->zone = zone;
 	mem_hotplug_done();
@@ -273,17 +276,23 @@ static int memory_block_offline(struct memory_block *mem)
 		nr_vmemmap_pages = mem->altmap->free;
 	mem_hotplug_begin();
-	if (nr_vmemmap_pages)
+	if (nr_vmemmap_pages) {
 		adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
 					  -nr_vmemmap_pages);
+		adjust_managed_page_count(pfn_to_page(start_pfn),
+					  -nr_vmemmap_pages);
+	}
 	ret = offline_pages(start_pfn + nr_vmemmap_pages,
 			    nr_pages - nr_vmemmap_pages, mem->zone, mem->group);
 	if (ret) {
 		/* offline_pages() failed. Account back. */
-		if (nr_vmemmap_pages)
+		if (nr_vmemmap_pages) {
 			adjust_present_page_count(pfn_to_page(start_pfn),
 						  mem->group, nr_vmemmap_pages);
+			adjust_managed_page_count(pfn_to_page(start_pfn),
+						  nr_vmemmap_pages);
+		}
 		goto out;
 	}

Then, it would look "just like allocated memory" from that node/zone.

As if the memmap was allocated immediately when we onlined the memory
(see below).

I thought the system was saying I'd allocated memory (from the 'free'
capacity) instead of just reducing capacity.

The question is whether you want that memory to be hidden from MemTotal
(carveout?) or treated just like allocated memory (allocation?).

If you treat the memmap as "just a memory allocation after early boot",
with "memmap_on_memory" telling you to allocate that memory from the
hotplugged memory instead of the buddy, then "carveout" might be more of
an internal implementation detail to achieve that memory allocation.

Stupid question - it sorta seems like you'd want this as the default
setting for driver-managed hotplug memory blocks, but I suppose for
very small blocks there are problems (as described in the docs).

The issue is that it is per-memblock. So you'll never have 1 GiB ranges
of consecutive usable memory (e.g., 1 GiB hugetlb page).

That makes sense, I had not considered this. Although it only applies
for small blocks - which is basically an indictment of this suggestion:

https://lore.kernel.org/linux-mm/20250127153405.3379117-1-gourry@xxxxxxxxxx/

So I'll have to consider this and whether this should be a default.
This is probably enough to NAK it entirely.

... that said ....

Interestingly, when I tried allocating 1GiB hugetlb pages on a dax device
in ZONE_MOVABLE (without memmap_on_memory) - the allocation fails silently
regardless of block size (tried both 2GB and 256MB). I can't find a reason
why this would be the case in the existing documentation.

Right, it only currently works with ZONE_NORMAL, because 1 GiB pages are
considered unmovable in practice (try reliably finding a 1 GiB area to
migrate the memory to during memory unplug ... when these hugetlb things are
unswappable etc.).

I documented it under https://www.kernel.org/doc/html/latest/admin-guide/mm/memory-hotplug.html

"Gigantic pages are unmovable, resulting in user space consuming a lot of unmovable memory."

If we ever support THP in that size range, we might consider them movable
because we can just split them or swap them out when allocating a migration
target fails.

(note: hugepage migration is enabled in build config, so it's not that)

If I enable one block (256MB) into ZONE_NORMAL, and the remainder in
movable (with memmap_on_memory=n) the allocation still fails, and:

nr_slab_unreclaimable 43

in node1/vmstat - where previously there was nothing.

Onlining the dax devices into ZONE_NORMAL successfully allowed 1GiB huge
pages to allocate.

This used the /sys/bus/node/devices/node1/hugepages/* interfaces to test.

Using the /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages with
interleave mempolicy - all hugepages end up on ZONE_NORMAL.

(v6.13 base kernel)

This behavior is *curious* to say the least. Not sure if bug, or some
nuance missing from the documentation - but certainly glad I caught it.

See above :)

I thought we had that? See MHP_MEMMAP_ON_MEMORY set by dax/kmem.

IIRC, the global toggle must be enabled for the driver option to be considered.

Oh, well, that's an extra layer I missed. So there's:

build:
CONFIG_MHP_MEMMAP_ON_MEMORY=y
CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE=y
global:
/sys/module/memory_hotplug/parameters/memmap_on_memory
device:
/sys/bus/dax/devices/dax0.0/memmap_on_memory

And looking at it - this does seem to be the default for dax.

So I can drop the existing `nuance movable/memmap` section and just
replace it with the hugetlb subtleties x_x.

I appreciate the clarifications here, sorry for the incorrect info and
the increasing confusion.

No worries. If you have ideas on what to improve in the memory hotplug
docs, please let me know!

--
Cheers,

David / dhildenb
