Re: [RFC] hugetlb: add memory-hotplug notifier to only allocate for online nodes
From: David Hildenbrand (Red Hat)
Date: Thu Nov 06 2025 - 05:01:24 EST
On 06.11.25 09:56, Swaraj Gaikwad wrote:
> This patch is an RFC on a proposed change to the hugetlb cgroup subsystem’s
> css allocation function.
> The existing hugetlb_cgroup_css_alloc() uses for_each_node() to allocate
> nodeinfo for all nodes, including those which are not online yet
> (or never will be). This can waste considerable memory on large-node systems.
> The documentation already lists this as a TODO.
We're talking about the
	kzalloc_node(sizeof(struct hugetlb_cgroup_per_node), GFP_KERNEL, node_to_alloc);

$ pahole mm/hugetlb_cgroup.o
struct hugetlb_cgroup_per_node {
	long unsigned int          usage[2];             /*     0    16 */

	/* size: 16, cachelines: 1, members: 1 */
	/* last cacheline: 16 bytes */
};
16 bytes on x86_64. So nobody should care here.
Of course, it depends on HUGE_MAX_HSTATE.
IIRC only HUGE_MAX_HSTATE goes crazy on that with effectively 15 entries.
15*8 = 120, so ~128 bytes per node.
So with 1024 nodes we would be allocating roughly 128 KiB.
And given that this is for each cgroup (right?), I assume it can add up.
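
To make the arithmetic above easy to replay, here is a minimal user-space sketch (plain C, not kernel code). The function names and parameters are made up for illustration: hstate_entries stands in for HUGE_MAX_HSTATE and entry_size for sizeof(unsigned long) on the target architecture. It ignores any rounding the slab allocator may apply.

```c
#include <stddef.h>

/*
 * Back-of-the-envelope model only: hypothetical helpers, not kernel API.
 * hstate_entries stands in for HUGE_MAX_HSTATE, entry_size for
 * sizeof(unsigned long) on the architecture in question.
 */
size_t per_node_bytes(size_t hstate_entries, size_t entry_size)
{
	/* One usage[] counter per hstate. */
	return hstate_entries * entry_size;
}

/* Bytes a single cgroup spends if every possible node gets an entry. */
size_t per_cgroup_bytes(size_t hstate_entries, size_t entry_size,
			size_t possible_nodes)
{
	return per_node_bytes(hstate_entries, entry_size) * possible_nodes;
}
```

With 2 entries of 8 bytes this gives the 16 bytes pahole reports; with 15 entries it gives 120 bytes per node, and 120 KiB per cgroup across 1024 nodes, before any allocator rounding.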
> Proposed Change:
> Introduce a memory hotplug notifier that listens for MEM_ONLINE
> events. When a node becomes online, we call the same allocation function,
> but using for_each_online_node() instead of for_each_node(). This means
> memory is only allocated for nodes which are online, thus reducing waste.
We have a NODE_ADDING_FIRST_MEMORY now, I'd assume that is more suitable?
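
As a rough illustration of the idea, here is a small user-space model (plain C, not kernel code, and not a proposed implementation). Every name in it is hypothetical: model_cgroup_init() plays the role of a css_alloc that only walks online nodes, and model_hotplug_node() plays the role of a callback that runs when a node is about to gain its first memory.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_MAX_NODES 8	/* hypothetical node count for the model */
#define MODEL_HSTATES   2	/* mirrors usage[2] from the pahole output */

struct model_per_node {
	unsigned long usage[MODEL_HSTATES];
};

struct model_cgroup {
	struct model_per_node *nodeinfo[MODEL_MAX_NODES];
};

/* Stand-in for the kernel's online-node mask. */
bool model_node_online[MODEL_MAX_NODES];

/* Allocate per-node info for one node; no-op if it already exists. */
int model_alloc_node(struct model_cgroup *cg, int nid)
{
	if (cg->nodeinfo[nid])
		return 0;
	cg->nodeinfo[nid] = calloc(1, sizeof(*cg->nodeinfo[nid]));
	return cg->nodeinfo[nid] ? 0 : -1;
}

/* Release everything; safe to call after a partially failed init. */
void model_cgroup_free(struct model_cgroup *cg)
{
	for (int nid = 0; nid < MODEL_MAX_NODES; nid++) {
		free(cg->nodeinfo[nid]);
		cg->nodeinfo[nid] = NULL;
	}
}

/* css_alloc analogue: only allocate for nodes that are online right now. */
int model_cgroup_init(struct model_cgroup *cg)
{
	for (int nid = 0; nid < MODEL_MAX_NODES; nid++)
		cg->nodeinfo[nid] = NULL;
	for (int nid = 0; nid < MODEL_MAX_NODES; nid++) {
		if (model_node_online[nid] && model_alloc_node(cg, nid)) {
			model_cgroup_free(cg);
			return -1;
		}
	}
	return 0;
}

/*
 * Hotplug analogue: allocate first, and only mark the node online if the
 * allocation succeeded, so a failure can still veto the operation.
 */
int model_hotplug_node(struct model_cgroup *cg, int nid)
{
	if (model_alloc_node(cg, nid))
		return -1;
	model_node_online[nid] = true;
	return 0;
}

/* Helper for inspection: how many nodes currently have nodeinfo. */
int model_count_allocated(const struct model_cgroup *cg)
{
	int count = 0;

	for (int nid = 0; nid < MODEL_MAX_NODES; nid++)
		count += cg->nodeinfo[nid] != NULL;
	return count;
}
```

The model handles a single cgroup. A real notifier would presumably have to cover every existing hugetlb cgroup and cope with racing cgroup creation, which this sketch does not attempt.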
> Feedback Requested:
> - Where in the codebase (which file or section) is it most appropriate to
> implement and register the memory hotplug notifier for this subsystem?
I'd assume you would have to register in hugetlb_cgroup_css_alloc() and
free in hugetlb_cgroup_css_free().
> - Are there best practices or patterns for handling the notifier lifecycle,
> especially for unregistering during cgroup or subsystem teardown?
None that I can think of :)
> - What are the standard methods or tools to test memory hotplug scenarios
> for cgroups? Are there ways to reliably trigger node online/offline events
> in a development environment?
You can use QEMU to hotplug memory (pc-dimm device) to a CPU+memory-less node and
then remove it again. If you disable automatic memory onlining, you should be able to
trigger this multiple times without any issues.
> - Are there existing test cases or utilities in the kernel tree that would help
> to verify correct behavior of this change?
Don't think so.
> - Any suggestions for implementation improvements or cleaner API usage?
I'd assume you'd want to look into NODE_ADDING_FIRST_MEMORY.
--
Cheers
David