Re: [RFC PATCH] mm: hugetlb: remove __GFP_THISNODE flag when dissolving the old hugetlb

From: Baolin Wang
Date: Thu Feb 01 2024 - 20:35:55 EST




On 2/1/2024 11:27 PM, Michal Hocko wrote:
On Thu 01-02-24 21:31:13, Baolin Wang wrote:
Since commit 369fa227c219 ("mm: make alloc_contig_range handle free
hugetlb pages"), the alloc_contig_range() can handle free hugetlb pages
by allocating a new fresh hugepage, and replacing the old one in the
free hugepage pool.

However, our customers can still see the failure of alloc_contig_range()
when seeing a free hugetlb page. The reason is that, there are few memory
on the old hugetlb page's node, and it can not allocate a fresh hugetlb
page on the old hugetlb page's node in isolate_or_dissolve_huge_page() with
setting __GFP_THISNODE flag. This makes sense to some degree.

Later, the commit ae37c7ff79f1 (" mm: make alloc_contig_range handle
in-use hugetlb pages") handles the in-use hugetlb pages by isolating it
and doing migration in __alloc_contig_migrate_range(), but it can allow
fallbacking to other numa node when allocating a new hugetlb in
alloc_migration_target().

This introduces inconsistency to handling free and in-use hugetlb.
Considering the CMA allocation and memory hotplug relying on the
alloc_contig_range() are important in some scenarios, as well as keeping
the consistent hugetlb handling, we should remove the __GFP_THISNODE flag
in isolate_or_dissolve_huge_page() to allow fallbacking to other numa node,
which can solve the failure of alloc_contig_range() in our case.

I do agree that the inconsistency is not really good but I am not sure
dropping __GFP_THISNODE is the right way forward. Breaking pre-allocated
per-node pools might result in unexpected failures when node bound
workloads doesn't get what is asssumed available. Keep in mind that our
user APIs allow to pre-allocate per-node pools separately.

Yes, I agree, that is also what I concered. But sometimes users don't care about the distribution of per-node hugetlb, instead they are more concerned about the success of cma allocation or memory hotplug.

The in-use hugetlb is a very similar case. While having a temporarily
misplaced page doesn't really look terrible once that hugetlb page is
released back into the pool we are back to the case above. Either we
make sure that the node affinity is restored later on or it shouldn't be
migrated to a different node at all.

Agree. So how about below changing?
(1) disallow fallbacking to other nodes when handing in-use hugetlb, which can ensure consistent behavior in handling hugetlb.
(2) introduce a new sysctl (may be named as "hugetlb_allow_fallback_nodes") for users to control to allow fallbacking, that can solve the CMA or memory hotplug failures that users are more concerned about.