Re: [PATCH] Respect mempolicy when calculating surplus huge pages.

From: Joshua Hahn

Date: Tue Jun 02 2026 - 11:26:25 EST

On Wed, 27 May 2026 16:48:46 -0600 Charles Haithcock <chaithco@xxxxxxxxxx> wrote:

> Presently, when calculating how many huge pages are needed when
> reserving surplus huge pages, the global count of free huge pages
> are used. When reserving with a mempolicy, the global count of free huge
> pages is used even if some/all of those free huge pages are on numa
> nodes outside of the mempolicy. Fix it so free huge pages only on nodes
> within the mempolicy are considered.

Hello Charles, thank you for the patch!

I just wanted to add that this seems to be a known issue.
hugetlb_acct_memory (the only caller of gather_surplus_pages) has the
following comment block:

/*
* When cpuset is configured, it breaks the strict hugetlb page
* reservation as the accounting is done on a global variable. Such
* reservation is completely rubbish in the presence of cpuset because
* the reservation is not checked against page availability for the
* current cpuset. Application can still potentially OOM'ed by kernel
* with lack of free htlb page in cpuset that the task is in.
* Attempt to enforce strict accounting with cpuset is almost
* impossible (or too ugly) because cpuset is too fluid that
* task or memory node can be dynamically moved between cpusets.
*
* The change of semantics for shared hugetlb mapping with cpuset is
* undesirable. However, in order to preserve some of the semantics,
* we fall back to check against current free page availability as
* a best attempt and hopefully to minimize the impact of changing
* semantics that cpuset has.
*
* Apart from cpuset, we also have memory policy mechanism that
* also determines from which node the kernel will allocate memory
* in a NUMA system. So similar to cpuset, we also should consider
* the memory policy of the current task. Similar to the description
* above.

So it would appear that getting an exact number of pages to allocate,
and ensuring that nothing changes about the reservation or the nodes
those reservations actually land on, is a lot more difficult. But I think
we can do a bit better.

FWIW, I think over-allocating is actually not fatal (although over-allocating
by a lot is obviously not desirable) since we free all the unused hugetlb
pages at the end of gather_surplus_pages. I wonder if an approach like this
could work:

@@ -2260,7 +2277,8 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 		alloc_nodemask = cpuset_current_mems_allowed;

 	lockdep_assert_held(&hugetlb_lock);
-	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
+	needed = max(delta - allowed_mems_nr(h),
+		     (h->resv_huge_pages + delta) - h->free_huge_pages);
 	if (needed <= 0) {
 		h->resv_huge_pages += delta;
 		return 0;
@@ -2294,8 +2312,8 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	 * because either resv_huge_pages or free_huge_pages may have changed.
 	 */
 	spin_lock_irq(&hugetlb_lock);
-	needed = (h->resv_huge_pages + delta) -
-			(h->free_huge_pages + allocated);
+	needed = max((h->resv_huge_pages + delta) - h->free_huge_pages,
+		     delta - allowed_mems_nr(h)) - allocated;
 	if (needed > 0) {
 		if (alloc_ok)
 			goto retry;

So we compute "needed" from the mempolicy's perspective, compare it to the
global "needed", and take whichever is larger. Since we are taking a max, this
should only ever make it more likely that mempolicy-bound hugetlb page usage
actually succeeds, even though we still can't make guarantees, since
a free page on our node may be taken by a different reservation later.
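
To make the arithmetic concrete, here is a rough userspace model of the two
calculations. It is only a sketch: struct hstate_model, the model_* helpers,
MODEL_MAX_NODES and the plain bitmask are all made up for illustration (the
kernel uses struct hstate and nodemask_t), and it assumes allowed_mems_nr()
returns the number of free huge pages on the nodes the task is allowed to use.

```c
#define MODEL_MAX_NODES 4	/* hypothetical node count for this model */

/* Hypothetical stand-in for the struct hstate fields used in the diff. */
struct hstate_model {
	long resv_huge_pages;				/* reserved, all nodes */
	long free_huge_pages;				/* free, all nodes */
	long free_huge_pages_node[MODEL_MAX_NODES];	/* free, per node */
};

/*
 * Models allowed_mems_nr(): free huge pages on nodes permitted by the
 * task's cpuset/mempolicy. A bitmask replaces the kernel's nodemask_t.
 */
static long model_allowed_mems_nr(const struct hstate_model *h,
				  unsigned int allowed_mask)
{
	long nr = 0;

	for (int node = 0; node < MODEL_MAX_NODES; node++)
		if (allowed_mask & (1u << node))
			nr += h->free_huge_pages_node[node];
	return nr;
}

/* Current behaviour: only the global counters are consulted. */
static long model_needed_global(const struct hstate_model *h, long delta)
{
	return (h->resv_huge_pages + delta) - h->free_huge_pages;
}

/* Proposed behaviour: the larger of the policy view and the global view. */
static long model_needed_proposed(const struct hstate_model *h, long delta,
				  unsigned int allowed_mask)
{
	long policy = delta - model_allowed_mems_nr(h, allowed_mask);
	long global = model_needed_global(h, delta);

	return policy > global ? policy : global;
}
```

With 10 free pages that all sit on node 1, nothing reserved, and a task bound
to node 0 asking for delta = 4, the global calculation gives 4 - 10 = -6 and
allocates nothing, while the max gives 4 - 0 = 4. If the task is allowed to
use node 1, both views agree and the result is unchanged from today.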

Let me know what you think. Thanks again!
Joshua