Re: [PATCH v2] Respect mempolicy when calculating surplus huge pages.
From: Joshua Hahn
Date: Tue Jun 23 2026 - 15:44:25 EST
> Presently, when calculating how many huge pages are needed when
> reserving surplus huge pages, the global count of free huge pages
> is used. When reserving with a mempolicy, the global count of free huge
> pages is used even if some/all of those free huge pages are on NUMA
> nodes outside of the mempolicy.
>
> Reserving surplus huge pages is ultimately best effort even without a
> mempolicy. Restrictions from cpusets and mempolicies further complicate
> calculating correct numbers of surplus huge pages to reserve and
> maintaining which nodes those reservations belong to (see the comment in
> `hugetlb_acct_memory`).
>
> However, we can do a little better when reserving surplus huge pages
> with a mempolicy. This patch changes how to calculate the necessary
> amount of surplus huge pages to reserve by considering the max of either
> the amount of free huge pages on nodes in the mempolicy or the global
> amount of free huge pages. We may still attempt to reserve huge pages
> outside the mempolicy; however, we end up being more likely to reserve
> from nodes in the mempolicy.
>
> Signed-off-by: Charles Haithcock <chaithco@xxxxxxxxxx>
> ---
>
> - v1: Modified `needed` calculation to use `allowed_mems_nr(h)` in order
> to consider free hugetlb pages in our mempolicy.
> - v2: Folded in Joshua Hahn's recommendation [1] to further modify
> `needed` calculation to take the max of either the available hugetlb
> pages in the mempolicy or the globally available hugetlb pages. Allows
> allocations to prioritize nodes in the mempolicy but can still fall
> back to off-node allocations. Also added selftests to check only for
> the edge case which caused this to initially be reported and sanity
> checks.
>
> [1] https://lore.kernel.org/all/20260602152022.2673803-1-joshua.hahnjy@xxxxxxxxx/
>
> mm/hugetlb.c | 42 +-
> tools/testing/selftests/mm/Makefile | 3 +
> .../selftests/mm/hugetlb_surplus_mempolicy.c | 472 ++++++++++++++++++
> tools/testing/selftests/mm/run_vmtests.sh | 1 +
> 4 files changed, 498 insertions(+), 20 deletions(-)
> create mode 100644 tools/testing/selftests/mm/hugetlb_surplus_mempolicy.c
Hi Charles,
Thanks for following up with a v2! The change to hugetlb.c looks good to
me; I left a small stylistic nit below.
One request: could we split this commit into two, one for the
mm/hugetlb.c change and one for the selftests and the related
script/Makefile changes? That way, reviewers can review and sign off on
the hugetlb change separately from the selftests being introduced.
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f24bf49be0..bd97f0f434 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2255,6 +2255,23 @@ static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
> return NULL;
> }
>
> +static unsigned int allowed_mems_nr(struct hstate *h)
> +{
> + int node;
> + unsigned int nr = 0;
> + nodemask_t *mbind_nodemask;
> + unsigned int *array = h->free_huge_pages_node;
> + gfp_t gfp_mask = htlb_alloc_mask(h);
> +
> + mbind_nodemask = policy_mbind_nodemask(gfp_mask);
> + for_each_node_mask(node, cpuset_current_mems_allowed) {
> + if (!mbind_nodemask || node_isset(node, *mbind_nodemask))
> + nr += array[node];
> + }
> +
> + return nr;
> +}
> +
> /*
> * Increase the hugetlb pool such that it can accommodate a reservation
> * of size 'delta'.
> @@ -2277,7 +2294,8 @@ static int gather_surplus_pages(struct hstate *h, long delta)
> alloc_nodemask = cpuset_current_mems_allowed;
>
> lockdep_assert_held(&hugetlb_lock);
> - needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
> + needed = max((long) (delta - allowed_mems_nr(h)),
> + (long) ((h->resv_huge_pages + delta) - h->free_huge_pages));
> if (needed <= 0) {
> h->resv_huge_pages += delta;
> return 0;
> @@ -2311,8 +2329,9 @@ static int gather_surplus_pages(struct hstate *h, long delta)
> * because either resv_huge_pages or free_huge_pages may have changed.
> */
> spin_lock_irq(&hugetlb_lock);
> - needed = (h->resv_huge_pages + delta) -
> - (h->free_huge_pages + allocated);
> + needed = max((long) ((delta - allowed_mems_nr(h)) - allocated),
> + (long) ((h->resv_huge_pages + delta) -
> + (h->free_huge_pages + allocated)));
What if, instead of casting each argument separately, we use
max_t(long, (...), (...))? I think that would make this part look
a bit cleaner :-)
The logic itself looks good to me. And thanks for catching the
"+ allocated" part; I think I missed that in my original response to v1.
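For illustration, here is a rough userspace sketch of how the recheck
might read with max_t(). It is not the kernel code: max_t below is a
simplified stand-in for the kernel macro, and needed_after_alloc() with
its parameters is a hypothetical helper that only mirrors the types of
the fields involved ("allowed" stands for the value returned by
allowed_mems_nr(h)).

```c
/*
 * Simplified userspace stand-in for the kernel's max_t(); an assumption
 * for this sketch, not the real macro. Both arguments are cast to 'type'
 * before being compared.
 */
#define max_t(type, a, b) ((type)(a) > (type)(b) ? (type)(a) : (type)(b))

/*
 * Hypothetical helper mirroring the recheck in gather_surplus_pages().
 * resv and free_pages play the role of h->resv_huge_pages and
 * h->free_huge_pages; allowed plays the role of allowed_mems_nr(h).
 */
static long needed_after_alloc(unsigned long resv, unsigned long free_pages,
			       unsigned int allowed, long delta,
			       long allocated)
{
	/* One max_t(long, ...) replaces the two per-argument (long) casts. */
	return max_t(long,
		     (delta - allowed) - allocated,
		     (resv + delta) - (free_pages + allocated));
}
```

With resv = 0, free_pages = 10, allowed = 2, delta = 4 and allocated = 0,
the global term is -6 and the mempolicy term is 2, so the result is 2;
once allocated reaches 2 the result drops to 0.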
I'll take a look at the selftests in the future, just wanted to get
these comments out first.
Thanks again, I hope you have a great day!
Joshua
> if (needed > 0) {
> if (alloc_ok)
> goto retry;