Re: [PATCH v2 4/7] mm/hugetlb: Clean up map/global resv accounting when allocate
From: Ackerley Tng
Date: Mon Jan 13 2025 - 17:58:06 EST
Hi Peter,
I have an alternative that is based off your patches.
I like the overall refactoring but was a little uncomfortable with
having to add a custom enum map_chg_state. The custom enum centralizes
the possible states, but the states still need to be interpreted again
throughout the function to take action (e.g. having checks like if
(map_chg != MAP_CHG_ENFORCED) and if (map_chg == MAP_CHG_NEEDED)), and I
feel that shifts the problem later to understanding the interpretation
of the states.
I switched back to something close to avoid_reserve, but improved on
that name with your comments, by calling it bypass_vma_reservations. I
feel that calling it cow_from_owner kind of lets the implementation
detail of the CoW use-case bleed into this function.
"bypass_vma_reservations" is named so that it requests this function,
alloc_hugetlb_folio(), to bypass the resv_map. All parts of
alloc_hugetlb_folio() that update the resv_map are guarded by the
bypass_vma_reservations flag.
This alternative proposes to use the following booleans local to
alloc_hugetlb_folio().
1. vma_reservation_exists
vma_reservation_exists represents whether a reservation exists in the
resv_map and is used. vma_reservation_exists defaults to false, and when
bypass_vma_reservations is not requested, the resv_map will be consulted
to see if vma_reservation_exists.
vma_reservation_exists is also used to store the result of the initial
resv_map check, to correctly fix up a race later on.
2. debit_subpool
If a vma_reservation_exists, then this allocation has been reserved in
both the resv_map and subpool, so skip debiting the subpool.
If alloc_hugetlb_folio() was requested to bypass_vma_reservations, the
subpool should still be charged, so debit_subpool is set.
And if debit_subpool, then proceed to do hugepage_subpool_get_pages()
and set up subpool_reservation_exists.
Later on, debit_subpool is used for the matching cleanup in the error
case.
3. subpool_reservation_exists
subpool_reservation_exists represents whether a reservation exists in
the resv_map and is used, analogous to vma_reservation_exists, and the
subpool is only checked if debit_subpool is set.
If debit_subpool is not set, vma_reservation_exists determines whether a
subpool_reservation_exists.
subpool_reservation_exists is then used to guard decrementing
h->resv_huge_pages, which fixes the bug you found.
4. charge_cgroup_rsvd
This has the same condition as debit_subpool, but has a separate
variable for readability.
Later on, charge_cgroup_rsvd is used to determine whether to commit the
charge, or whether to fix up the charge in error cases.
I also refactored dequeue_hugetlb_folio_vma() to
dequeue_hugetlb_folio_with_mpol() to align with
alloc_buddy_hugetlb_folio_with_mpol() to avoid passing gbl_chg into the
function.
gbl_chg is interpreted in alloc_hugetlb_folio() instead. If
subpool_reservation_exists, then try to get a folio by dequeueing it. If
a subpool reservation does not exist, make sure there are available
pages before dequeueing.
If there was no folio from dequeueing for whatever reason, allocate a
new folio.
This could probably be a separate patch but I'd like to hear your
thoughts before doing integration/cleaning up.
These changes have been tested with your reproducer, and here's the test
output from libhugetlbfs test cases:
********** TEST SUMMARY
* 2M
* 32-bit 64-bit
* Total testcases: 82 85
* Skipped: 9 9
* PASS: 72 69
* FAIL: 0 0
* Killed by signal: 1 7
* Bad configuration: 0 0
* Expected FAIL: 0 0
* Unexpected PASS: 0 0
* Test not present: 0 0
* Strange test result: 0 0
**********
Ackerley
---
mm/hugetlb.c | 186 ++++++++++++++++++++++++++-------------------------
1 file changed, 94 insertions(+), 92 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6a0ea28f5bac..2cd588d35984 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1333,9 +1333,9 @@ static unsigned long available_huge_pages(struct hstate *h)
return h->free_huge_pages - h->resv_huge_pages;
}
-static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
- struct vm_area_struct *vma,
- unsigned long address, long gbl_chg)
+static struct folio *dequeue_hugetlb_folio_with_mpol(struct hstate *h,
+ struct vm_area_struct *vma,
+ unsigned long address)
{
struct folio *folio = NULL;
struct mempolicy *mpol;
@@ -1343,13 +1343,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
nodemask_t *nodemask;
int nid;
- /*
- * gbl_chg==1 means the allocation requires a new page that was not
- * reserved before. Making sure there's at least one free page.
- */
- if (gbl_chg && !available_huge_pages(h))
- goto err;
-
gfp_mask = htlb_alloc_mask(h);
nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
@@ -1367,9 +1360,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
mpol_cond_put(mpol);
return folio;
-
-err:
- return NULL;
}
/*
@@ -2943,91 +2933,83 @@ int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)
return ret;
}
-typedef enum {
- /*
- * For either 0/1: we checked the per-vma resv map, and one resv
- * count either can be reused (0), or an extra needed (1).
- */
- MAP_CHG_REUSE = 0,
- MAP_CHG_NEEDED = 1,
- /*
- * Cannot use per-vma resv count can be used, hence a new resv
- * count is enforced.
- *
- * NOTE: This is mostly identical to MAP_CHG_NEEDED, except
- * that currently vma_needs_reservation() has an unwanted side
- * effect to either use end() or commit() to complete the
- * transaction. Hence it needs to differenciate from NEEDED.
- */
- MAP_CHG_ENFORCED = 2,
-} map_chg_state;
-
/*
- * NOTE! "cow_from_owner" represents a very hacky usage only used in CoW
- * faults of hugetlb private mappings on top of a non-page-cache folio (in
- * which case even if there's a private vma resv map it won't cover such
- * allocation). New call sites should (probably) never set it to true!!
- * When it's set, the allocation will bypass all vma level reservations.
+ * NOTE! "bypass_vma_reservations" represents a very niche usage, when CoW
+ * faults of hugetlb private mappings need to allocate a new page on top of a
+ * non-page-cache folio. In this situation, even if there's a private vma resv
+ * map, the resv map must be bypassed. New call sites should (probably) never
+ * set bypass_vma_reservations to true!!
*/
struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
- unsigned long addr, bool cow_from_owner)
+ unsigned long addr, bool bypass_vma_reservations)
{
struct hugepage_subpool *spool = subpool_vma(vma);
struct hstate *h = hstate_vma(vma);
struct folio *folio;
- long retval, gbl_chg;
- map_chg_state map_chg;
int ret, idx;
struct hugetlb_cgroup *h_cg = NULL;
gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
+ bool vma_reservation_exists = false;
+ bool subpool_reservation_exists;
+ bool debit_subpool;
+ bool charge_cgroup_rsvd;
idx = hstate_index(h);
- /* Whether we need a separate per-vma reservation? */
- if (cow_from_owner) {
- /*
- * Special case! Since it's a CoW on top of a reserved
- * page, the private resv map doesn't count. So it cannot
- * consume the per-vma resv map even if it's reserved.
- */
- map_chg = MAP_CHG_ENFORCED;
- } else {
+ if (!bypass_vma_reservations) {
/*
* Examine the region/reserve map to determine if the process
- * has a reservation for the page to be allocated. A return
- * code of zero indicates a reservation exists (no change).
- */
- retval = vma_needs_reservation(h, vma, addr);
- if (retval < 0)
+ * has a reservation for the page to be allocated and debit the
+ * reservation. If npages_req == 0, a reservation exists and is
+ * used. If npages_req > 0, a reservation has to be taken either
+ * from the subpool or global pool.
+ */
+ int npages_req = vma_needs_reservation(h, vma, addr);
+ if (npages_req < 0)
return ERR_PTR(-ENOMEM);
- map_chg = retval ? MAP_CHG_NEEDED : MAP_CHG_REUSE;
+
+ vma_reservation_exists = npages_req == 0;
}
/*
- * Whether we need a separate global reservation?
+ * If no vma reservation exists, debit the subpool.
*
+ * Even if we were requested to bypass_vma_reservations, debit the
+ * subpool - the subpool still has to be charged for this allocation.
+ */
+ debit_subpool = !vma_reservation_exists || bypass_vma_reservations;
+
+ /*
* Processes that did not create the mapping will have no
* reserves as indicated by the region/reserve map. Check
* that the allocation will not exceed the subpool limit.
* Or if it can get one from the pool reservation directly.
*/
- if (map_chg) {
- gbl_chg = hugepage_subpool_get_pages(spool, 1);
- if (gbl_chg < 0)
+ if (debit_subpool) {
+ int npages_req = hugepage_subpool_get_pages(spool, 1);
+ if (npages_req < 0)
goto out_end_reservation;
- } else {
+
/*
- * If we have the vma reservation ready, no need for extra
- * global reservation.
- */
- gbl_chg = 0;
+ * npages_req == 0 indicates a reservation exists for the
+ * allocation in the subpool and can be used. npages_req > 0
+ * indicates that a reservation must be taken from the global
+ * pool.
+ */
+ subpool_reservation_exists = npages_req == 0;
+ } else {
+ /* A vma reservation implies having a subpool reservation. */
+ subpool_reservation_exists = vma_reservation_exists;
}
/*
- * If this allocation is not consuming a per-vma reservation,
- * charge the hugetlb cgroup now.
+ * If no vma reservation exists, charge the cgroup's reserved quota.
+ *
+ * Even if we were requested to bypass_vma_reservations, the cgroup
+ * still has to be charged for this allocation.
*/
- if (map_chg) {
+ charge_cgroup_rsvd = !vma_reservation_exists || bypass_vma_reservations;
+ if (charge_cgroup_rsvd) {
ret = hugetlb_cgroup_charge_cgroup_rsvd(
idx, pages_per_huge_page(h), &h_cg);
if (ret)
@@ -3039,12 +3021,23 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
goto out_uncharge_cgroup_reservation;
spin_lock_irq(&hugetlb_lock);
- /*
- * glb_chg is passed to indicate whether or not a page must be taken
- * from the global free pool (global change). gbl_chg == 0 indicates
- * a reservation exists for the allocation.
- */
- folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
+
+ if (subpool_reservation_exists) {
+ folio = dequeue_hugetlb_folio_with_mpol(h, vma, addr);
+ } else {
+ /*
+ * Since no subpool_reservation_exists, the allocation requires
+ * a new page that was not reserved before. Only dequeue if
+ * there are available pages.
+ */
+ if (available_huge_pages(h)) {
+ folio = dequeue_hugetlb_folio_with_mpol(h, vma, addr);
+ } else {
+ folio = NULL;
+ /* Fallthrough to allocate a new page. */
+ }
+ }
+
if (!folio) {
spin_unlock_irq(&hugetlb_lock);
folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
@@ -3057,19 +3050,17 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
}
/*
- * Either dequeued or buddy-allocated folio needs to add special
- * mark to the folio when it consumes a global reservation.
+ * If subpool_reservation_exists (and is used for this allocation),
+ * decrement resv_huge_pages to indicate that a reservation was used.
*/
- if (!gbl_chg) {
+ if (subpool_reservation_exists) {
folio_set_hugetlb_restore_reserve(folio);
h->resv_huge_pages--;
}
hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, folio);
- /* If allocation is not consuming a reservation, also store the
- * hugetlb_cgroup pointer on the page.
- */
- if (map_chg) {
+
+ if (charge_cgroup_rsvd) {
hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
h_cg, folio);
}
@@ -3078,25 +3069,30 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
hugetlb_set_folio_subpool(folio, spool);
- if (map_chg != MAP_CHG_ENFORCED) {
- /* commit() is only needed if the map_chg is not enforced */
- retval = vma_commit_reservation(h, vma, addr);
+ if (!bypass_vma_reservations) {
+ /*
+ * As long as vma reservations were not bypassed, we need to
+ * commit() to clear up any adds_in_progress in resv_map.
+ */
+ int ret = vma_commit_reservation(h, vma, addr);
/*
- * Check for possible race conditions. When it happens..
- * The page was added to the reservation map between
- * vma_needs_reservation and vma_commit_reservation.
- * This indicates a race with hugetlb_reserve_pages.
+ * If there is a discrepancy in reservation status between the
+ * time of vma_needs_reservation() and vma_commit_reservation(),
+ * then there the page must have been added to the reservation
+ * map between vma_needs_reservation() and
+ * vma_commit_reservation().
+ *
* Adjust for the subpool count incremented above AND
* in hugetlb_reserve_pages for the same page. Also,
* the reservation count added in hugetlb_reserve_pages
* no longer applies.
*/
- if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0)) {
+ if (unlikely(!vma_reservation_exists && ret == 0)) {
long rsv_adjust;
rsv_adjust = hugepage_subpool_put_pages(spool, 1);
hugetlb_acct_memory(h, -rsv_adjust);
- if (map_chg) {
+ if (charge_cgroup_rsvd) {
spin_lock_irq(&hugetlb_lock);
hugetlb_cgroup_uncharge_folio_rsvd(
hstate_index(h), pages_per_huge_page(h),
@@ -3124,14 +3120,14 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
out_uncharge_cgroup:
hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
out_uncharge_cgroup_reservation:
- if (map_chg)
+ if (charge_cgroup_rsvd)
hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
h_cg);
out_subpool_put:
- if (map_chg)
+ if (debit_subpool)
hugepage_subpool_put_pages(spool, 1);
out_end_reservation:
- if (map_chg != MAP_CHG_ENFORCED)
+ if (!bypass_vma_reservations)
vma_end_reservation(h, vma, addr);
return ERR_PTR(-ENOSPC);
}
@@ -5900,6 +5896,12 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
* be acquired again before returning to the caller, as expected.
*/
spin_unlock(vmf->ptl);
+
+ /*
+ * If this is a CoW from the owner of this page, we
+ * bypass_vma_reservations, since the reservation was already consumed
+ * when the hugetlb folio was first allocated before the fork happened.
+ */
new_folio = alloc_hugetlb_folio(vma, vmf->address, cow_from_owner);
if (IS_ERR(new_folio)) {
--
2.47.1.688.g23fc6f90ad-goog