[PATCH] mm: hugetlb: fix a race between memory-failure/soft_offline and gather_surplus_pages

From: Muchun Song
Date: Wed Apr 21 2021 - 02:03:42 EST


The possible bad scenario:

CPU0:                           CPU1:

                                gather_surplus_pages()
                                  page = alloc_surplus_huge_page()
memory_failure_hugetlb()
  get_hwpoison_page(page)
    __get_hwpoison_page(page)
      get_page_unless_zero(page)
                                  zero = put_page_testzero(page)
                                  VM_BUG_ON_PAGE(!zero, page)
                                  enqueue_huge_page(h, page)
  put_page(page)

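Because the memory-failure or soft_offline handlers can take an extra
reference on the page in this window, put_page_testzero() may not drop
the refcount to zero. We can then trigger VM_BUG_ON_PAGE and wrongly add
the page to the hugetlb pool list.

Fix this by removing the VM_BUG_ON_PAGE and enqueueing the page only
when put_page_testzero() succeeds.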

Signed-off-by: Muchun Song <songmuchun@xxxxxxxxxxxxx>
---
mm/hugetlb.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3476aa06da70..6c96332db34b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2145,17 +2145,14 @@ static int gather_surplus_pages(struct hstate *h, long delta)

 	/* Free the needed pages to the hugetlb pool */
 	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
-		int zeroed;
-
 		if ((--needed) < 0)
 			break;
 		/*
-		 * This page is now managed by the hugetlb allocator and has
-		 * no users -- drop the buddy allocator's reference.
+		 * The refcount can possibly be increased by memory-failure or
+		 * soft_offline handlers.
 		 */
-		zeroed = put_page_testzero(page);
-		VM_BUG_ON_PAGE(!zeroed, page);
-		enqueue_huge_page(h, page);
+		if (likely(put_page_testzero(page)))
+			enqueue_huge_page(h, page);
 	}
 free:
 	spin_unlock_irq(&hugetlb_lock);
--
2.11.0