Re: [PATCH v3] mm/memory-failure: fix hugetlb_lock AA deadlock in get_huge_page_for_hwpoison

From: Oscar Salvador (SUSE)

Date: Wed May 20 2026 - 04:34:36 EST


On Wed, May 20, 2026 at 10:01:28AM +0800, Wupeng Ma wrote:
> madvise(MADV_HWPOISON) can trigger a recursive spinlock self-deadlock
> (AA deadlock) on hugetlb_lock due to a race with concurrent folio
> unmapping. The race scenario:
>
> Thread 1 (madvise MADV_HWPOISON)       Thread 2 (unmap)
> --------------------------------       ----------------
> madvise_inject_error()
> get_user_pages_fast() <- refcount++
> memory_failure(MF_COUNT_INCREASED)
> get_huge_page_for_hwpoison()
> spin_lock_irq(&hugetlb_lock)
> // refcount == 2 (gup + map)
> // MF_COUNT_INCREASED path:
> count_increased = true
>                                        zap_pte_range()
>                                        page_remove_rmap()
>                                        put_page() <- drops map ref
>                                        // refcount: 2 -> 1

Ok, bear with me.
I am not saying the change itself is wrong (maybe it is not), but how did we end
up in zap_pte_range() for a hugetlb folio?
The stacktrace does not seem to make much sense.




--
Oscar Salvador
SUSE Labs