Re: [PATCH v3] mm/memory-failure: fix hugetlb_lock AA deadlock in get_huge_page_for_hwpoison

From: David Hildenbrand (Arm)

Date: Wed May 20 2026 - 06:38:51 EST


On 5/20/26 10:13, Oscar Salvador (SUSE) wrote:
> On Wed, May 20, 2026 at 10:01:28AM +0800, Wupeng Ma wrote:
>> madvise(MADV_HWPOISON) can trigger a recursive spinlock self-deadlock
>> (AA deadlock) on hugetlb_lock due to a race with concurrent folio
>> unmapping. The race scenario:
>>
>> Thread 1 (madvise MADV_HWPOISON)      Thread 2 (unmap)
>> --------------------------------      ----------------
>> madvise_inject_error()
>> get_user_pages_fast() <- refcount++
>> memory_failure(MF_COUNT_INCREASED)
>> get_huge_page_for_hwpoison()
>> spin_lock_irq(&hugetlb_lock)
>> // refcount == 2 (gup + map)
>> // MF_COUNT_INCREASED path:
>> count_increased = true
>>                                       zap_pte_range()
>>                                       page_remove_rmap()
>>                                       put_page() <- drops map ref
>>                                       // refcount: 2 -> 1
>
> Ok, bear with me.
> I am not saying the change itself is wrong (maybe it is not), but how did we
> end up in zap_pte_range() for a hugetlb folio?
> The stack trace does not seem to make much sense.

Right, that does not make sense. Not even page_remove_rmap() makes sense,
because that function is long gone.

Undisclosed usage of shitty AI?

--
Cheers,

David