Re: [PATCH v3] mm/memory-failure: fix hugetlb_lock AA deadlock in get_huge_page_for_hwpoison

From: mawupeng

Date: Wed May 20 2026 - 07:48:12 EST

On Wed 2026-5-20 16:13, Oscar Salvador (SUSE) wrote:
> On Wed, May 20, 2026 at 10:01:28AM +0800, Wupeng Ma wrote:
>> madvise(MADV_HWPOISON) can trigger a recursive spinlock self-deadlock
>> (AA deadlock) on hugetlb_lock due to a race with concurrent folio
>> unmapping. The race scenario:
>>
>> Thread 1 (madvise MADV_HWPOISON)        Thread 2 (unmap)
>> --------------------------------        ----------------
>> madvise_inject_error()
>>   get_user_pages_fast()  <- refcount++
>>   memory_failure(MF_COUNT_INCREASED)
>>     get_huge_page_for_hwpoison()
>>       spin_lock_irq(&hugetlb_lock)
>>       // refcount == 2 (gup + map)
>>       // MF_COUNT_INCREASED path:
>>       count_increased = true
>>                                         zap_pte_range()
>>                                           page_remove_rmap()
>>                                           put_page()  <- drops map ref
>>                                           // refcount: 2 -> 1
>
> Ok, bear with me.
> I am not saying the change itself is wrong (maybe it is not), but how we ended
> up in zap_pte_range() for a hugetlb folio?
> The stacktrace does not seem to have much sense?

You are correct. The refcount-dropping sequence described for the `unmap` path was indeed flawed.
This issue was originally uncovered by fuzzing. Based on the initial stack trace,
we diagnosed it as a recursive locking (AA) deadlock on `hugetlb_lock`.

We initially suspected that `unmap` had dropped the folio reference prematurely,
triggering the free path. However, after analyzing the refcount transitions and
the actual execution context, we confirmed that this scenario cannot occur. The
root cause lies elsewhere in the locking hierarchy, and we are still tracing the
exact call path that leads to the nested `hugetlb_lock` acquisition.

The deadlock can be triggered by injecting hardware poison errors on a hugetlb
page while concurrent unmapping activity occurs. The following minimal userspace
test case demonstrates the race condition by spawning multiple processes to
widen the timing window for the lock contention.

/*
 * Repro: hugetlb_lock AA deadlock via consecutive MADV_HWPOISON
 */
main():
    // 1. Map hugetlb page
    mmap(0x20000000, 0xa000, PROT_READ|PROT_EXEC,
         MAP_LOCKED|MAP_HUGETLB|MAP_FIXED, -1, 0)

    // 2. Inject hwpoison twice on same page
    madvise(0x20000000, 0x4000, MADV_HWPOISON)
    madvise(0x20000000, 0x4000, MADV_HWPOISON)

    // 3. Repeat in a fork loop to widen race window

Here is the full lockdep report for the AA deadlock:

============================================
WARNING: possible recursive locking detected
7.0.0-g0eba4942ce53-dirty #335 Not tainted
--------------------------------------------
repro/742 is trying to acquire lock:
ffff800083702958 (hugetlb_lock){....}-{3:3}, at: free_huge_folio+0x194/0x3c0

but task is already holding lock:
ffff800083702958 (hugetlb_lock){....}-{3:3}, at: get_huge_page_for_hwpoison+0x38/

other info that might help us debug this:
Possible unsafe locking scenario:

CPU0
----
lock(hugetlb_lock);
lock(hugetlb_lock);

*** DEADLOCK ***

May be due to missing lock nesting notation

2 locks held by repro/742:
#0: ffff800086f942b0 (mf_mutex){+.+.}-{4:4}, at: memory_failure+0x6c/0xdd8
#1: ffff800083702958 (hugetlb_lock){....}-{3:3}, at: get_huge_page_for_hwpoison

stack backtrace:
CPU: 2 UID: 0 PID: 742 Comm: repro Not tainted 7.0.0-g0eba4942ce53-dirty #335
Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
Call trace:
__lock_acquire+0xe7c/0x1e38
lock_acquire+0x2b8/0x3f8
_raw_spin_lock_irqsave+0x74/0xd8
free_huge_folio+0x194/0x3c0
__folio_put+0x124/0x130
__get_huge_page_for_hwpoison+0x138/0x358
get_huge_page_for_hwpoison+0x48/0x78
memory_failure+0xb4/0xdd8
madvise_do_behavior+0x39c/0x660
do_madvise+0xe4/0x158
__arm64_sys_madvise+0x2c/0x48
