Re: [PATCH v6 1/2] mm,hwpoison: fix race with hugetlb page allocation

From: Luck, Tony
Date: Thu Aug 12 2021 - 00:28:16 EST


On Fri, Jun 04, 2021 at 08:36:31AM +0900, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
>
> When hugetlb page fault (under overcommitting situation) and
> memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:
>
>     CPU0:                           CPU1:
>
>                                     gather_surplus_pages()
>                                       page = alloc_surplus_huge_page()
>     memory_failure_hugetlb()
>       get_hwpoison_page(page)
>         __get_hwpoison_page(page)
>           get_page_unless_zero(page)
>                                       zero = put_page_testzero(page)
>                                       VM_BUG_ON_PAGE(!zero, page)
>                                       enqueue_huge_page(h, page)
>       put_page(page)
>
> __get_hwpoison_page() only checks the page refcount before taking an
> additional one for memory error handling, which is not enough because
> there's a time window where compound pages have non-zero refcount during
> hugetlb page initialization.
>
> So make __get_hwpoison_page() check page status a bit more for hugetlb
> pages with get_hwpoison_huge_page(). Checking hugetlb-specific flags
> under hugetlb_lock makes sure that the hugetlb page is not transitive.
> It's notable that another new function, HWPoisonHandlable(), is helpful
> to prevent a race against other transitive page states (like a generic
> compound page just before PageHuge becomes true).

I'm seeing some strange results when doing a simple injection/recovery.

Current upstream often fails to offline the page with messages like:

"high-order kernel page"
or
"unknown page"

Things were working in v5.12. Broken in v5.13.

Bisect says that:

25182f05ffed ("mm,hwpoison: fix race with hugetlb page allocation")

is the culprit (though it is possible that there is more than one
issue ... failure symptoms changed a bit during the bisection).

This commit doesn't revert cleanly from current upstream, but it
does revert cleanly from v5.13. Running v5.13 with this commit reverted
gives a kernel that recovers normally[1] from hundreds of consecutive
error injections.

-Tony

[1] Almost normally. My test catches SIGBUS and prints the virtual
address from the siginfo_t structure. Sometimes the address is correct;
other times it is NULL.
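
For anyone trying to reproduce the check described in [1], here is a
minimal sketch of a SIGBUS handler that records si_addr and si_code from
the siginfo_t. This is not the actual test program: the names
sigbus_handler(), install_sigbus_handler() and report_sigbus() are
hypothetical, and the error injection step itself is left out.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Details of the last SIGBUS, filled in by the handler (hypothetical names). */
static volatile sig_atomic_t sigbus_seen;
static volatile sig_atomic_t sigbus_code;
static void *volatile sigbus_addr;

/* Only stores values; printing happens later, outside signal context. */
static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig;
	(void)ctx;
	sigbus_addr = si->si_addr;	/* faulting virtual address, if the kernel set one */
	sigbus_code = si->si_code;
	sigbus_seen = 1;
}

/* SA_SIGINFO is required, otherwise the handler gets no siginfo_t. */
static int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}

/* Print what the handler recorded; a NULL si_addr shows up as (nil). */
static void report_sigbus(void)
{
	const char *kind = "other";

#ifdef BUS_MCEERR_AR
	if (sigbus_code == BUS_MCEERR_AR)
		kind = "BUS_MCEERR_AR";
#endif
#ifdef BUS_MCEERR_AO
	if (sigbus_code == BUS_MCEERR_AO)
		kind = "BUS_MCEERR_AO";
#endif
	printf("SIGBUS si_code=%s si_addr=%p\n", kind, sigbus_addr);
}
```

A test would call install_sigbus_handler() before touching the injected
page and report_sigbus() once sigbus_seen is set. Note that simply
returning from the handler after a consumed error re-executes the
faulting access, so a real test would probably need siglongjmp() or
similar to get out; that part is not shown here.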