Re: [PATCH v5 1/4] mm/page_alloc: only free healthy pages in high-order has_hwpoisoned folio

From: Miaohe Lin

Date: Mon Jun 08 2026 - 23:45:26 EST


On 2026/5/31 13:58, Jiaqi Yan wrote:
> At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
> becomes non-HugeTLB, and it is released to buddy allocator
> as a high-order folio, e.g. a folio that contains 262144 pages
> if the folio was a 1G HugeTLB hugepage.
>
> This is problematic if the HugeTLB hugepage contained HWPoison
> subpages. In that case, since buddy allocator does not check
> HWPoison for non-zero-order folio, the raw HWPoison page can
> be given out with its buddy page and be re-used by either
> kernel or userspace.
>
> Memory failure recovery (MFR) in kernel does attempt to take
> raw HWPoison page off buddy allocator after
> dissolve_free_hugetlb_folio(). However, there is always a time
> window between dissolve_free_hugetlb_folio() frees a HWPoison
> high-order folio to buddy allocator and MFR takes HWPoison
> raw page off buddy allocator.
>
> Another similar situation is when a transparent huge page (THP)
> runs into memory failure but splitting failed. Such THP will
> eventually be released to buddy allocator when owning userspace
> processes are gone, but with certain subpages having HWPoison.
>
> One obvious way to avoid both problems is to add page sanity
> checks in page allocate or free path. However, it is against
> the past efforts to reduce sanity check overhead [1,2,3].
>
> Introduce free_has_hwpoisoned() to only free the healthy pages
> and to exclude the HWPoison ones in the high-order folio.
> The idea is to iterate through the sub-pages of the folio to
> identify contiguous ranges of healthy pages.
>
> free_has_hwpoisoned() is added in free_pages_prepare() as
> a shortcut and is only invoked if PG_has_hwpoisoned indicates
> HWPoison page exists and after checks and preparations in
> free_pages_prepare() all succeeded. free_has_hwpoisoned() then
> can re-use free_prepared_contig_range() [4] to decompose healthy
> ranges into the largest possible chunks of different orders.
> Every chunk meets the requirements to be freed via free_one_page().
>
> free_has_hwpoisoned() has linear time complexity wrt the number
> of pages in the folio. While the power-of-two decomposition
> ensures that the number of calls to the buddy allocator is
> logarithmic for each contiguous healthy range, the mandatory
> linear scan of pages to identify PageHWPoison() defines the
> overall time complexity. For a 1G hugepage having 8 HWPoison
> pages, free_has_hwpoisoned() takes around 1ms on average on
> a system having 56 Intel Skylake physical cores. This is
> 15x to the case of freeing no HWPoison page. The cost is far
> from triggering soft lockup, and fair for handling exceptional
> hardware memory errors.
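
For readers following the thread: the power-of-two decomposition mentioned
above can be pictured with the userspace sketch below. It is a generic
illustration under stated assumptions, not the code of
free_prepared_contig_range(). The names largest_order(), decompose_range()
and SKETCH_MAX_ORDER are hypothetical, and the cap of 10 is an assumed value.

```c
#include <assert.h>

#define SKETCH_MAX_ORDER 10	/* assumed cap; the real limit is config dependent */

/* Largest order such that pfn is aligned to it and the chunk fits in nr_pages. */
static unsigned int largest_order(unsigned long pfn, unsigned long nr_pages)
{
	unsigned int order = 0;

	assert(nr_pages > 0);
	/* Try order + 1 while it stays aligned, fits, and is within the cap. */
	while (order < SKETCH_MAX_ORDER &&
	       !(pfn & ((2UL << order) - 1)) &&
	       (2UL << order) <= nr_pages)
		++order;

	return order;
}

/*
 * Split the healthy range [pfn, pfn + nr_pages) into naturally aligned
 * power-of-two chunks. Stores each chunk's order and returns the chunk count.
 */
static unsigned int decompose_range(unsigned long pfn, unsigned long nr_pages,
				    unsigned int *orders, unsigned int max_chunks)
{
	unsigned int n = 0;

	while (nr_pages > 0 && n < max_chunks) {
		unsigned int order = largest_order(pfn, nr_pages);

		orders[n++] = order;	/* one buddy free per chunk */
		pfn += 1UL << order;
		nr_pages -= 1UL << order;
	}
	return n;
}
```

With this sketch, the range of 13 pages starting at pfn 3 splits into three
chunks of order 0, 2 and 3, so the number of frees grows with the logarithm
of the range length rather than with the page count.
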
>
> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@xxxxxxxxxxxxxxxxxxx
> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@xxxxxxxxxxxxxxxxxxx
> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@xxxxxxx
> [4] https://lore.kernel.org/all/20260401101634.2868165-2-usama.anjum@xxxxxxx
>
> Signed-off-by: Jiaqi Yan <jiaqiyan@xxxxxxxxxx>

Thanks for the update. This patch looks good to me overall; some comments below.

> ---
> mm/page_alloc.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 85 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e47679e7a9db..03df929abca6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -208,6 +208,7 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
> unsigned int pageblock_order __read_mostly;
> #endif
>
> +static void free_has_hwpoisoned(struct page *page, unsigned int order);
> static void __free_pages_ok(struct page *page, unsigned int order,
> fpi_t fpi_flags);
> static void reserve_highatomic_pageblock(struct page *page, int order,
> @@ -1309,6 +1310,14 @@ static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr)
>
> #endif /* CONFIG_MEM_ALLOC_PROFILING */
>
> +/*
> + * Returns
> + * - true: checks and preparations all good, caller can proceed freeing.
> + * - false: do not proceed freeing for one of the following reasons:
> + * 1. Some check failed so it is not safe to proceed freeing.
> + * 2. A compound page has some HWPoison pages. The healthy pages
> + * are already safely freed, and the HWPoison ones isolated.
> + */
> static __always_inline bool __free_pages_prepare(struct page *page,
> unsigned int order, fpi_t fpi_flags)
> {
> @@ -1317,6 +1326,15 @@ static __always_inline bool __free_pages_prepare(struct page *page,
> bool init = want_init_on_free();
> bool compound = PageCompound(page);
> struct folio *folio = page_folio(page);
> + /*
> + * When dealing with compound page, PG_has_hwpoisoned is cleared
> + * with PAGE_FLAGS_SECOND. So the check must be done first.
> + *
> + * Note we can't exclude PG_has_hwpoisoned from PAGE_FLAGS_SECOND.
> + * Because PG_has_hwpoisoned == PG_active, free_page_is_bad() will
> + * confuse and complaint that the first tail page is still active.
> + */
> + bool should_fhh = compound && folio_test_has_hwpoisoned(folio);
>
> if (fpi_flags & FPI_PREPARED)
> return true;
> @@ -1443,6 +1461,16 @@ static __always_inline bool __free_pages_prepare(struct page *page,
>
> debug_pagealloc_unmap_pages(page, 1 << order);
>
> + /*
> + * After breaking down compound page and dealing with page metadata
> + * (e.g. page owner and page alloc tags), take a shortcut if this
> + * was a compound page containing certain HWPoison subpages.
> + */
> + if (should_fhh) {
> + free_has_hwpoisoned(page, order);
> + return false;
> + }

By the time the code reaches here, the hwpoisoned pages have already passed through
kernel_poison_pages(), kasan_poison_pages(), kernel_init_pages(), arch_free_page()...
These functions might write to the hwpoisoned pages. Is it safe to do so?

> +
> return true;
> }
>
> @@ -6936,6 +6964,63 @@ void __free_contig_range(unsigned long pfn, unsigned long nr_pages)
> __free_contig_range_common(pfn, nr_pages, /* is_frozen= */ false);
> }
>
> +/*
> + * Given a high-order compound page containing certain number of HWPoison
> + * pages, free only the healthy ones.
> + *
> + * Pages must have passed free_pages_prepare(). Even if having HWPoison
> + * pages, breaking down compound page and updating metadata (e.g. page
> + * owner, alloc tag) can be done together during free_pages_prepare(),
> + * which simplifies the splitting here: unlike __split_unmapped_folio(),
> + * there is no need to turn split pages into a compound page or to carry
> + * metadata.
> + *
> + * It scans every raw page of the compound page and cause nontrivial overhead.
> + * So only use this when the compound page contains HWPoison page(s).
> + *
> + * This implementation needs rework in memdesc world.
> + */
> +static void free_has_hwpoisoned(struct page *page, unsigned int order)
> +{
> + unsigned long curr = page_to_pfn(page);
> + unsigned long end_pfn = curr + (1 << order);
> + unsigned long next;
> + unsigned long total_freed = 0;
> + unsigned long total_hwp = 0;
> +
> + VM_WARN_ON(order == 0);
> + VM_WARN_ON(page->flags.f & PAGE_FLAGS_CHECK_AT_PREP);
> +
> + while (curr < end_pfn) {
> + next = curr;
> +
> + while (next < end_pfn && !PageHWPoison(pfn_to_page(next)))
> + ++next;
> +
> + if (next != end_pfn && PageHWPoison(pfn_to_page(next))) {

Checking next != end_pfn should be enough. If next != end_pfn, then PageHWPoison(pfn_to_page(next))
must be true, otherwise we could not have exited the while loop above. Or am I missing something?
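
To make the argument concrete, here is a minimal userspace sketch of the inner
scan. It is only an illustration: pfn_is_poisoned(), scan_healthy() and
checks_agree() are hypothetical names, a plain bool array stands in for
PageHWPoison(pfn_to_page(pfn)), and it assumes the HWPoison flag does not
change between the loop exit and the if check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for PageHWPoison(pfn_to_page(pfn)). */
static bool pfn_is_poisoned(const bool *poisoned, unsigned long pfn)
{
	return poisoned[pfn];
}

/* Same shape as the inner while loop in free_has_hwpoisoned(). */
static unsigned long scan_healthy(const bool *poisoned, unsigned long curr,
				  unsigned long end_pfn)
{
	unsigned long next = curr;

	while (next < end_pfn && !pfn_is_poisoned(poisoned, next))
		++next;

	/* Loop exit with next != end_pfn implies the page at next is poisoned. */
	assert(next == end_pfn || pfn_is_poisoned(poisoned, next));
	return next;
}

/* Returns true if the two-part check and the short check never disagree. */
static bool checks_agree(const bool *poisoned, unsigned long end_pfn)
{
	unsigned long curr = 0;

	while (curr < end_pfn) {
		unsigned long next = scan_healthy(poisoned, curr, end_pfn);
		bool full = next != end_pfn && pfn_is_poisoned(poisoned, next);
		bool brief = next != end_pfn;

		if (full != brief)
			return false;
		curr = next + 1;	/* skip the poisoned page, if any */
	}
	return true;
}
```

If the flag could be cleared concurrently between the two reads, the second
check would no longer be strictly redundant, so the assumption above matters.
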

Thanks.