Re: [PATCH] mm: page_isolation: Avoid hugepage scan step underflow

From: Kaitao Cheng

Date: Tue Jun 02 2026 - 03:15:02 EST

在 2026/6/2 00:21, David Hildenbrand (Arm) 写道:
> On 5/22/26 11:35, Kaitao Cheng wrote:
>> 在 2026/5/20 01:54, Andrew Morton 写道:
>>> On Tue, 19 May 2026 20:16:46 +0800 Kaitao Cheng <kaitao.cheng@xxxxxxxxx> wrote:
>>>
>>>> page_is_unmovable() checks HugeTLB pages without holding hugetlb_lock and
>>>> without pinning the folio. This is intentional for the pageblock scanning
>>>> paths, but it means the HugeTLB folio can be freed concurrently after
>>>> PageHuge() or folio_test_hugetlb() succeeds.
>>>
>>> Thanks. What are the userspace-visible runtime effects of this?
>>
>> The direct user-visible effect is not memory corruption, but degraded forward
>> progress in the page isolation / contiguous allocation path.
>>
>> If the race makes folio_nr_pages() return 1 while the current page is still
>> treated as a tail page of the old HugeTLB folio, the computed step can
>> underflow:
>>
>> step = folio_nr_pages(folio) - folio_page_idx(folio, page);
>>
>> The caller then does:
>>
>> start_pfn += step;
>>
>> With unsigned arithmetic this can wrap and move start_pfn backwards, typically
>> near the beginning of the old hugepage range rather than advancing past it. In
>> many cases this only means rescanning part of the same hugepage range, so the
>> effect may be limited to extra scanning work.
>>
>> However, it still violates the scanner's forward-progress assumption: step
>> is expected to advance start_pfn. If the same transient state is observed
>> repeatedly, the scanner can keep revisiting the same PFNs, causing excessive
>> latency and, in the worst case, an apparent stall in operations that rely on
>> page isolation or contiguous allocation.
>>
>>
>> Here is another point raised by AI, the old code also used folio_test_lru()
>> on a folio pointer obtained without holding a reference. If the folio is
>> freed and the old head page is reused or observed as a tail page of another
>> compound page, folio_test_lru() can reach const_folio_flags(), which asserts
>> that the passed folio is not a tail page. On DEBUG_VM kernels, that can
>> trigger a VM_BUG_ON_PGFLAGS() crash.
>>
>>>> The existing code avoids folio_hstate() and uses size_to_hstate() because
>>>> the HugeTLB flag may already have been cleared. However, if
>>>> size_to_hstate() returns NULL, the code still falls through and computes
>>>> the scan step from folio_nr_pages(). If the folio has been freed and the
>>>> head/large state has been cleared, folio_nr_pages() can return 1. When the
>>>> current page is a tail page, subtracting folio_page_idx() from 1 can
>>>> underflow and make the scanner skip too far.
>>>>
>>>> Treat a NULL hstate as unmovable so the scanner does not try to skip over
>>>> an unstable HugeTLB folio. Once a valid hstate is found, derive the number
>>>> of pages from the hstate instead of reading the folio size again. Also
>>>> validate the page index before computing the step to avoid underflow if the
>>>> page/folio relationship changed concurrently.
>>>
>>> This code sounds rather sketchy, and it sounds like it will remain
>>> sketchy after the patch. And AI review says "hey, this code is
>>> sketchy":
>>> https://sashiko.dev/#/patchset/20260519121646.40833-1-kaitao.cheng@xxxxxxxxx
>>
>> Following David Hildenbrand's suggestion, I made some changes as shown below.
>> I'm not sure whether there are still any other issues.
>>
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -41,8 +41,14 @@ bool page_is_unmovable(struct zone *zone, struct page *page,
>> * We need not scan over tail pages because we don't
>> * handle each tail page individually in migration.
>> */
>> - if (PageHuge(page) || PageCompound(page)) {
>> + if (PageCompound(page)) {
>> struct folio *folio = page_folio(page);
>> + unsigned long nr_pages, pfn;
>> + unsigned int order;
>> +
>> + order = compound_order(&folio->page);
>> + if (order > MAX_FOLIO_ORDER)
>> + return true;
>
> Would we also have to care about non-order-of-2?

There should not be any non-order-of-2 compound pages here. Neither the
buddy allocator's allocation model nor the folio_size() / size_to_hstate()
implementation supports non-order-of-two compound pages.

>>
>> if (folio_test_hugetlb(folio)) {
>> struct hstate *h;
>> @@ -54,15 +60,16 @@ bool page_is_unmovable(struct zone *zone, struct page *page,
>> * The huge page may be freed so can not
>> * use folio_hstate() directly.
>> */
>> - h = size_to_hstate(folio_size(folio));
>> - if (h && !hugepage_migration_supported(h))
>> + h = size_to_hstate(PAGE_SIZE << order);
>> + if (!h || !hugepage_migration_supported(h))
>> return true;
>> -
>> - } else if (!folio_test_lru(folio)) {
>> + } else if (!PageLRU(page)) {
>
> Hm, is that required because we could VM_BUG_ON?

Yes, that is one of the reasons for the change.

page_is_unmovable() does not hold a reference on the folio, and the folio
can be freed and reused concurrently while this scanner is running. In that
case, the previously obtained folio pointer may no longer refer to a valid
folio head. If the original head page is reused as a tail page of another
compound page, folio_test_lru(folio) can hit the folio flag checks and
potentially trigger a VM_BUG_ON_PGFLAGS().

>> return true;
>> }
>>
>> - *step = folio_nr_pages(folio) - folio_page_idx(folio, page);
>> + nr_pages = 1UL << order;
>> + pfn = page_to_pfn(page);
>> + *step = (pfn | (nr_pages - 1)) + 1 - pfn;
>> return false;
>> }
>>
>
>

--
Thanks
Kaitao Cheng