Re: [RFC PATCH] mm/huge_memory: do not add dropped split tail folios to LRU

From: Zhaoyang Huang

Date: Wed Jun 10 2026 - 22:39:57 EST

On Thu, Jun 11, 2026 at 9:56 AM Zi Yan <ziy@xxxxxxxxxx> wrote:
>
> On 10 Jun 2026, at 21:39, Zhaoyang Huang wrote:
>
> > On Wed, Jun 10, 2026 at 10:38 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
> >>
> >> On 10 Jun 2026, at 8:50, David Hildenbrand (Arm) wrote:
> >>
> >>> On 6/10/26 14:05, zhaoyang.huang wrote:
> >>>> From: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
> >>>>
> >>>> Kernel panics keep being reported, especially when the f2fs
> >>>> partition is almost full. Our investigation shows that an f2fs page
> >>>> was freed to buddy without being deleted from the LRU, and that the
> >>>> root cause is the race shown in [2], which is triggered by f2fs commit
> >>>> 9609dd704725 ("f2fs: remove non-uptodate folio from the page cache in
> >>>> move_data_block"). We worked around the issue by reverting that commit.
> >>>
> >>> But I assume, that other FSes can trigger this as well? Any insights?
> >
> > Yes, I think every FS that supports large folios could suffer from this defect.
> >
> >>>
> >>>>
> >>>> Three paths race in this scenario; their main activities are described
> >>>> below. Looking further at the code, I think there is a common race
> >>>> window for truncated folios between split_folio_to_order() and
> >>>> folio_isolate_lru(): the folio has lost its page-cache references and
> >>>> holds only the split caller's transient reference, so it can enter the
> >>>> free path and compete with the isolation path. This commit suggests
> >>>> keeping folios beyond EOF off the LRU.
> >>>>
> >>>> Truncate:
> >>>> The changed code in move_data_block() lets the GC path evict the tail-end
> >>>> folio from the page cache through folio_end_dropbehind(). Once
> >>>> folio_unmap_invalidate() removes the folio from mapping->i_pages, the
> >>>> page-cache references for all pages in the folio are dropped. The folio
> >>>> is then kept alive only by temporary external references, which allows a
> >>>> later split to operate on a folio whose subpages are no longer protected
> >>>> by page-cache references.
> >>>>
> >>>> Split:
> >>>> After the page-cache references are gone, split_folio_to_order() can
> >>>> split the big folio into individual pages and put the resulting subpages
> >>>> back on the LRU. For tail pages beyond EOF, split removes them from the
> >>>> page cache and drops their page-cache references. A tail page can then
> >>>> remain on the LRU with PG_lru set while holding only the split caller's
> >>>> temporary reference. When free_folio_and_swap_cache() drops that final
> >>>> reference, the page enters the final folio_put() release path.
> >>>>
> >>>> Isolate:
> >>>> In parallel, folio_isolate_lru() can observe the same tail page with a
> >>>> non-zero refcount and PG_lru set. It clears PG_lru before taking its own
> >>>> reference. If this races with the final folio_put() from the split path,
> >>>> __folio_put() sees PG_lru already cleared and skips lruvec_del_folio().
> >>>> The page is then freed back to the allocator while its lru links are
> >>>> still present in the LRU list. A later LRU operation on a neighboring
> >>>> page detects the stale link and reports list corruption.
> >>>
> >>> Complicated mess :(
> >>>
> >>> So, folio_isolate_lru() really only requires the caller to hold a folio
> >>> reference, which can happen given that we did the folio_ref_unfreeze(). It can,
> >>> for example, be triggered by memory offlining or page migration.
> >>>
> >>> So we really want to not allow folio_isolate_lru() while we are still processing
> >>> the folio.
> >>
> >> Or we should defer adding split folios to LRU after unfreeze.
> >>
> >>>
> >>> What your patch does is, simply not add folios that we will drop from the page
> >>> cache to the LRU?
> >>>
> >>>
> >>> You should describe here how you are fixing it: "Let's fix it by..."
> > Yes. This commit suggests fixing it by having such folios skip
> > lru_add_split_folio().
>
> Skipping it causes more issues like LRU counter mismatch, firing up bad_page()
> since PG_active, PG_unevictable, or MGLRU fields in ->flags.f could stay
> uncleared at page free time.

OK, we should solve this issue.
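
To make sure I understand the concern, here is a small userspace sketch of it. All toy_* names are hypothetical, and the model assumes that a split tail inherits the head's active/unevictable bits and that those bits are only cleared on the "was on LRU" branch of the release path. It is an illustration of the point above, not the kernel code.

```c
#include <stdbool.h>

/* Hypothetical flag bits; they only mimic the roles of the PG_* flags. */
#define TOY_PG_LRU          (1u << 0)
#define TOY_PG_ACTIVE       (1u << 1)
#define TOY_PG_UNEVICTABLE  (1u << 2)
#define TOY_CHECK_AT_FREE   (TOY_PG_LRU | TOY_PG_ACTIVE | TOY_PG_UNEVICTABLE)

struct toy_page {
	unsigned int flags;
};

/* Assumption: the tail copies the head's LRU state bits at split time. */
static void toy_split_tail(const struct toy_page *head, struct toy_page *tail,
			   bool add_to_lru)
{
	tail->flags = head->flags & (TOY_PG_ACTIVE | TOY_PG_UNEVICTABLE);
	if (add_to_lru)
		tail->flags |= TOY_PG_LRU;
}

/* Release path: LRU state bits are cleared only if the page was on the LRU. */
static void toy_release(struct toy_page *page)
{
	if (page->flags & TOY_PG_LRU)
		page->flags &= ~TOY_CHECK_AT_FREE;
}

/* Free-time sanity check: any leftover state bit means a "bad page". */
static bool toy_free_is_clean(const struct toy_page *page)
{
	return (page->flags & TOY_CHECK_AT_FREE) == 0;
}

/* Split an active head, release the tail, report whether the free is clean. */
static bool toy_tail_frees_cleanly(bool add_to_lru)
{
	struct toy_page head = { .flags = TOY_PG_LRU | TOY_PG_ACTIVE };
	struct toy_page tail = { .flags = 0 };

	toy_split_tail(&head, &tail, add_to_lru);
	toy_release(&tail);
	return toy_free_is_clean(&tail);
}
```

In this model toy_tail_frees_cleanly(true) passes the free-time check, while toy_tail_frees_cleanly(false) leaves the active bit set. So if the LRU add is skipped, the state bits would have to be cleared some other way.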
>
> >>>
> >>>>
> >>>> [1]
> >>>> [ 22.486082] list_del corruption. next->prev should be fffffffec10e0ac8, but was dead000000000122. (next=fffffffec10e0a88)
> >>>> [ 22.486130] ------------[ cut here ]------------
> >>>> [ 22.486134] kernel BUG at lib/list_debug.c:67!
> >>>> [ 22.486141] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> >>>> [ 22.488502] Tainted: [W]=WARN, [O]=OOT_MODULE
> >>>> [ 22.488506] Hardware name: Spreadtrum UMS9230 1H10 SoC (DT)
> >>>> [ 22.488511] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >>>> [ 22.488517] pc : __list_del_entry_valid_or_report+0x14c/0x154
> >>>> [ 22.488531] lr : __list_del_entry_valid_or_report+0x14c/0x154
> >>>> [ 22.488539] sp : ffffffc08006b830
> >>>> [ 22.488542] x29: ffffffc08006b868 x28: 0000000000003020 x27: 0000000000000000
> >>>> [ 22.488553] x26: 0000000000000000 x25: 0000000000000004 x24: fffffffec10e0ac0
> >>>> [ 22.488564] x23: 00000000000000e8 x22: 0000000000000024 x21: dead000000000122
> >>>> [ 22.488574] x20: fffffffec10e0a88 x19: fffffffec10e0ac8 x18: ffffffc080061060
> >>>> [ 22.488585] x17: 20747562202c3863 x16: 6130653031636566 x15: 0000000000000058
> >>>> [ 22.488595] x14: 0000000000000004 x13: ffffff80f91e0000 x12: 0000000000000003
> >>>> [ 22.488605] x11: 0000000000000003 x10: 0000000000000001 x9 : ffe85721f0e25f00
> >>>> [ 22.488615] x8 : ffe85721f0e25f00 x7 : 0000000000000000 x6 : 6c65645f7473696c
> >>>> [ 22.488625] x5 : ffffffed39b23026 x4 : 0000000000000000 x3 : 0000000000000010
> >>>> [ 22.488636] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 000000000000006d
> >>>> [ 22.488647] Call trace:
> >>>> [ 22.488651] __list_del_entry_valid_or_report+0x14c/0x154 (P)
> >>>> [ 22.488661] __folio_put+0x2bc/0x434
> >>>> [ 22.488670] folio_put+0x28/0x58
> >>>> [ 22.488678] do_garbage_collect+0x1a34/0x2584
> >>>> [ 22.488689] f2fs_gc+0x230/0x9b4
> >>>> [ 22.488697] f2fs_fallocate+0xb90/0xdf4
> >>>> [ 22.488706] vfs_fallocate+0x1b4/0x2bc
> >>>> [ 22.488716] __arm64_sys_fallocate+0x44/0x78
> >>>> [ 22.488725] invoke_syscall+0x58/0xe4
> >>>> [ 22.488732] do_el0_svc+0x48/0xdc
> >>>> [ 22.488739] el0_svc+0x3c/0x98
> >>>> [ 22.488747] el0t_64_sync_handler+0x20/0x130
> >>>> [ 22.488754] el0t_64_sync+0x1c4/0x1c8
> >>>>
> >>>> [2]
> >>>> CPU0 (f2fs GC)                    CPU1 (split_folio_to_order)       CPU2 (folio_isolate_lru)
> >>>>
> >>>> F: pagecache refs = n
> >>>> F: extra refs = GC + split
> >>>> F: PG_lru set
> >>>> move_data_block()
> >>>> folio = f2fs_grab_cache_folio(F)
> >>>> ...
> >>>> __folio_set_dropbehind(F)
> >>>> folio_unlock(F)
> >>>> folio_end_dropbehind(F)
> >>>> folio_unmap_invalidate(F)
> >>>> __filemap_remove_folio(F)
> >>>> folio_put_refs(F, n)
> >>>> folio_put(F)
> >>>>                                   split_folio_to_order(F)
> >>>>                                   folio_ref_freeze(F, 1)
> >>>>                                   ...
> >>>>                                   lru_add_split_folio(T)
> >>>>                                   list_add_tail(&T->lru, &F->lru)
> >>>>                                   folio_set_lru(T)
> >>>>                                   __filemap_remove_folio(T)
> >>>>                                   folio_put_refs(T, 1)
> >>>>                                   /* T refcount == 1, PageLRU set */
> >>>>                                   free_folio_and_swap_cache(T)
> >>>>                                   folio_put(T)
> >>>>                                   /* refcount: 1 -> 0 */
> >>>>                                                                     folio_isolate_lru(T)
> >>
> >> If refcount is 0 at this point, VM_BUG_ON_FOLIO(!folio_ref_count(folio), folio) in
> >> folio_isolate_lru() would be triggered. Maybe we could just return false in that case.
> > No, isolate caller will grab one refcount.
>
> As I said in another email, isolate caller cannot grab a refcount when folio refcount
> is 0.

pin_user_pages*(..., FOLL_LONGTERM)
└─ __gup_longterm_locked() [gup.c:2465]
   ├─ follow_page_pte() [gup.c:802]
   │  └─ try_grab_folio() [gup.c:858]
   │        if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
   │                return -ENOMEM;
   │
   │        // Could __folio_split() -> folio_put() race here?
   │        if (flags & FOLL_GET)
   │                folio_ref_add(folio, refs);
   └─ check_and_migrate_movable_pages() [gup.c:2490]
      └─ collect_longterm_unpinnable_folios() [gup.c:2391]
         └─ if (!folio_isolate_lru(folio))

Could __folio_split() race in the above scenario? It looks like
try_grab_folio() checks the refcount and then raises it in two separate
steps, not in a single atomic operation.
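
To show what I mean by the window, here is a userspace sketch with C11 atomics. The toy_* names are hypothetical and this is not the gup.c code. It replays "check, final put elsewhere, then add" in one thread and compares it with a try-get that only increments while the count is non-zero (the kernel has folio_try_get() for that style of grab). Whether the kernel can actually reach this ordering depends on what else pins the folio on that path, which the sketch does not model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio refcount; not a kernel type. */
struct toy_ref {
	atomic_int count;
};

/* Check only: looks at the counter but takes no reference. */
static bool toy_ref_is_live(struct toy_ref *r)
{
	return atomic_load(&r->count) > 0;
}

/* Drop one reference; true means the caller dropped the last one. */
static bool toy_ref_put(struct toy_ref *r)
{
	return atomic_fetch_sub(&r->count, 1) == 1;
}

/* Take a reference only while the count is still non-zero, via CAS. */
static bool toy_ref_try_get(struct toy_ref *r)
{
	int old = atomic_load(&r->count);

	while (old > 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Replay "check, final put elsewhere, late add" in a single thread.
 * Returns the count after the late add on an already released object.
 */
static int toy_replay_check_then_add(void)
{
	struct toy_ref r;
	bool live, last;

	atomic_init(&r.count, 1);
	live = toy_ref_is_live(&r);	/* grabber: the check passes */
	last = toy_ref_put(&r);		/* other CPU: 1 -> 0, starts freeing */
	assert(live && last);
	atomic_fetch_add(&r.count, 1);	/* grabber: 0 -> 1 on a dying object */
	return atomic_load(&r.count);
}

/* Same ordering with try-get: the late grab is refused. */
static bool toy_replay_try_get(void)
{
	struct toy_ref r;

	atomic_init(&r.count, 1);
	toy_ref_put(&r);		/* other CPU: 1 -> 0 */
	return toy_ref_try_get(&r);	/* refused, count stays 0 */
}
```

The add itself is atomic in both cases. The difference is only whether the zero check and the increment happen as one step.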

> (from previous mail)
> Wait, if folio->mapping is NULL and folio is not anonymous,
> folio_check_splittable() returns false at the beginning of
> __folio_split(). So the split cannot happen.

As I understand it, the folio checked here is still the large folio,
which is locked and has folio->mapping set, right?
>
> >>
> >>>>                                                                     folio_test_clear_lru(T)
> >>>>                                   __folio_put(T)
> >>>>                                   __page_cache_release(T)
> >>>>                                   folio_test_lru(T) == false
> >>>>                                   /* skip lruvec_del_folio(T) */
> >>>>                                   free_frozen_pages(T)
> >>>>                                                                     folio_get(T)
> >>>>                                                                     lruvec_del_folio(T)
> >>
> >> But in CPU2 (folio_isolate_lru), lruvec_del_folio(T) should remove T from LRU list.
> >>
> >>>> later:
> >>>> list_del(adjacent->lru)
> >>>> next == &T->lru
> >>>> next->prev == LIST_POISON / PCP freelist
> >>>> BUG
> >>>>
> >>
> >> Why does CPU0 still see the stale link from adjacent?
> > The stale link should come from the LRU, since the folio was never deleted from it.
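
The ordering I have in mind can be replayed in userspace like this. The toy_* names are hypothetical, and the model stops right after the final put, i.e. before the isolating side reaches its own lruvec_del_folio(), so it only shows what a neighbour on the list would see in that window. It does not settle whether the kernel can get there.

```c
#include <stdbool.h>

/* Minimal circular doubly linked list, in the style of list_head. */
struct toy_list {
	struct toy_list *prev, *next;
};

static void toy_list_init(struct toy_list *head)
{
	head->prev = head->next = head;
}

static void toy_list_add_tail(struct toy_list *entry, struct toy_list *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Unlink with the list-debug rule: both neighbours must point back at us. */
static bool toy_list_del_checked(struct toy_list *entry)
{
	if (entry->prev->next != entry || entry->next->prev != entry)
		return false;	/* would be reported as list corruption */
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	return true;
}

struct toy_folio {
	struct toy_list lru;
	bool pg_lru;
};

/* Final put: unlink only if the LRU bit is still set, then "free". */
static void toy_final_put(struct toy_folio *folio, struct toy_list *poison)
{
	if (folio->pg_lru) {
		toy_list_del_checked(&folio->lru);
		folio->pg_lru = false;
	}
	/* After the free, the link fields no longer belong to the LRU. */
	folio->lru.prev = folio->lru.next = poison;
}

/* Returns whether deleting the neighbour A still passes the list check. */
static bool toy_neighbour_del_ok(bool isolate_clears_lru_first)
{
	struct toy_list lru_head, poison;
	struct toy_folio a = { .pg_lru = true };
	struct toy_folio t = { .pg_lru = true };

	toy_list_init(&lru_head);
	toy_list_init(&poison);
	toy_list_add_tail(&a.lru, &lru_head);
	toy_list_add_tail(&t.lru, &lru_head);

	if (isolate_clears_lru_first)
		t.pg_lru = false;	/* isolate side: clear-LRU step only */
	toy_final_put(&t, &poison);

	return toy_list_del_checked(&a.lru);
}
```

With isolate_clears_lru_first set, T is freed while A->next still points at it, so the check on A fails in the same way as the "next->prev should be ..." report in [1]. Without it, T is unlinked first and deleting A is fine.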
> >>
> >>>> Assisted-by: Cursor:claude-opus-4-8
> >>>> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
> >>>
> >>> I'm wondering if this has been broken the whole time, or if some rework allowed
> >>> this to trigger.
> > This issue is from AOSP on v6.18, which has just started supporting
> > large folios in f2fs. It triggers when the f2fs partition gets almost
> > full during a test case that fills the partition (that should be what
> > kicks off f2fs GC, which involves the truncate path).
>
> Are you able to reproduce it with other FSes supporting large folio?

Sorry, I can't so far, since only f2fs has GC in the Android system.
>
> >>>
> >>> I assume the issue can be triggered for other FSes, and we want Fixes: + CC: stable?
> >>>
> >>> Looking into the history, I think we always unconditionally did the
> >>> lru_add_split_folio()/lru_add_page_tail().
> >>>
> >>>> ---
> >>>> mm/huge_memory.c | 2 +-
> >>>> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>>> index 970e077019b7..7465525a94a8 100644
> >>>> --- a/mm/huge_memory.c
> >>>> +++ b/mm/huge_memory.c
> >>>> @@ -3966,7 +3966,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
> >>>> folio_ref_unfreeze(new_folio,
> >>>> folio_cache_ref_count(new_folio) + 1);
> >>>>
> >>>> - if (do_lru)
> >>>> + if (do_lru && !(mapping && new_folio->index >= end))
> >>>
> >>> It might be clearer to write this as
> >>>
> >>> do_lru && (!mapping || new_folio->index < end)
> >>>
> >>> To match the page-cache check further below
> >>>
> >>> if (!mapping)
> >>> continue
> >>>
> >>> ...
> >>> if (new_folio->index < end)
> >>> ...
> >>>
> >>>> lru_add_split_folio(folio, new_folio, lruvec, list);
> >>>>
> >>>> /*
> >>>
> >>> folio_check_splittable() makes sure that we have a mapping for non-anon folios.
> >>> (no truncation). end is then only set for non-anon folios.
> >>>
> >>> @Zi, any thoughts?
> >>
> >> The fix works but I feel that it is masking the race between folio_isolate_lru() and
> >> folio_put(). I worry that the same issue might be triggered in other ways or
> >> in new code if we do not fix the race.
> >>
> >> To summarize my thoughts above:
> >> 1. adding frozen folios in LRU might be problematic, since folio_isolate_lru()
> >> has a VM_BUG_ON_FOLIO() for it but still chooses to proceed the isolation.
> >>
> >> 2. the race analysis is not clear, since both folio_isolate_lru() and folio_put()
> >> do lruvec_del_folio() if folio is on LRU. When list_del(adjacent->lru) sees
> >> the stale link, the folio is already in buddy and page->lru is modified for
> >> PageBuddy use? So even without CPU0, folio_isolate_lru()'s lruvec_del_folio()
> >> can do the wrong thing on pages on buddy?
> >>
> >>
> >> --
> >> Best Regards,
> >> Yan, Zi
>
>
> --
> Best Regards,
> Yan, Zi