Re: [RFC PATCH] mm/huge_memory: do not add dropped split tail folios to LRU
From: Zhaoyang Huang
Date: Thu Jun 11 2026 - 03:45:47 EST
+f2fs and Android folks
@jaegeuk, chao and jyescas: this thread discusses an issue related to
f2fs. On Android's v6.18, the kernel panic reported by this RFC
reproduces with commit 9609dd704725 ("f2fs: remove non-uptodate folio
from the page cache in move_data_block") applied and does not
reproduce with it reverted. Could you please take a look at this, or
consider reverting the suspicious commit?
On Thu, Jun 11, 2026 at 11:06 AM Zi Yan <ziy@xxxxxxxxxx> wrote:
>
> On 10 Jun 2026, at 22:39, Zhaoyang Huang wrote:
>
> > On Thu, Jun 11, 2026 at 9:56 AM Zi Yan <ziy@xxxxxxxxxx> wrote:
> >>
> >> On 10 Jun 2026, at 21:39, Zhaoyang Huang wrote:
> >>
> >>> On Wed, Jun 10, 2026 at 10:38 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
> >>>>
> >>>> On 10 Jun 2026, at 8:50, David Hildenbrand (Arm) wrote:
> >>>>
> >>>>> On 6/10/26 14:05, zhaoyang.huang wrote:
> >>>>>> From: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
> >>>>>>
> >>>>>> The kernel panics are keeping to be reported especially when the f2fs
> >>>>>> partition get almost full. By investigation, we find that the reason is
> >>>>>> one f2fs page got freed to buddy without being deleted from LRU and the
> >>>>>> root cause is the race happened in [2] which is enrolled by this commit.
> >>>>>> We solve this issue by reverting a f2fs commit 9609dd704725 ("f2fs: remove
> >>>>>> non-uptodate folio from the page cache in move_data_block").
> >>>>>
> >>>>> But I assume, that other FSes can trigger this as well? Any insights?
> >>>
> >>> Yes, I think all FSes support big folio could suffer from this defect.
> >>>
> >>>>>
> >>>>>>
> >>>>>> There are 3 race processes in this scenario, please find below for their
> >>>>>> main activities. However, by further investigation over the code, I
> >>>>>> think there is a common race window for the truncated folios between
> >>>>>> split_folio_to_order and folio_isolate_lru, where the folios lost the
> >>>>>> refcount on page cache and remains the transient one of the split
> >>>>>> caller, under which the folio could enter free path and compete with the
> >>>>>> isolation process. This commit would like to suggest to have the folios
> >>>>>> beyond EOF stay out of LRU.
> >>>>>>
> >>>>>> Truncate:
> >>>>>> The changed code in move_data_block() lets the GC path evict the tail-end
> >>>>>> folio from the page cache through folio_end_dropbehind(). Once
> >>>>>> folio_unmap_invalidate() removes the folio from mapping->i_pages, the
> >>>>>> page-cache references for all pages in the folio are dropped. The folio
> >>>>>> is then kept alive only by temporary external references, which allows a
> >>>>>> later split to operate on a folio whose subpages are no longer protected
> >>>>>> by page-cache references.
> >>>>>>
> >>>>>> Split:
> >>>>>> After the page-cache references are gone, split_folio_to_order() can
> >>>>>> split the big folio into individual pages and put the resulting subpages
> >>>>>> back on the LRU. For tail pages beyond EOF, split removes them from the
> >>>>>> page cache and drops their page-cache references. A tail page can then
> >>>>>> remain on the LRU with PG_lru set while holding only the split caller's
> >>>>>> temporary reference. When free_folio_and_swap_cache() drops that final
> >>>>>> reference, the page enters the final folio_put() release path.
> >>>>>>
> >>>>>> Isolate:
> >>>>>> In parallel, folio_isolate_lru() can observe the same tail page with a
> >>>>>> non-zero refcount and PG_lru set. It clears PG_lru before taking its own
> >>>>>> reference. If this races with the final folio_put() from the split path,
> >>>>>> __folio_put() sees PG_lru already cleared and skips lruvec_del_folio().
> >>>>>> The page is then freed back to the allocator while its lru links are
> >>>>>> still present in the LRU list. A later LRU operation on a neighboring
> >>>>>> page detects the stale link and reports list corruption.
> >>>>>
> >>>>> Complicated mess :(
> >>>>>
> >>>>> So, folio_isolate_lru() really only requires the caller to hold a folio
> >>>>> reference, which can happen given that we did the folio_ref_unfreeze(). It can,
> >>>>> for example, be triggered by memory offlining or page migration.
> >>>>>
> >>>>> So we really want to not allow folio_isolate_lru() while we are still processing
> >>>>> the folio.
> >>>>
> >>>> Or we should defer adding split folios to LRU after unfreeze.
> >>>>
> >>>>>
> >>>>> What your patch does is, simply not add folios that we will drop from the page
> >>>>> cache to the LRU?
> >>>>>
> >>>>>
> >>>>> You should describe here how you are fixing it: "Let's fix it by..."
> >>> Yes. This commit would like to suggest to fix it by having the folio
> >>> skip the lru_add_split_folio
> >>
> >> Skipping it causes more issues like LRU counter mismatch, firing up bad_page()
> >> since PG_active, PG_unevictable, or MGLRU fields in ->flags.f could stay
> >> uncleared at page free time.
> >
> > OK, we should solve this issue.
> >>
> >>>>>
> >>>>>>
> >>>>>> [1]
> >>>>>> [ 22.486082] list_del corruption. next->prev should be fffffffec10e0ac8, but was dead000000000122. (next=fffffffec10e0a88)
> >>>>>> [ 22.486130] ------------[ cut here ]------------
> >>>>>> [ 22.486134] kernel BUG at lib/list_debug.c:67!
> >>>>>> [ 22.486141] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> >>>>>> [ 22.488502] Tainted: [W]=WARN, [O]=OOT_MODULE
> >>>>>> [ 22.488506] Hardware name: Spreadtrum UMS9230 1H10 SoC (DT)
> >>>>>> [ 22.488511] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >>>>>> [ 22.488517] pc : __list_del_entry_valid_or_report+0x14c/0x154
> >>>>>> [ 22.488531] lr : __list_del_entry_valid_or_report+0x14c/0x154
> >>>>>> [ 22.488539] sp : ffffffc08006b830
> >>>>>> [ 22.488542] x29: ffffffc08006b868 x28: 0000000000003020 x27: 0000000000000000
> >>>>>> [ 22.488553] x26: 0000000000000000 x25: 0000000000000004 x24: fffffffec10e0ac0
> >>>>>> [ 22.488564] x23: 00000000000000e8 x22: 0000000000000024 x21: dead000000000122
> >>>>>> [ 22.488574] x20: fffffffec10e0a88 x19: fffffffec10e0ac8 x18: ffffffc080061060
> >>>>>> [ 22.488585] x17: 20747562202c3863 x16: 6130653031636566 x15: 0000000000000058
> >>>>>> [ 22.488595] x14: 0000000000000004 x13: ffffff80f91e0000 x12: 0000000000000003
> >>>>>> [ 22.488605] x11: 0000000000000003 x10: 0000000000000001 x9 : ffe85721f0e25f00
> >>>>>> [ 22.488615] x8 : ffe85721f0e25f00 x7 : 0000000000000000 x6 : 6c65645f7473696c
> >>>>>> [ 22.488625] x5 : ffffffed39b23026 x4 : 0000000000000000 x3 : 0000000000000010
> >>>>>> [ 22.488636] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 000000000000006d
> >>>>>> [ 22.488647] Call trace:
> >>>>>> [ 22.488651] __list_del_entry_valid_or_report+0x14c/0x154 (P)
> >>>>>> [ 22.488661] __folio_put+0x2bc/0x434
> >>>>>> [ 22.488670] folio_put+0x28/0x58
> >>>>>> [ 22.488678] do_garbage_collect+0x1a34/0x2584
> >>>>>> [ 22.488689] f2fs_gc+0x230/0x9b4
> >>>>>> [ 22.488697] f2fs_fallocate+0xb90/0xdf4
> >>>>>> [ 22.488706] vfs_fallocate+0x1b4/0x2bc
> >>>>>> [ 22.488716] __arm64_sys_fallocate+0x44/0x78
> >>>>>> [ 22.488725] invoke_syscall+0x58/0xe4
> >>>>>> [ 22.488732] do_el0_svc+0x48/0xdc
> >>>>>> [ 22.488739] el0_svc+0x3c/0x98
> >>>>>> [ 22.488747] el0t_64_sync_handler+0x20/0x130
> >>>>>> [ 22.488754] el0t_64_sync+0x1c4/0x1c8
> >>>>>>
> >>>>>> [2]
> >>>>>> CPU0 (f2fs GC) CPU1 (split_folio_to_order) CPU2 (folio_isolate_lru)
> >>>>>>
> >>>>>> F: pagecache refs = n
> >>>>>> F: extra refs = GC + split
> >>>>>> F: PG_lru set
> >>>>>> move_data_block()
> >>>>>> folio = f2fs_grab_cache_folio(F)
> >>>>>> ...
> >>>>>> __folio_set_dropbehind(F)
> >>>>>> folio_unlock(F)
> >>>>>> folio_end_dropbehind(F)
> >>>>>> folio_unmap_invalidate(F)
> >>>>>> __filemap_remove_folio(F)
> >>>>>> folio_put_refs(F, n)
> >>>>>> folio_put(F)
> >>>>>> split_folio_to_order(F)
> >>>>>> folio_ref_freeze(F, 1)
> >>>>>> ...
> >>>>>> lru_add_split_folio(T)
> >>>>>> list_add_tail(&T->lru, &F->lru)
> >>>>>> folio_set_lru(T)
> >>>>>> __filemap_remove_folio(T)
> >>>>>> folio_put_refs(T, 1)
> >>>>>> /* T refcount == 1, PageLRU set */
> >>>>>> free_folio_and_swap_cache(T)
> >>>>>> folio_put(T)
> >>>>>> /* refcount: 1 -> 0 */
> >>>>>> folio_isolate_lru(T)
> >>>>
> >>>> If refcount is 0 at this point, VM_BUG_ON_FOLIO(!folio_ref_count(folio), folio) in
> >>>> folio_isolate_lru() would be triggered. Maybe we could just return false in that case.
> >>> No, isolate caller will grab one refcount.
> >>
> >> As I said in another email, isolate caller cannot grab a refcount when folio refcount
> >> is 0.
> >
> > pin_user_pages*(..., FOLL_LONGTERM)
> > └─ __gup_longterm_locked() [gup.c:2465]
> > │ ├─ follow_page_pte() [gup.c:802]
> > │ │ └─ try_grab_folio() [gup.c:858]
> > if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
> > return -ENOMEM;
> >
> > // Could __folio_split->folio_put race here?
> > if (flags & FOLL_GET)
> > folio_ref_add(folio, refs);
> > └─ check_and_migrate_movable_pages() [gup.c:2490]
> > └─ collect_longterm_unpinnable_folios() [gup.c:2391]
> > └─ └─if (!folio_isolate_lru(folio))
> >
> > Could the __folio_split race in the above scenario? It looks like
> > try_grab_folio set the refcount without using atomic operation.
>
> folio_ref_add() used by try_grab_folio() is an atomic op.
> Which refcount change is not atomic here?
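What I mean by atomic is that folio_try_get() is implemented with
atomic_add_unless(), so the zero check and the increment are a single
operation, while try_grab_folio() uses the sequence below, which leaves
a window between the check and the add for __folio_split() to race
with it. Right?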
if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
....
if (flags & FOLL_GET)
folio_ref_add(folio, refs);
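
To make the difference between the two patterns concrete, here is a minimal
userspace sketch. It is an assumption-laden model, not kernel code: C11
atomics stand in for the kernel's atomic_t helpers, and struct fake_folio,
ref_try_get() and ref_check_then_add() are hypothetical names. Whether the
window is reachable in the kernel depends on which references the caller
already holds, which the sketch does not model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a folio refcount. */
struct fake_folio {
    atomic_int refcount;
};

/*
 * Models the folio_try_get() pattern: add 1 unless the count is 0,
 * with the check and the increment done as one compare-and-swap.
 */
static bool ref_try_get(struct fake_folio *f)
{
    int old = atomic_load(&f->refcount);

    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/*
 * Models the check-then-add pattern: the check and the add are two
 * separate atomic operations, so the count may change in between.
 */
static bool ref_check_then_add(struct fake_folio *f, int refs)
{
    assert(refs > 0);
    if (atomic_load(&f->refcount) <= 0)
        return false;
    /* A concurrent final put could drop the count to 0 right here. */
    atomic_fetch_add(&f->refcount, refs);
    return true;
}
```

In the first helper a count of 0 can never be raised again; in the second,
each step is atomic on its own but the pair is not.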
>
> In addition, who is GUPing f2fs folio?
Don't know yet.
>
> I think you need to find the actual f2fs code path instead of
> chasing theoretical code combinations.
The test case passes with the folio_end_dropbehind commit reverted,
which encourages us to believe this is the clue.
>
> >
> >> (from previous mail)
> >> Wait, if folio->mapping is NULL and folio is not anonymous,
> >> folio_check_splittable() returns false at the beginning of
> >> __folio_split(). So the split cannot happen.
> >
> > According to my understanding, the folio checked here is still big
> > folio which is locked and with folio->mapping set, right?
>
> But the provided trace says the folio is split after folio_end_dropbehind(F)
> and folio->mapping is NULL.
Please find more information from the coredump below. From the BUG_ON
output we know that the folio being list_del'ed is fffffffec096e440,
while its lru.next folio fffffffec096e480 is the one that was wrongly
freed to PCP without lruvec_del_folio [1]. We can also see that
'folio(0xfffffffec096e440)->lru.prev = fffffffec0f639c0', where
fffffffec0f639c0 is a folio at an isolated index in the page cache,
which looks like the result of the fallocate [3]. So is it possible
that the split happens before the fallocate, the folio then gets
truncated, and free_folio_and_swap_cache races with
folio_isolate_lru?
[1]
[ 22.339229] list_del corruption. next->prev should be fffffffec096e448, but was ffffff80f9791830. (next=fffffffec096e488)
struct page 0xfffffffec096e440 {
lru = {
next = 0xfffffffec096e488,
prev = 0xfffffffec096e408
[2]
fffffffec096e440 a5b91000 0 18 0 24 referenced,lru
fffffffec096e480 a5b92000 ffffff801e930481 73009e9 1 41028 uptodate,lru,owner_2,swapbacked
fffffffec096e4c0 a5b93000 ffffff801e930481 730033a 1 41028 uptodate,lru,owner_2,swapbacked
[3]
fffffffec33f9440
index: 76446 position: root/0/18/42/30
fffffffec00da9c0
index: 76448 position: root/0/18/42/32
fffffffec3ded040
index: 76449 position: root/0/18/42/33
fffffffec0f639c0
index: 6188581 position: root/23/38/56/37
fffffffec0f63a00
index: 6188853 position: root/23/38/60/53
fffffffec0f63a40
index: 6188854 position: root/23/38/60/54
[4]
CPU0 (f2fs GC)                  CPU1 (split_folio_to_order)        CPU2 (folio_isolate_lru)

                                split_folio_to_order(F)
                                folio_ref_freeze(F, 1)
                                ...
                                lru_add_split_folio(T)
                                list_add_tail(&T->lru, &F->lru)
                                folio_set_lru(T)
                                __filemap_remove_folio(T)
                                folio_put_refs(T, 1)
                                folio_unlock(new_folio);
move_data_block()
folio = f2fs_grab_cache_folio(F)
...
__folio_set_dropbehind(F)
folio_unlock(F)
folio_end_dropbehind(F)
folio_unmap_invalidate(F)
__filemap_remove_folio(F)
folio_put_refs(F, n)
folio_put(F)
                                /* T refcount == 1, PageLRU set */
                                free_folio_and_swap_cache(T)
                                folio_put(T)
                                /* refcount: 1 -> 0 */
                                                                   folio_isolate_lru(T)
                                                                   folio_test_clear_lru(T)
                                __folio_put(T)
                                __page_cache_release(T)
                                folio_test_lru(T) == false
                                /* skip lruvec_del_folio(T) */
                                free_frozen_pages(T)
                                                                   folio_get(T)
                                                                   lruvec_del_folio(T)
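
The suspected ordering can be replayed deterministically in userspace. The
sketch below is only a single-threaded model of the interleaving in [4], up
to the free_frozen_pages(T) step; struct fake_page and all helper names are
hypothetical, and the decrement and the release of the final put are folded
into one step. It shows that if PG_lru is cleared by the isolate side before
the release path tests it, the page is freed while still linked on the list.

```c
#include <assert.h>
#include <stdbool.h>

struct list_node { struct list_node *next, *prev; };

/* Hypothetical stand-in for the tail page T. */
struct fake_page {
    struct list_node lru;
    int refcount;
    bool pg_lru;   /* models PG_lru */
    bool freed;    /* models "returned to the allocator" */
};

static void list_init(struct list_node *head)
{
    head->next = head->prev = head;
}

static void list_add_tail_node(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

static void list_del_node(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* CPU2 step: the isolate side test-and-clears PG_lru first. */
static bool isolate_test_clear_lru(struct fake_page *p)
{
    bool was_set = p->pg_lru;

    p->pg_lru = false;
    return was_set;
}

/* CPU1 step: final put; unlinks from the LRU only if PG_lru is set. */
static void final_put(struct fake_page *p)
{
    assert(p->refcount > 0);
    if (--p->refcount != 0)
        return;
    if (p->pg_lru) {
        list_del_node(&p->lru);
        p->pg_lru = false;
    }
    p->freed = true;
}

/* Returns true if T ends up freed while still linked on the LRU. */
static bool replay(bool isolate_races)
{
    struct list_node lru_head;
    struct fake_page t = { .refcount = 1, .pg_lru = true, .freed = false };

    list_init(&lru_head);
    list_add_tail_node(&t.lru, &lru_head);

    if (isolate_races)
        isolate_test_clear_lru(&t);  /* CPU2 runs before the release check */
    final_put(&t);                   /* CPU1 drops the last reference */

    return t.freed && lru_head.next == &t.lru;
}
```

Calling replay(true) leaves the freed page on the list, while replay(false)
unlinks it before freeing. The model says nothing about whether the isolate
side can legitimately reach that point in the kernel.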
>
> >>
> >>>>
> >>>>>> folio_test_clear_lru(T)
> >>>>>> __folio_put(T)
> >>>>>> __page_cache_release(T)
> >>>>>> folio_test_lru(T) == false
> >>>>>> /* skip lruvec_del_folio(T) */
> >>>>>> free_frozen_pages(T)
> >>>>>> folio_get(T)
> >>>>>> lruvec_del_folio(T)
> >>>>
> >>>> But in CPU2 (folio_isolate_lru), lruvec_del_folio(T) should remove T from LRU list.
> >>>>
> >>>>>> later:
> >>>>>> list_del(adjacent->lru)
> >>>>>> next == &T->lru
> >>>>>> next->prev == LIST_POISON / PCP freelist
> >>>>>> BUG
> >>>>>>
> >>>>
> >>>> Why does CPU0 still see the stale link from adjacent?
> >>> The staled link should be from LRU since the folio never be deleted from lru.
> >>>>
> >>>>>> Assisted-by: Cursor:claude-opus-4-8
> >>>>>> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
> >>>>>
> >>>>> I'm wondering if this has been broken the whole time, or if some rework allowed
> >>>>> this to trigger.
> >>> This issue is from AOSP with v6.18 which just supports big folio in
> >>> f2fs. Besides, it is triggered by the timing of f2fs's partition get
> >>> almost full during the test case of filling f2fs's partition(should be
> >>> the trigger factor of f2fs's gc which enroll truncate thing)
> >>
> >> Are you able to reproduce it with other FSes supporting large folio?
> >
> > Sorry, I can't so far since only f2fs has gc in the Android system.
>
> Have you checked f2fs gc code to make sure it is working correctly?
> BTW, what makes you think the issue is related to folio_split()?
> Can you elaborate more on your investigation?
>
> Thanks.
>
>
> --
> Best Regards,
> Yan, Zi