Re: [PATCHv2] mm/huge_memory: do not add dropped split tail folios to LRU

From: Zi Yan

Date: Fri Jun 12 2026 - 09:59:23 EST

On 11 Jun 2026, at 23:14, Zhaoyang Huang wrote:

> On Fri, Jun 12, 2026 at 10:46 AM Zi Yan <ziy@xxxxxxxxxx> wrote:
>>
>> On 11 Jun 2026, at 22:34, zhaoyang.huang wrote:
>>
>>> From: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
>>>
>>> Kernel panics keep being reported, especially when the f2fs partition is
>>> almost full. Our investigation shows that an f2fs page was freed to the
>>> buddy allocator without being deleted from the LRU, and that the root
>>> cause is the race shown in [2], which was introduced by this commit.
>>> We solved this issue by reverting f2fs commit 9609dd704725 ("f2fs: remove
>>> non-uptodate folio from the page cache in move_data_block").
>>>
>>> Three racing paths are involved in this scenario; their main activities
>>> are described below. On further investigation of the code, however, I
>>> think there is a common race window for truncated folios between
>>> split_folio_to_order() and folio_isolate_lru(): the folios have lost
>>> their page-cache references and hold only the transient reference of the
>>> split caller, so a folio can enter the free path and race with the
>>> isolation path. This commit proposes keeping folios beyond EOF off the
>>> LRU.
>>>
>>> Split:
>>> split_folio_to_order() can split the big folio into individual pages and
>>> put the resulting subpages back on the LRU. For tail pages beyond EOF,
>>> split removes them from the page cache and drops their page-cache
>>> references. A tail page can then remain on the LRU with PG_lru set while
>>> holding only the split caller's temporary reference. When
>>> free_folio_and_swap_cache() drops that final reference, the page enters
>>> the final folio_put() release path.
>>>
>>> Truncate:
>>> The changed code in move_data_block() lets the GC path evict the tail-end
>>> folio from the page cache through folio_end_dropbehind(). Once
>>> folio_unmap_invalidate() removes the folio from mapping->i_pages, the
>>> page-cache references for all pages in the folio are dropped. The folio
>>> is then kept alive only by temporary external references, which allows a
>>> later split to operate on a folio whose subpages are no longer protected
>>> by page-cache references.
>>>
>>> Isolate:
>>> In parallel, folio_isolate_lru() can observe the same tail page with a
>>> non-zero refcount and PG_lru set. It clears PG_lru before taking its own
>>> reference. If this races with the final folio_put() from the split path,
>>> __folio_put() sees PG_lru already cleared and skips lruvec_del_folio().
>>> The page is then freed back to the allocator while its lru links are
>>> still present in the LRU list. A later LRU operation on a neighboring
>>> page detects the stale link and reports list corruption.
>>>
>>> [1]
>>> [ 22.486082] list_del corruption. next->prev should be fffffffec10e0ac8, but was dead000000000122. (next=fffffffec10e0a88)
>>> [ 22.486130] ------------[ cut here ]------------
>>> [ 22.486134] kernel BUG at lib/list_debug.c:67!
>>> [ 22.486141] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
>>> [ 22.488502] Tainted: [W]=WARN, [O]=OOT_MODULE
>>> [ 22.488506] Hardware name: Spreadtrum UMS9230 1H10 SoC (DT)
>>> [ 22.488511] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [ 22.488517] pc : __list_del_entry_valid_or_report+0x14c/0x154
>>> [ 22.488531] lr : __list_del_entry_valid_or_report+0x14c/0x154
>>> [ 22.488539] sp : ffffffc08006b830
>>> [ 22.488542] x29: ffffffc08006b868 x28: 0000000000003020 x27: 0000000000000000
>>> [ 22.488553] x26: 0000000000000000 x25: 0000000000000004 x24: fffffffec10e0ac0
>>> [ 22.488564] x23: 00000000000000e8 x22: 0000000000000024 x21: dead000000000122
>>> [ 22.488574] x20: fffffffec10e0a88 x19: fffffffec10e0ac8 x18: ffffffc080061060
>>> [ 22.488585] x17: 20747562202c3863 x16: 6130653031636566 x15: 0000000000000058
>>> [ 22.488595] x14: 0000000000000004 x13: ffffff80f91e0000 x12: 0000000000000003
>>> [ 22.488605] x11: 0000000000000003 x10: 0000000000000001 x9 : ffe85721f0e25f00
>>> [ 22.488615] x8 : ffe85721f0e25f00 x7 : 0000000000000000 x6 : 6c65645f7473696c
>>> [ 22.488625] x5 : ffffffed39b23026 x4 : 0000000000000000 x3 : 0000000000000010
>>> [ 22.488636] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 000000000000006d
>>> [ 22.488647] Call trace:
>>> [ 22.488651] __list_del_entry_valid_or_report+0x14c/0x154 (P)
>>> [ 22.488661] __folio_put+0x2bc/0x434
>>> [ 22.488670] folio_put+0x28/0x58
>>> [ 22.488678] do_garbage_collect+0x1a34/0x2584
>>> [ 22.488689] f2fs_gc+0x230/0x9b4
>>> [ 22.488697] f2fs_fallocate+0xb90/0xdf4
>>> [ 22.488706] vfs_fallocate+0x1b4/0x2bc
>>> [ 22.488716] __arm64_sys_fallocate+0x44/0x78
>>> [ 22.488725] invoke_syscall+0x58/0xe4
>>> [ 22.488732] do_el0_svc+0x48/0xdc
>>> [ 22.488739] el0_svc+0x3c/0x98
>>> [ 22.488747] el0t_64_sync_handler+0x20/0x130
>>> [ 22.488754] el0t_64_sync+0x1c4/0x1c8
>>>
>>> [2]
>>> *F: big folio before split
>>> *T: tail folio after split
>>> CPU0 (f2fs GC)                  CPU1 (split_folio_to_order)           CPU2 (folio_isolate_lru)
>>>
>>> *F: pagecache refs = n
>>> *F: extra refs = split
>>> *F: PG_lru set, mapping != NULL
>>>                                 split_folio_to_order(F)
>>>                                   folio_ref_freeze(F, 1)
>>>                                   ...
>>>                                   lru_add_split_folio(T)
>>>                                     list_add_tail(&T->lru, &F->lru)
>>>                                     folio_set_lru(T)
>>>                                   folio_unlock(T)
>>>                                   /* T PageLRU set */
>>>
>>> *T: pagecache refs = 1
>>> *T: extra refs = GC + split
>>> *T: PG_lru set, mapping != NULL
>>>
>>> move_data_block()
>>>   folio = f2fs_grab_cache_folio(T)
>>>   ...
>>>   __folio_set_dropbehind(T)
>>>   folio_unlock(T)
>>>   folio_end_dropbehind(T)
>>>     folio_unmap_invalidate(T)
>>>       __filemap_remove_folio(T)
>>>       folio_put_refs(T, 1)
>>>   folio_put(T)
>>>
>>> *T: pagecache refs = 0
>>> *T: extra refs = split
>>> *T: PG_lru set, mapping == NULL
>>>                                 free_folio_and_swap_cache(T)
>>>                                   folio_put_testzero(T)
>>>                                   /* refcount: 1 -> 0 */
>>>
>>> *T: pagecache refs = 0
>>> *T: extra refs = isolate
>>> *T: PG_lru set, mapping == NULL
>>>                                                                       folio_isolate_lru(T)
>>>                                                                         folio_test_clear_lru(T)
>>>                                   __folio_put(T)
>>>                                     __page_cache_release(T)
>>>                                       folio_test_lru(T) == false
>>>                                       /* skip lruvec_del_folio(T) */
>>>                                     free_frozen_pages(T)
>>>                                                                         folio_get(T)
>>>                                                                         lruvec_del_folio(T)
>>> later:
>>>   list_del(adjacent->lru)
>>>     next == &T->lru
>>>     next->prev == LIST_POISON / PCP freelist
>>>     BUG
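
To make the interleaving in [2] easier to follow, here is a minimal single-threaded userspace sketch that replays it step by step. It is only a toy model under stated assumptions: toy_folio, toy_test_clear_lru(), toy_release() and replay() are hypothetical names, not kernel APIs, and the sketch replays the ordering claimed in the commit message without showing that the kernel can actually reach it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal doubly linked list, modelled loosely on the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = NULL;
}

/* Hypothetical stand-in for a folio: refcount, LRU flag, LRU linkage. */
struct toy_folio {
	int refcount;
	bool lru;	/* models PG_lru */
	bool freed;	/* set once handed back to the "allocator" */
	struct list_head link;
};

/* Models folio_test_clear_lru(): clear the flag, return its old value. */
static bool toy_test_clear_lru(struct toy_folio *f)
{
	bool old = f->lru;

	f->lru = false;
	return old;
}

/* Models the release path: unlink only if the LRU flag is still set. */
static void toy_release(struct toy_folio *f)
{
	if (f->lru) {
		list_del(&f->link);
		f->lru = false;
	}
	f->freed = true;
}

/* True when a freed folio is still reachable from the LRU list. */
static bool lru_has_freed_entry(struct list_head *lru)
{
	for (struct list_head *p = lru->next; p != lru; p = p->next) {
		struct toy_folio *f = (struct toy_folio *)
			((char *)p - offsetof(struct toy_folio, link));
		if (f->freed)
			return true;
	}
	return false;
}

/*
 * Replay the tail folio T holding only the split caller's reference.
 * isolate_first == true follows the ordering in [2]; false lets the
 * release path run before the isolation path touches the flag.
 */
static bool replay(bool isolate_first)
{
	struct list_head lru;
	struct toy_folio t = { .refcount = 1, .lru = true };

	list_init(&lru);
	list_add_tail(&t.link, &lru);

	bool last = (--t.refcount == 0);	/* CPU1: final put, 1 -> 0 */

	if (isolate_first)
		(void)toy_test_clear_lru(&t);	/* CPU2: clears flag first */
	if (last)
		toy_release(&t);		/* CPU1: may skip the unlink */
	if (!isolate_first)
		(void)toy_test_clear_lru(&t);	/* CPU2: finds flag cleared */

	return lru_has_freed_entry(&lru);
}
```

With replay(true) the release path finds the flag already cleared, skips the unlink, and the freed entry stays on the list. With replay(false) the release path unlinks first and the list ends up empty.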
>>>
>>> Assisted-by: Cursor:claude-opus-4-8
>>> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
>>> ---
>>> patchv2: update the code to eliminate the bad page status
>>
>> Please do not send a new version until we figure out the actual issue.
>> And this version still has other issues, like unbalancing LRU counters.
> OK, I just thought nobody would respond to my latest feedback, and I
> wanted to fix the bad page status.

Please be patient. Wait for several days if not a week.

>
> I would like to explain the current scenario in more detail, using the
> code below as an example. move_folios_to_lru() sets the folio's

Before you continue to point fingers at different code, please clarify
the context for this issue:

1. In the first paragraph, you said the issue is solved by reverting
commit 9609dd704725. Why do you think the issue comes from MM code?

2. You mentioned that the issue is from AOSP with a v6.18 kernel. Is
your patch targeting v6.18? Is it reproducible on an upstream kernel?

3. Can you provide your kernel configuration file?

4. Can you share a reproducer program for debugging?

Thanks.

> PG_lru before calling folio_put_testzero() to ensure that no page is
> freed without holding lruvec_lock, which is similar to this case.
> free_folio_and_swap_cache() within __folio_split() has no synchronization
> with folio_isolate_lru(), which makes the bug happen, right?
>
> static unsigned int move_folios_to_lru(struct lruvec *lruvec,
> 		struct list_head *list)
> {
> 	...
> 		/*
> 		 * The folio_set_lru needs to be kept here for list integrity.
> 		 * Otherwise:
> 		 *   #0 move_folios_to_lru             #1 release_pages
> 		 *   if (!folio_put_testzero())
> 		 *                                     if (folio_put_testzero())
> 		 *                                       !lru //skip lru_lock
> 		 *     folio_set_lru()
> 		 *     list_add(&folio->lru,)
> 		 *                                       list_add(&folio->lru,)
> 		 */
> 		folio_set_lru(folio);
>
> 		if (unlikely(folio_put_testzero(folio))) {
> 			__folio_clear_lru_flags(folio);
>
>>
>>> ---
>>> ---
>>> mm/huge_memory.c | 22 +++++++++++++++++++++-
>>> 1 file changed, 21 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 970e077019b7..c24c12f71157 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3878,6 +3878,23 @@ static unsigned int folio_cache_ref_count(const struct folio *folio)
>>>  	return folio_nr_pages(folio);
>>>  }
>>>
>>> +static void clear_dropped_split_folio_lru_flags(struct folio *folio)
>>> +{
>>> +	/*
>>> +	 * __split_folio_to_order() clones these LRU state bits from the
>>> +	 * original folio. A folio that is dropped instead of being added to
>>> +	 * the LRU will not pass through lruvec_del_folio() and
>>> +	 * __folio_clear_lru_flags(), so clear the cloned state before it is
>>> +	 * freed back to the page allocator.
>>> +	 */
>>> +	set_mask_bits(&folio->flags.f,
>>> +		      (1UL << PG_referenced) | (1UL << PG_active) |
>>> +		      (1UL << PG_workingset) |
>>> +		      (1UL << PG_unevictable) | __PG_MLOCKED |
>>> +		      LRU_GEN_MASK | LRU_REFS_MASK,
>>> +		      0);
>>> +}
>>> +
>>>  static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int new_order,
>>>  		struct page *split_at, struct xa_state *xas,
>>>  		struct address_space *mapping, bool do_lru,
>>> @@ -3958,6 +3975,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
>>>  		for (new_folio = folio_next(folio); new_folio != end_folio;
>>>  		     new_folio = next) {
>>>  			unsigned long nr_pages = folio_nr_pages(new_folio);
>>> +			bool drop = mapping && new_folio->index >= end;
>>>
>>>  			next = folio_next(new_folio);
>>>
>>> @@ -3966,7 +3984,9 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
>>>  			folio_ref_unfreeze(new_folio,
>>>  					   folio_cache_ref_count(new_folio) + 1);
>>>
>>> -			if (do_lru)
>>> +			if (drop)
>>> +				clear_dropped_split_folio_lru_flags(new_folio);
>>> +			else if (do_lru)
>>>  				lru_add_split_folio(folio, new_folio, lruvec, list);
>>>
>>>  			/*
>>> --
>>> 2.25.1
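
For readers unfamiliar with set_mask_bits(), which the helper in the patch relies on, the following is a rough userspace analogue using C11 atomics. It is a sketch only: toy_set_mask_bits(), toy_clear_lru_state() and the TOY_PG_* bit positions are hypothetical names and values chosen for illustration, and do not match the kernel's real flag layout.

```c
#include <stdatomic.h>

/* Hypothetical flag bits; the real PG_* positions differ. */
enum {
	TOY_PG_LOCKED     = 1UL << 0,
	TOY_PG_REFERENCED = 1UL << 1,
	TOY_PG_ACTIVE     = 1UL << 2,
	TOY_PG_WORKINGSET = 1UL << 3,
};

/*
 * Atomically replace the bits selected by mask with bits and return
 * the previous value, retrying if another thread changed the word.
 */
static unsigned long toy_set_mask_bits(_Atomic unsigned long *ptr,
				       unsigned long mask,
				       unsigned long bits)
{
	unsigned long old = atomic_load(ptr);
	unsigned long updated;

	do {
		updated = (old & ~mask) | bits;
		/* On failure, old is reloaded with the current value. */
	} while (!atomic_compare_exchange_weak(ptr, &old, updated));

	return old;
}

/* Clear the toy LRU state bits in one atomic step, leaving others alone. */
static unsigned long toy_clear_lru_state(_Atomic unsigned long *flags)
{
	return toy_set_mask_bits(flags,
				 TOY_PG_REFERENCED | TOY_PG_ACTIVE |
				 TOY_PG_WORKINGSET,
				 0);
}
```

Passing 0 as the new bits clears everything in the mask in a single atomic update, so unrelated bits such as the toy lock bit are preserved even if they change concurrently.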
>>
>>
>> --
>> Best Regards,
>> Yan, Zi

Best Regards,
Yan, Zi