Re: [RFC PATCH] mm/huge_memory: do not add dropped split tail folios to LRU

From: Zi Yan

Date: Wed Jun 10 2026 - 21:50:24 EST

On 10 Jun 2026, at 21:19, Zhaoyang Huang wrote:

> On Thu, Jun 11, 2026 at 2:44 AM Zi Yan <ziy@xxxxxxxxxx> wrote:
>>
>> On 10 Jun 2026, at 13:25, Zi Yan wrote:
>>
>>> On 10 Jun 2026, at 10:38, Zi Yan wrote:
>>>
>>>> On 10 Jun 2026, at 8:50, David Hildenbrand (Arm) wrote:
>>>>
>>>>> On 6/10/26 14:05, zhaoyang.huang wrote:
>>>>>> From: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
>>>>>>
>>>>>> Kernel panics keep being reported, especially when the f2fs partition
>>>>>> is almost full. Our investigation found that an f2fs page was freed to
>>>>>> buddy without being deleted from the LRU, and the root cause is the race
>>>>>> shown in [2], which was introduced by this commit. We worked around the
>>>>>> issue by reverting f2fs commit 9609dd704725 ("f2fs: remove
>>>>>> non-uptodate folio from the page cache in move_data_block").
>>>>>
>>>>> But I assume, that other FSes can trigger this as well? Any insights?
>>>>>
>>>>>>
>>>>>> Three paths race in this scenario; their main activities are described
>>>>>> below. After further investigation of the code, however, I think there
>>>>>> is a common race window for truncated folios between
>>>>>> split_folio_to_order() and folio_isolate_lru(): the folios have lost
>>>>>> their page-cache refcount and hold only the transient reference of the
>>>>>> split caller, so a folio can enter the free path and race with the
>>>>>> isolation path. This commit proposes keeping folios beyond EOF off the
>>>>>> LRU.
>>>>>>
>>>>>> Truncate:
>>>>>> The changed code in move_data_block() lets the GC path evict the tail-end
>>>>>> folio from the page cache through folio_end_dropbehind(). Once
>>>>>> folio_unmap_invalidate() removes the folio from mapping->i_pages, the
>>>>>> page-cache references for all pages in the folio are dropped. The folio
>>>>>> is then kept alive only by temporary external references, which allows a
>>>>>> later split to operate on a folio whose subpages are no longer protected
>>>>>> by page-cache references.
>>>>>>
>>>>>> Split:
>>>>>> After the page-cache references are gone, split_folio_to_order() can
>>>>>> split the big folio into individual pages and put the resulting subpages
>>>>>> back on the LRU. For tail pages beyond EOF, split removes them from the
>>>>>> page cache and drops their page-cache references. A tail page can then
>>>>>> remain on the LRU with PG_lru set while holding only the split caller's
>>>>>> temporary reference. When free_folio_and_swap_cache() drops that final
>>>>>> reference, the page enters the final folio_put() release path.
>>>>>>
>>>>>> Isolate:
>>>>>> In parallel, folio_isolate_lru() can observe the same tail page with a
>>>>>> non-zero refcount and PG_lru set. It clears PG_lru before taking its own
>>>>>> reference. If this races with the final folio_put() from the split path,
>>>>>> __folio_put() sees PG_lru already cleared and skips lruvec_del_folio().
>>>>>> The page is then freed back to the allocator while its lru links are
>>>>>> still present in the LRU list. A later LRU operation on a neighboring
>>>>>> page detects the stale link and reports list corruption.
>>
>> Something is wrong here with the caller of folio_isolate_lru(), since
>> folio_isolate_lru() requires the caller to take an elevated refcount.
>> This means when entering folio_isolate_lru(), the EOF folio should have
>> at least refcount == 2, 1 from folio_split(), 1 from the caller of
>> folio_isolate_lru(). This should prevent the EOF folio being freed
>> by the parallel __folio_put().
> This is one of the key points of this issue. Could the isolate caller
> grab the refcount (with folio_get(), not folio_try_get()) after the
> splitter's folio_put() -> folio_put_testzero()? If it can, then the panic
> happens.
>
> CPU1 (split_folio_to_order)          CPU2 (folio_isolate_lru)
>
> split_folio_to_order(F)
> folio_ref_freeze(F, 1)
> ...
> lru_add_split_folio(T)
> list_add_tail(&T->lru, &F->lru)
> folio_set_lru(T)
> __filemap_remove_folio(T)
> folio_put_refs(T, 1)
> /* T refcount == 1, PageLRU set */
> free_folio_and_swap_cache(T)
> folio_put(T)
> /* refcount: 1 -> 0 */
>
> //caller grab the refcount here?

Which caller calls folio_get() instead of folio_try_get()?
Claude does not find any caller doing folio_get() + folio_isolate_lru(),
except migrate_device_unmap(), which holds a page table
lock to make sure the folio has a mapping and non-zero ref.

Even folio_get() has
VM_BUG_ON_FOLIO(folio_ref_zero_or_close_to_overflow(folio), folio),
which catches a caller elevating a 0-refcount folio,
unless your runs did not have DEBUG_VM enabled.
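
To make the folio_get() vs folio_try_get() distinction concrete, here is a
minimal user-space sketch. It is not kernel code: the toy_* names are
hypothetical, and it only models the behaviour discussed above, namely that a
try-get style helper refuses to take a reference once the count has reached
zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for a folio: only the refcount is modeled (assumption). */
struct toy_folio {
	atomic_int refcount;
};

/* Models folio_put_testzero(): drop one ref, report whether it hit zero. */
static bool toy_put_testzero(struct toy_folio *f)
{
	return atomic_fetch_sub(&f->refcount, 1) == 1;
}

/* Models folio_try_get(): take a ref only if the count is non-zero. */
static bool toy_try_get(struct toy_folio *f)
{
	int old = atomic_load(&f->refcount);

	while (old != 0) {
		/* On failure 'old' is reloaded, and the loop re-checks for zero. */
		if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
			return true;
	}
	return false;
}
```

In this model, once toy_put_testzero() has taken the count from 1 to 0, a
later toy_try_get() fails, so a caller following that discipline cannot
revive the folio before calling the isolate step.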

>
> folio_isolate_lru(T)
>
> folio_test_clear_lru(T)
> __folio_put(T)
> __page_cache_release(T)
> folio_test_lru(T) == false
> /* skip lruvec_del_folio(T) */
> free_frozen_pages(T)
> folio_get(T)
>
> lruvec_del_folio(T)
>>
>> Hi Zhaoyang, can you elaborate on the folio_isolate_lru() caller?
> Sorry, no. The split and isolate steps are merely assumptions inferred
> from the symptoms.
>>
>> In addition (with the help of Claude), the race trace[2] below
>> looks invalid. It says split happens after folio_end_dropbehind(),
>> which sets folio->mapping to NULL, but __folio_split() returns -EBUSY
>> when folio->mapping is NULL in filemap_release_folio() check.
>> So the split cannot happen.
> Could folio_needs_release() return false?

Wait, if folio->mapping is NULL and folio is not anonymous,
folio_check_splittable() returns false at the beginning of
__folio_split(). So the split cannot happen.

>
> if (!folio_needs_release(folio))
> return true;
>
>>
>> Now I am not sure if the bug report is valid or not. At least for
>> folio_split() and folio_isolate_lru(), the race should not exist.
>> But let me know if I miss anything.
>>
>>>>>
>>>>> Complicated mess :(
>>>>>
>>>>> So, folio_isolate_lru() really only requires the caller to hold a folio
>>>>> reference, which can happen given that we did the folio_ref_unfreeze(). It can,
>>>>> for example, be triggered by memory offlining or page migration.
>>>>>
>>>>> So we really want to not allow folio_isolate_lru() while we are still processing
>>>>> the folio.
>>>>
>>>> Or we should defer adding split folios to the LRU until after unfreeze.
>>>>
>>>>>
>>>>> What your patch does is, simply not add folios that we will drop from the page
>>>>> cache to the LRU?
>>>>>
>>>>>
>>>>> You should describe here how you are fixing it: "Let's fix it by..."
>>>>>
>>>>>>
>>>>>> [1]
>>>>>> [ 22.486082] list_del corruption. next->prev should be fffffffec10e0ac8, but was dead000000000122. (next=fffffffec10e0a88)
>>>>>> [ 22.486130] ------------[ cut here ]------------
>>>>>> [ 22.486134] kernel BUG at lib/list_debug.c:67!
>>>>>> [ 22.486141] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
>>>>>> [ 22.488502] Tainted: [W]=WARN, [O]=OOT_MODULE
>>>>>> [ 22.488506] Hardware name: Spreadtrum UMS9230 1H10 SoC (DT)
>>>>>> [ 22.488511] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>>> [ 22.488517] pc : __list_del_entry_valid_or_report+0x14c/0x154
>>>>>> [ 22.488531] lr : __list_del_entry_valid_or_report+0x14c/0x154
>>>>>> [ 22.488539] sp : ffffffc08006b830
>>>>>> [ 22.488542] x29: ffffffc08006b868 x28: 0000000000003020 x27: 0000000000000000
>>>>>> [ 22.488553] x26: 0000000000000000 x25: 0000000000000004 x24: fffffffec10e0ac0
>>>>>> [ 22.488564] x23: 00000000000000e8 x22: 0000000000000024 x21: dead000000000122
>>>>>> [ 22.488574] x20: fffffffec10e0a88 x19: fffffffec10e0ac8 x18: ffffffc080061060
>>>>>> [ 22.488585] x17: 20747562202c3863 x16: 6130653031636566 x15: 0000000000000058
>>>>>> [ 22.488595] x14: 0000000000000004 x13: ffffff80f91e0000 x12: 0000000000000003
>>>>>> [ 22.488605] x11: 0000000000000003 x10: 0000000000000001 x9 : ffe85721f0e25f00
>>>>>> [ 22.488615] x8 : ffe85721f0e25f00 x7 : 0000000000000000 x6 : 6c65645f7473696c
>>>>>> [ 22.488625] x5 : ffffffed39b23026 x4 : 0000000000000000 x3 : 0000000000000010
>>>>>> [ 22.488636] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 000000000000006d
>>>>>> [ 22.488647] Call trace:
>>>>>> [ 22.488651] __list_del_entry_valid_or_report+0x14c/0x154 (P)
>>>>>> [ 22.488661] __folio_put+0x2bc/0x434
>>>>>> [ 22.488670] folio_put+0x28/0x58
>>>>>> [ 22.488678] do_garbage_collect+0x1a34/0x2584
>>>>>> [ 22.488689] f2fs_gc+0x230/0x9b4
>>>>>> [ 22.488697] f2fs_fallocate+0xb90/0xdf4
>>>>>> [ 22.488706] vfs_fallocate+0x1b4/0x2bc
>>>>>> [ 22.488716] __arm64_sys_fallocate+0x44/0x78
>>>>>> [ 22.488725] invoke_syscall+0x58/0xe4
>>>>>> [ 22.488732] do_el0_svc+0x48/0xdc
>>>>>> [ 22.488739] el0_svc+0x3c/0x98
>>>>>> [ 22.488747] el0t_64_sync_handler+0x20/0x130
>>>>>> [ 22.488754] el0t_64_sync+0x1c4/0x1c8
>>>>>>
>>>>>> [2]
>>>>>> CPU0 (f2fs GC) CPU1 (split_folio_to_order) CPU2 (folio_isolate_lru)
>>>>>>
>>>>>> F: pagecache refs = n
>>>>>> F: extra refs = GC + split
>>>>>> F: PG_lru set
>>>>>> move_data_block()
>>>>>> folio = f2fs_grab_cache_folio(F)
>>>>>> ...
>>>>>> __folio_set_dropbehind(F)
>>>>>> folio_unlock(F)
>>>>>> folio_end_dropbehind(F)
>>>>>> folio_unmap_invalidate(F)
>>>>>> __filemap_remove_folio(F)
>>>>>> folio_put_refs(F, n)
>>>>>> folio_put(F)
>>>>>> split_folio_to_order(F)
>>>>>> folio_ref_freeze(F, 1)
>>>>>> ...
>>>>>> lru_add_split_folio(T)
>>>>>> list_add_tail(&T->lru, &F->lru)
>>>>>> folio_set_lru(T)
>>>>>> __filemap_remove_folio(T)
>>>>>> folio_put_refs(T, 1)
>>>>>> /* T refcount == 1, PageLRU set */
>>>>>> free_folio_and_swap_cache(T)
>>>>>> folio_put(T)
>>>>>> /* refcount: 1 -> 0 */
>>>>>> folio_isolate_lru(T)
>>>>
>>>> If refcount is 0 at this point, VM_BUG_ON_FOLIO(!folio_ref_count(folio), folio) in
>>>> folio_isolate_lru() would be triggered. Maybe we could just return false in that case.
>>>>
>>>>>> folio_test_clear_lru(T)
>>>>>> __folio_put(T)
>>>>>> __page_cache_release(T)
>>>>>> folio_test_lru(T) == false
>>>>>> /* skip lruvec_del_folio(T) */
>>>>>> free_frozen_pages(T)
>>>>>> folio_get(T)
>>>>>> lruvec_del_folio(T)
>>>>
>>>> But in CPU2 (folio_isolate_lru), lruvec_del_folio(T) should remove T from LRU list.
>>>>
>>>>>> later:
>>>>>> list_del(adjacent->lru)
>>>>>> next == &T->lru
>>>>>> next->prev == LIST_POISON / PCP freelist
>>>>>> BUG
>>>>>>
>>>>
>>>> Why does CPU0 still see the stale link from adjacent?
>>>>
>>>>>> Assisted-by: Cursor:claude-opus-4-8
>>>>>> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
>>>>>
>>>>> I'm wondering if this has been broken the whole time, or if some rework allowed
>>>>> this to trigger.
>>>>>
>>>>> I assume the issue can be triggered for other FSes, and we want Fixes: + CC: stable?
>>>>>
>>>>> Looking into the history, I think we always unconditionally did the
>>>>> lru_add_split_folio()/lru_add_page_tail().
>>>>>
>>>>>> ---
>>>>>> mm/huge_memory.c | 2 +-
>>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index 970e077019b7..7465525a94a8 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -3966,7 +3966,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
>>>>>> folio_ref_unfreeze(new_folio,
>>>>>> folio_cache_ref_count(new_folio) + 1);
>>>>>>
>>>>>> - if (do_lru)
>>>>>> + if (do_lru && !(mapping && new_folio->index >= end))
>>>>>
>>>>> It might be clearer to write this as
>>>>>
>>>>> do_lru && (!mapping || new_folio->index < end)
>>>>>
>>>>> To match the page-cache check further below
>>>>>
>>>>> if (!mapping)
>>>>> continue
>>>>>
>>>>> ...
>>>>> if (new_folio->index < end)
>>>>> ...
>>>>>
>>>>>> lru_add_split_folio(folio, new_folio, lruvec, list);
>>>
>>> I talked to Claude and found an accounting issue with this. If the
>>> after-split folios beyond EOF are not put back on the LRU, they never go
>>> through lruvec_del_folio(), which decreases the NR_*_LRU counter along
>>> with removing the folio from the LRU, so the NR_*_LRU accounting becomes
>>> wrong. Note that the original folio stays on the LRU the whole time and
>>> the LRU counters are not modified during the split. After the split, the
>>> original folio is smaller, so the after-split folios need to be added
>>> back to the LRU to keep the LRU counters right. We will need to adjust
>>> LRU accounting for folios that fail the
>>> (!mapping || new_folio->index < end) check if we decide not to add them
>>> back to the LRU.
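
The accounting concern can be sketched with a small user-space model. This is
only an illustration under stated assumptions: the toy_* types and functions
are hypothetical, a single counter stands in for one NR_*_LRU statistic, and
the split is simplified to order-0 pieces.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy counter standing in for one NR_*_LRU statistic, in pages. */
struct toy_lruvec {
	long nr_pages;
};

/* Toy folio: only its size and whether it is on the LRU are modeled. */
struct toy_lru_folio {
	long nr_pages;
	bool on_lru;
};

/* Adding to the LRU bumps the counter by the folio size. */
static void toy_lru_add(struct toy_lruvec *l, struct toy_lru_folio *f)
{
	l->nr_pages += f->nr_pages;
	f->on_lru = true;
}

/* Removal only adjusts the counter for folios that are on the LRU. */
static void toy_lru_del(struct toy_lruvec *l, struct toy_lru_folio *f)
{
	if (!f->on_lru)
		return;
	l->nr_pages -= f->nr_pages;
	f->on_lru = false;
}

/*
 * Split a 'total'-page folio that is already on the LRU into order-0
 * pieces and free them all. Pieces with index >= end are linked to the
 * LRU only if add_beyond_eof is true. Returns the leftover counter value.
 */
static long toy_split_and_free(long total, long end, bool add_beyond_eof)
{
	struct toy_lruvec lruvec = { .nr_pages = 0 };
	struct toy_lru_folio head = { .nr_pages = total, .on_lru = false };
	long i;

	toy_lru_add(&lruvec, &head);	/* counter == total */
	head.nr_pages = 1;		/* split shrinks head, counter untouched */

	for (i = 1; i < total; i++) {
		struct toy_lru_folio tail = { .nr_pages = 1, .on_lru = false };

		/* Linking a split piece does not change the counter. */
		if (i < end || add_beyond_eof)
			tail.on_lru = true;
		toy_lru_del(&lruvec, &tail);
	}
	toy_lru_del(&lruvec, &head);
	return lruvec.nr_pages;
}
```

In this model, toy_split_and_free(8, 5, true) returns 0, while
toy_split_and_free(8, 5, false) returns 3: the three pieces beyond EOF are
never subtracted, which is the kind of drift described above.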
>>>
>>>
>>> Best Regards,
>>> Yan, Zi

--
Best Regards,
Yan, Zi