Re: [PATCHv2] mm/huge_memory: do not add dropped split tail folios to LRU

From: Zi Yan

Date: Mon Jun 15 2026 - 12:01:01 EST


On 12 Jun 2026, at 19:55, Zhaoyang Huang wrote:

> On Sat, Jun 13, 2026 at 7:46 AM Zi Yan <ziy@xxxxxxxxxx> wrote:
>>
>> On 12 Jun 2026, at 19:38, Zhaoyang Huang wrote:
>>
>>> On Fri, Jun 12, 2026 at 10:12 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
>>>>
>>>> On 11 Jun 2026, at 22:34, zhaoyang.huang wrote:
>>>>
>>>>> From: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
>>>>>
>>>>> Kernel panics keep being reported, especially when the f2fs
>>>>> partition is almost full. Our investigation shows that an f2fs page
>>>>> was freed to the buddy allocator without being deleted from the LRU;
>>>>> the root cause is the race shown in [2], which was introduced by f2fs
>>>>> commit 9609dd704725 ("f2fs: remove non-uptodate folio from the page
>>>>> cache in move_data_block"). Reverting that commit resolves the issue
>>>>> for us.
>>>>>
>>>>> Three contexts race in this scenario; their main activities are
>>>>> described below. However, after further investigation of the code, I
>>>>> think there is a common race window for truncated folios between
>>>>> split_folio_to_order() and folio_isolate_lru(): the folios lose their
>>>>> page-cache references and keep only the split caller's transient
>>>>> reference, so a folio can enter the free path and race with the
>>>>> isolation path. This commit proposes keeping folios beyond EOF off
>>>>> the LRU.
>>>>>
>>>>> Split:
>>>>> split_folio_to_order() can split the big folio into individual pages and
>>>>> put the resulting subpages back on the LRU. For tail pages beyond EOF,
>>>>> split removes them from the page cache and drops their page-cache
>>>>> references. A tail page can then remain on the LRU with PG_lru set while
>>>>> holding only the split caller's temporary reference. When
>>>>> free_folio_and_swap_cache() drops that final reference, the page enters
>>>>> the final folio_put() release path.
>>>>>
>>>>> Truncate:
>>>>> The changed code in move_data_block() lets the GC path evict the tail-end
>>>>> folio from the page cache through folio_end_dropbehind(). Once
>>>>> folio_unmap_invalidate() removes the folio from mapping->i_pages, the
>>>>> page-cache references for all pages in the folio are dropped. The folio
>>>>> is then kept alive only by temporary external references, which allows a
>>>>> later split to operate on a folio whose subpages are no longer protected
>>>>> by page-cache references.
>>>>>
>>>>> Isolate:
>>>>> In parallel, folio_isolate_lru() can observe the same tail page with a
>>>>> non-zero refcount and PG_lru set. It clears PG_lru before taking its own
>>>>> reference. If this races with the final folio_put() from the split path,
>>>>> __folio_put() sees PG_lru already cleared and skips lruvec_del_folio().
>>>>> The page is then freed back to the allocator while its lru links are
>>>>> still present in the LRU list. A later LRU operation on a neighboring
>>>>> page detects the stale link and reports list corruption.
>>>>>
>>>>> [1]
>>>>> [ 22.486082] list_del corruption. next->prev should be fffffffec10e0ac8, but was dead000000000122. (next=fffffffec10e0a88)
>>>>> [ 22.486130] ------------[ cut here ]------------
>>>>> [ 22.486134] kernel BUG at lib/list_debug.c:67!
>>>>> [ 22.486141] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
>>>>> [ 22.488502] Tainted: [W]=WARN, [O]=OOT_MODULE
>>>>> [ 22.488506] Hardware name: Spreadtrum UMS9230 1H10 SoC (DT)
>>>>> [ 22.488511] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>> [ 22.488517] pc : __list_del_entry_valid_or_report+0x14c/0x154
>>>>> [ 22.488531] lr : __list_del_entry_valid_or_report+0x14c/0x154
>>>>> [ 22.488539] sp : ffffffc08006b830
>>>>> [ 22.488542] x29: ffffffc08006b868 x28: 0000000000003020 x27: 0000000000000000
>>>>> [ 22.488553] x26: 0000000000000000 x25: 0000000000000004 x24: fffffffec10e0ac0
>>>>> [ 22.488564] x23: 00000000000000e8 x22: 0000000000000024 x21: dead000000000122
>>>>> [ 22.488574] x20: fffffffec10e0a88 x19: fffffffec10e0ac8 x18: ffffffc080061060
>>>>> [ 22.488585] x17: 20747562202c3863 x16: 6130653031636566 x15: 0000000000000058
>>>>> [ 22.488595] x14: 0000000000000004 x13: ffffff80f91e0000 x12: 0000000000000003
>>>>> [ 22.488605] x11: 0000000000000003 x10: 0000000000000001 x9 : ffe85721f0e25f00
>>>>> [ 22.488615] x8 : ffe85721f0e25f00 x7 : 0000000000000000 x6 : 6c65645f7473696c
>>>>> [ 22.488625] x5 : ffffffed39b23026 x4 : 0000000000000000 x3 : 0000000000000010
>>>>> [ 22.488636] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 000000000000006d
>>>>> [ 22.488647] Call trace:
>>>>> [ 22.488651] __list_del_entry_valid_or_report+0x14c/0x154 (P)
>>>>> [ 22.488661] __folio_put+0x2bc/0x434
>>>>> [ 22.488670] folio_put+0x28/0x58
>>>>> [ 22.488678] do_garbage_collect+0x1a34/0x2584
>>>>> [ 22.488689] f2fs_gc+0x230/0x9b4
>>>>> [ 22.488697] f2fs_fallocate+0xb90/0xdf4
>>>>> [ 22.488706] vfs_fallocate+0x1b4/0x2bc
>>>>> [ 22.488716] __arm64_sys_fallocate+0x44/0x78
>>>>> [ 22.488725] invoke_syscall+0x58/0xe4
>>>>> [ 22.488732] do_el0_svc+0x48/0xdc
>>>>> [ 22.488739] el0_svc+0x3c/0x98
>>>>> [ 22.488747] el0t_64_sync_handler+0x20/0x130
>>>>> [ 22.488754] el0t_64_sync+0x1c4/0x1c8
>>>>>
>>>>> [2]
>>>>> *F: big folio before split
>>>>> *T: tail folio after split
>>>>> CPU0 (f2fs GC) CPU1 (split_folio_to_order) CPU2 (folio_isolate_lru)
>>>>> *F: pagecache refs = n
>>>>> *F: extra refs = split
>>>>> *F: PG_lru set, mapping != NULL
>>>>> split_folio_to_order(F)
>>>>> folio_ref_freeze(F, 1)
>>>>> ...
>>>>> lru_add_split_folio(T)
>>>>> list_add_tail(&T->lru, &F->lru)
>>>>> folio_set_lru(T)
>>>>> folio_unlock(T)
>>>>> /* T PageLRU set */
>>>>>
>>>>> *T: pagecache refs = 1
>>>>> *T: extra refs = GC + split
>>>>> *T: PG_lru set, mapping != NULL
>>>>>
>>>>> move_data_block()
>>>>> folio = f2fs_grab_cache_folio(T)
>>>>
>>>> Getting T from f2fs_grab_cache_folio() via __filemap_get_folio() should
>>>> be impossible, since the xarray should contain either the original
>>>> folio F or after-split folios that are not beyond EOF. But it reminds
>>>> me of a bug I fixed recently.
>>> The folio being beyond EOF is caused by the subsequent
>>> folio_end_dropbehind(), right?
>>
>> To be specific, __filemap_get_folio() can only see:
>> 1. the original folio before the split, because only the original folio
>> is present in the xarray as a multi-index entry;
>>
>> 2. after-split folios that are not EOF, because EOF folios are not added
>> to the xarray[1] for __filemap_get_folio() to get.
>>
>> That is why I said getting EOF tail folios from __filemap_get_folio()
>> is impossible.
>>
>> [1] https://elixir.bootlin.com/linux/v7.0.12/source/mm/huge_memory.c#L3884
> Understood. Could the truncation of the file happen after the split,
> since the call stack starts from vfs_fallocate()?

Can you elaborate on which code path you mean by "truncate"?
Once the split completes, the after-split folios have been stored in the
xarray, so whatever happens afterwards has nothing to do with
folio_split() anymore. I am not sure what issue you are trying to
describe here.
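
To make the point about the xarray concrete, here is a toy userspace model
(plain C, not kernel code). Every name in it (toy_folio, toy_mapping,
toy_add_large, toy_split, toy_lookup, NR_SLOTS) is made up for illustration,
and it assumes a split down to order-0 with `end` being the first page index
beyond EOF. It only sketches which folio a lookup can return before and
after the split; it says nothing about refcounts or the LRU.

```c
#include <stddef.h>

#define NR_SLOTS 8	/* hypothetical: pages covered by the large folio */

struct toy_folio {
	size_t index;	/* first page index this folio covers */
	size_t nr;	/* number of pages it covers */
};

struct toy_mapping {
	/* Stand-in for mapping->i_pages: one slot per page index. */
	struct toy_folio *slots[NR_SLOTS];
};

/* Before the split: every index resolves to the one large folio. */
static void toy_add_large(struct toy_mapping *m, struct toy_folio *f)
{
	f->index = 0;
	f->nr = NR_SLOTS;
	for (size_t i = 0; i < NR_SLOTS; i++)
		m->slots[i] = f;
}

/*
 * Split into single-page pieces. Pieces below `end` replace the large
 * folio in their slot; pieces at or beyond `end` (EOF) are not stored,
 * so their slots become empty.
 * Simplification: the real split reuses the original folio as the first
 * piece; this model uses a separate array for all pieces.
 */
static void toy_split(struct toy_mapping *m,
		      struct toy_folio pieces[NR_SLOTS], size_t end)
{
	for (size_t i = 0; i < NR_SLOTS; i++) {
		pieces[i].index = i;
		pieces[i].nr = 1;
		m->slots[i] = (i < end) ? &pieces[i] : NULL;
	}
}

/* Lookup by page index; NULL means nothing is cached at that index. */
static struct toy_folio *toy_lookup(const struct toy_mapping *m, size_t index)
{
	return index < NR_SLOTS ? m->slots[index] : NULL;
}
```

With end = 3, toy_lookup() returns the original folio for every index
before toy_split(); afterwards it returns the per-page pieces for indexes
0-2 and NULL for indexes 3-7. That is the sense in which a lookup cannot
hand back an EOF tail folio.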


>
>>>>
>>>> Does your v6.18 kernel have this commit[1], which fixes an xarray issue
>>>> in folio_split()? If not, can you give it a try and see if the issue
>>>> goes away?
>>> Yes, this patch is on the tree.
>>>>
>>>>
>>>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v6.18.35&id=08b2b65c63bb26dbb2a4e2adc2ce96e2929b8b60
>>>>
>>>>> ...
>>>>> __folio_set_dropbehind(T)
>>>>> folio_unlock(T)
>>>>> folio_end_dropbehind(T)
>>>>> folio_unmap_invalidate(T)
>>>>> __filemap_remove_folio(T)
>>>>> folio_put_refs(T, 1)
>>>>> folio_put(T)
>>>>>
>>>>> *T: pagecache refs = 0
>>>>> *T: extra refs = split
>>>>> *T: PG_lru set, mapping == NULL
>>>>> free_folio_and_swap_cache(T)
>>>>> folio_put_testzero(T)
>>>>> /* refcount: 1 -> 0 */
>>>>>
>>>>> *T: pagecache refs = 0
>>>>> *T: extra refs = isolate
>>>>> *T: PG_lru set, mapping == NULL
>>>>> folio_isolate_lru(T)
>>>>> folio_test_clear_lru(T)
>>>>> __folio_put(T)
>>>>> __page_cache_release(T)
>>>>> folio_test_lru(T) == false
>>>>> /* skip lruvec_del_folio(T) */
>>>>> free_frozen_pages(T)
>>>>> folio_get(T)
>>>>> lruvec_del_folio(T)
>>>>> later:
>>>>> list_del(adjacent->lru)
>>>>> next == &T->lru
>>>>> next->prev == LIST_POISON / PCP freelist
>>>>> BUG
>>>>>
>>>>> Assisted-by: Cursor:claude-opus-4-8
>>>>> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
>>>>> ---
>>>>> patchv2: update codes to eliminate bad page status
>>>>> ---
>>>>> ---
>>>>> mm/huge_memory.c | 22 +++++++++++++++++++++-
>>>>> 1 file changed, 21 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>> index 970e077019b7..c24c12f71157 100644
>>>>> --- a/mm/huge_memory.c
>>>>> +++ b/mm/huge_memory.c
>>>>> @@ -3878,6 +3878,23 @@ static unsigned int folio_cache_ref_count(const struct folio *folio)
>>>>> return folio_nr_pages(folio);
>>>>> }
>>>>>
>>>>> +static void clear_dropped_split_folio_lru_flags(struct folio *folio)
>>>>> +{
>>>>> +	/*
>>>>> +	 * __split_folio_to_order() clones these LRU state bits from the
>>>>> +	 * original folio. A folio that is dropped instead of being added to
>>>>> +	 * the LRU will not pass through lruvec_del_folio() and
>>>>> +	 * __folio_clear_lru_flags(), so clear the cloned state before it is
>>>>> +	 * freed back to the page allocator.
>>>>> +	 */
>>>>> +	set_mask_bits(&folio->flags.f,
>>>>> +		      (1UL << PG_referenced) | (1UL << PG_active) |
>>>>> +		      (1UL << PG_workingset) |
>>>>> +		      (1UL << PG_unevictable) | __PG_MLOCKED |
>>>>> +		      LRU_GEN_MASK | LRU_REFS_MASK,
>>>>> +		      0);
>>>>> +}
>>>>> +
>>>>> static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int new_order,
>>>>> struct page *split_at, struct xa_state *xas,
>>>>> struct address_space *mapping, bool do_lru,
>>>>> @@ -3958,6 +3975,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
>>>>> for (new_folio = folio_next(folio); new_folio != end_folio;
>>>>> new_folio = next) {
>>>>> unsigned long nr_pages = folio_nr_pages(new_folio);
>>>>> + bool drop = mapping && new_folio->index >= end;
>>>>>
>>>>> next = folio_next(new_folio);
>>>>>
>>>>> @@ -3966,7 +3984,9 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
>>>>> folio_ref_unfreeze(new_folio,
>>>>> folio_cache_ref_count(new_folio) + 1);
>>>>>
>>>>> - if (do_lru)
>>>>> + if (drop)
>>>>> + clear_dropped_split_folio_lru_flags(new_folio);
>>>>> + else if (do_lru)
>>>>> lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>>
>>>>> /*
>>>>> --
>>>>> 2.25.1
>>>>
>>>>
>>>> Best Regards,
>>>> Yan, Zi
>>
>>
>> --
>> Best Regards,
>> Yan, Zi


Best Regards,
Yan, Zi