Re: [PATCHv2] mm/huge_memory: do not add dropped split tail folios to LRU

From: David Hildenbrand (Arm)

Date: Mon Jun 15 2026 - 07:26:26 EST


On 6/13/26 01:46, Zhaoyang Huang wrote:
> On Sat, Jun 13, 2026 at 12:34 AM David Hildenbrand (Arm)
> <david@xxxxxxxxxx> wrote:
>>
>> On 6/12/26 04:34, zhaoyang.huang wrote:
>>> From: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
>>>
>>> Kernel panics keep being reported, especially when the f2fs partition
>>> is almost full. Our investigation shows that an f2fs page was freed to
>>> the buddy allocator without being deleted from the LRU, and that the
>>> root cause is the race in [2], which is introduced by this commit.
>>> We resolved the issue by reverting f2fs commit 9609dd704725 ("f2fs: remove
>>> non-uptodate folio from the page cache in move_data_block").
>>>
>>> There are three racing processes in this scenario; their main
>>> activities are described below. However, on further investigation of
>>> the code, I think there is a common race window for truncated folios
>>> between split_folio_to_order() and folio_isolate_lru(): the folios lose
>>> their page-cache references and retain only the transient reference of
>>> the split caller, so a folio can enter the free path and compete with
>>> the isolation process. This commit suggests keeping folios beyond EOF
>>> off the LRU.
>>>
>>> Split:
>>> split_folio_to_order() can split the big folio into individual pages and
>>> put the resulting subpages back on the LRU. For tail pages beyond EOF,
>>> split removes them from the page cache and drops their page-cache
>>> references. A tail page can then remain on the LRU with PG_lru set while
>>> holding only the split caller's temporary reference. When
>>> free_folio_and_swap_cache() drops that final reference, the page enters
>>> the final folio_put() release path.
>>>
>>> Truncate:
>>> The changed code in move_data_block() lets the GC path evict the tail-end
>>> folio from the page cache through folio_end_dropbehind(). Once
>>> folio_unmap_invalidate() removes the folio from mapping->i_pages, the
>>> page-cache references for all pages in the folio are dropped. The folio
>>> is then kept alive only by temporary external references, which allows a
>>> later split to operate on a folio whose subpages are no longer protected
>>> by page-cache references.
>>>
>>> Isolate:
>>> In parallel, folio_isolate_lru() can observe the same tail page with a
>>> non-zero refcount and PG_lru set. It clears PG_lru before taking its own
>>> reference. If this races with the final folio_put() from the split path,
>>> __folio_put() sees PG_lru already cleared and skips lruvec_del_folio().
>>> The page is then freed back to the allocator while its lru links are
>>> still present in the LRU list. A later LRU operation on a neighboring
>>> page detects the stale link and reports list corruption.
>>>
>>> [1]
>>> [ 22.486082] list_del corruption. next->prev should be fffffffec10e0ac8, but was dead000000000122. (next=fffffffec10e0a88)
>>> [ 22.486130] ------------[ cut here ]------------
>>> [ 22.486134] kernel BUG at lib/list_debug.c:67!
>>> [ 22.486141] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
>>> [ 22.488502] Tainted: [W]=WARN, [O]=OOT_MODULE
>>> [ 22.488506] Hardware name: Spreadtrum UMS9230 1H10 SoC (DT)
>>> [ 22.488511] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [ 22.488517] pc : __list_del_entry_valid_or_report+0x14c/0x154
>>> [ 22.488531] lr : __list_del_entry_valid_or_report+0x14c/0x154
>>> [ 22.488539] sp : ffffffc08006b830
>>> [ 22.488542] x29: ffffffc08006b868 x28: 0000000000003020 x27: 0000000000000000
>>> [ 22.488553] x26: 0000000000000000 x25: 0000000000000004 x24: fffffffec10e0ac0
>>> [ 22.488564] x23: 00000000000000e8 x22: 0000000000000024 x21: dead000000000122
>>> [ 22.488574] x20: fffffffec10e0a88 x19: fffffffec10e0ac8 x18: ffffffc080061060
>>> [ 22.488585] x17: 20747562202c3863 x16: 6130653031636566 x15: 0000000000000058
>>> [ 22.488595] x14: 0000000000000004 x13: ffffff80f91e0000 x12: 0000000000000003
>>> [ 22.488605] x11: 0000000000000003 x10: 0000000000000001 x9 : ffe85721f0e25f00
>>> [ 22.488615] x8 : ffe85721f0e25f00 x7 : 0000000000000000 x6 : 6c65645f7473696c
>>> [ 22.488625] x5 : ffffffed39b23026 x4 : 0000000000000000 x3 : 0000000000000010
>>> [ 22.488636] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 000000000000006d
>>> [ 22.488647] Call trace:
>>> [ 22.488651] __list_del_entry_valid_or_report+0x14c/0x154 (P)
>>> [ 22.488661] __folio_put+0x2bc/0x434
>>> [ 22.488670] folio_put+0x28/0x58
>>> [ 22.488678] do_garbage_collect+0x1a34/0x2584
>>> [ 22.488689] f2fs_gc+0x230/0x9b4
>>> [ 22.488697] f2fs_fallocate+0xb90/0xdf4
>>> [ 22.488706] vfs_fallocate+0x1b4/0x2bc
>>> [ 22.488716] __arm64_sys_fallocate+0x44/0x78
>>> [ 22.488725] invoke_syscall+0x58/0xe4
>>> [ 22.488732] do_el0_svc+0x48/0xdc
>>> [ 22.488739] el0_svc+0x3c/0x98
>>> [ 22.488747] el0t_64_sync_handler+0x20/0x130
>>> [ 22.488754] el0t_64_sync+0x1c4/0x1c8
>>>
>>> [2]
>>> *F: big folio before split
>>> *T: tail folio after split
>>> CPU0 (f2fs GC)              CPU1 (split_folio_to_order)   CPU2 (folio_isolate_lru)
>>> *F: pagecache refs = n
>>> *F: extra refs = split
>>> *F: PG_lru set, mapping != NULL
>>>                             split_folio_to_order(F)
>>>                             folio_ref_freeze(F, 1)
>>>                             ...
>>>                             lru_add_split_folio(T)
>>>                             list_add_tail(&T->lru, &F->lru)
>>>                             folio_set_lru(T)
>>>                             folio_unlock(T)
>>>                             /* T PageLRU set */
>>>
>>> *T: pagecache refs = 1
>>> *T: extra refs = GC + split
>>> *T: PG_lru set, mapping != NULL
>>>
>>> move_data_block()
>>> folio = f2fs_grab_cache_folio(T)
>>> ...
>>> __folio_set_dropbehind(T)
>>> folio_unlock(T)
>>> folio_end_dropbehind(T)
>>> folio_unmap_invalidate(T)
>>> __filemap_remove_folio(T)
>>> folio_put_refs(T, 1)
>>> folio_put(T)
>>>
>>> *T: pagecache refs = 0
>>> *T: extra refs = split
>>> *T: PG_lru set, mapping == NULL
>>>                             free_folio_and_swap_cache(T)
>>>                             folio_put_testzero(T)
>>>                             /* refcount: 1 -> 0 */
>>>
>>> *T: pagecache refs = 0
>>> *T: extra refs = isolate
>>> *T: PG_lru set, mapping == NULL
>>>                                                           folio_isolate_lru(T)
>>>                                                           folio_test_clear_lru(T)
>>>                             __folio_put(T)
>>>                             __page_cache_release(T)
>>>                             folio_test_lru(T) == false
>>>                             /* skip lruvec_del_folio(T) */
>>>                             free_frozen_pages(T)
>>>                                                           folio_get(T)
>>>                                                           lruvec_del_folio(T)
>>
>> What I am still struggling with:
>>
>> folio_isolate_lru() is documented to:
>>
>> "Must be called with an elevated refcount on the folio.".
>>
>> That is, there must be *something* keeping the folio alive before the
>> folio_get(). It could be a page table mapping with the PTL held (which does not
>> apply here).
> As I understand it, folio_end_dropbehind() could remove all of these
> things, such as page table mappings, right?
>>
>> Or it could be another prior folio_try_get() (e.g., do_migrate_range()).
>>
>> But essentially, folio_isolate_lru() cannot race with __folio_put(). It would be
>> fundamentally broken.
> Yes, I agree. So is the sequence below possible, and could it be deemed
> a race between folio_isolate_lru() and folio_put()?
>
> free_folio_and_swap_cache(T)
> folio_put_testzero(T)
> /* refcount: 1 -> 0 */
>
> *T: pagecache refs = 0
> *T: extra refs = isolate
> *T: PG_lru set, mapping == NULL
>
> folio_isolate_lru(T)

How?

Just take a look at folio_isolate_lru() callers:

mm/damon/paddr.c

-> We have a valid reference from damon_get_folio().

mm/gup.c

-> We have a valid reference from GUP.

mm/khugepaged.c

-> We have a valid reference either from the page table or from the pagecache.
Locks (PT lock, xarray lock) make sure that we cannot race.

mm/madvise.c

-> We have a valid reference from the page table.

mm/memory_hotplug.c

-> We have a valid reference from folio_try_get().

mm/mempolicy.c

-> We have a valid reference from the page table.

mm/migrate_device.c

-> We have a valid reference from the migration start code.


The only confusing bit is mm/memory-failure.c. I would assume that we end up
calling it only through identify_page_state() after having obtained a reference
through e.g., get_hwpoison_page().


So no, it doesn't make any sense. Note that we even have

VM_BUG_ON_FOLIO(!folio_ref_count(folio), folio);

in folio_isolate_lru().

If you can trigger that (enable CONFIG_DEBUG_VM), then
something is fundamentally flawed there (e.g., refcount mis-balancing in your
environment).
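
To illustrate the contract, here is a minimal userspace sketch. It is not kernel
code: the model_* names and struct model_folio are hypothetical stand-ins for
folio_try_get(), folio_put() and folio_isolate_lru(), and the model ignores
locking and the lruvec entirely. It only shows that a caller which obtained its
own reference first can never observe a refcount of zero inside the isolation
step, and that a try-get on an already freed object fails.

```c
/* Userspace model only; every model_* name is hypothetical, not a kernel API. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct model_folio {
	atomic_int refcount;	/* stands in for the folio refcount */
	atomic_bool on_lru;	/* stands in for PG_lru */
	bool freed;		/* set once the last reference is dropped */
};

static void model_init(struct model_folio *f, int refs)
{
	atomic_init(&f->refcount, refs);
	atomic_init(&f->on_lru, true);
	f->freed = false;
}

/* Modeled after folio_try_get(): succeeds only while the refcount is non-zero. */
static bool model_try_get(struct model_folio *f)
{
	int old = atomic_load(&f->refcount);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Modeled after folio_put(): the final put takes the object off the LRU and frees it. */
static void model_put(struct model_folio *f)
{
	if (atomic_fetch_sub(&f->refcount, 1) == 1) {
		atomic_store(&f->on_lru, false);
		f->freed = true;
	}
}

/* Modeled after folio_isolate_lru(): the caller must already hold a reference. */
static bool model_isolate_lru(struct model_folio *f)
{
	/* Stands in for VM_BUG_ON_FOLIO(!folio_ref_count(folio), folio). */
	assert(atomic_load(&f->refcount) > 0);

	if (!atomic_exchange(&f->on_lru, false))
		return false;	/* somebody else already isolated it */
	atomic_fetch_add(&f->refcount, 1);	/* isolation's own reference */
	return true;
}
```

In this model, a caller does model_try_get() first and only then
model_isolate_lru(). A concurrent model_put() from another holder can then
never be the final one, because the caller's reference is still outstanding.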


But having this be an existing upstream problem is *unlikely*.

In short: folio_isolate_lru() cannot possibly race with folio freeing unless
something is fundamentally broken.

--
Cheers,

David