On Tue, Apr 2, 2024 at 8:58 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
On 01.04.24 10:17, zhaoyang.huang wrote:
From: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
A VM_BUG_ON in step 9 of [1] could happen because the refcount is dropped
improperly during read_pages()->readahead_folio()->folio_put().
This behaviour was introduced by commit 9fd472af84ab ("mm: improve cleanup
when ->readpages doesn't process all pages").
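For reference, here is a simplified sketch of the cleanup loop that commit
added to read_pages() (paraphrased from mm/readahead.c, not a verbatim
quote): readahead_folio() hands back each leftover folio and drops one
reference on it, and the loop then removes the folio from the page cache
and drops the temporary reference it took.

	/*
	 * Sketch of the read_pages() cleanup when ->readahead() stops
	 * early (simplified; bookkeeping of ra->size etc. omitted).
	 */
	struct folio *folio;

	while ((folio = readahead_folio(rac)) != NULL) {
		/* readahead_folio() already did folio_put() on this folio */
		folio_get(folio);
		filemap_remove_folio(folio);	/* drops the page-cache ref */
		folio_unlock(folio);
		folio_put(folio);		/* drops the ref taken above */
	}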
Key steps of [1] in brief:
2'. Thread_truncate gets the folio into its local fbatch via
find_get_entries in step 2.
7'. The last remaining refcount is not, as expected, the one from
alloc_pages, but the one from thread_truncate's local fbatch in step 7.
8'. Thread_reclaim succeeds in isolating the folio because of the wrong
refcount (wrong in meaning, not in value) in step 8.
9'. Thread_truncate hits the VM_BUG_ON in step 9 (see the sketch below
this list).
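The VM_BUG_ON in step 9 is presumably the refcount sanity check in
put_page_testzero() (include/linux/page_ref.h), which folio_put() and
folio_batch_release() end up in once thread_truncate finally drops its
fbatch reference; paraphrased below. Which exact check fires is my
assumption, not something confirmed in this thread.

	/* Paraphrased from include/linux/page_ref.h */
	static inline int put_page_testzero(struct page *page)
	{
		/* fires if the refcount is already 0 when we try to drop it */
		VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
		return page_ref_dec_and_test(page);
	}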
[1]
Thread_readahead:
0. folio = filemap_alloc_folio(gfp_mask, 0);
(refcount 1: alloc_pages)
1. ret = filemap_add_folio(mapping, folio, index + i, gfp_mask);
(refcount 2: alloc_pages, page_cache)
Thread_truncate:
2. folio = find_get_entries(&fbatch_truncate);
(refcount 3: alloc_pages, page_cache, fbatch_truncate)
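For context, the third reference in step 2 comes from the folio_batch that
the truncate path fills; a minimal sketch of that pattern follows
(simplified from what truncate_inode_pages_range() does; locking, mapping
checks and exceptional-entry handling omitted, and the exact signatures are
an assumption based on current kernels):

	struct folio_batch fbatch_truncate;
	pgoff_t indices[PAGEVEC_SIZE];
	pgoff_t index = start;
	unsigned int i;

	folio_batch_init(&fbatch_truncate);
	while (find_get_entries(mapping, &index, end, &fbatch_truncate, indices)) {
		/* each folio in the batch carries an extra reference here */
		for (i = 0; i < folio_batch_count(&fbatch_truncate); i++)
			truncate_inode_folio(mapping, fbatch_truncate.folios[i]);
		/* the batch reference is only dropped here (step 9 of [1]) */
		folio_batch_release(&fbatch_truncate);
	}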
Something that would help here is an actual reproducer that triggers this
issue.

To me, it's unclear at this point if we are talking about an actual
issue or a theoretical issue?

Thanks for feedback. The above callstack is a theoretical issue so far,
which arose from an ongoing analysis of a practical livelock issue
generated by folio_try_get_rcu, which is related to an abnormal folio
refcount state. So do you think this callstack makes sense?
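For context on the livelock mentioned above: folio_try_get_rcu() is used in
the RCU page-cache lookup path, where a failure to take a reference makes
the lookup retry. The sketch below is a paraphrase of that retry pattern
(roughly what filemap_get_entry() in mm/filemap.c does), showing why a
folio whose refcount is stuck at an abnormal value while still present in
the xarray can make the lookup loop indefinitely. It is an illustration of
the pattern, not a quote of the exact code.

	/* simplified RCU lookup retry pattern (see filemap_get_entry()) */
	rcu_read_lock();
repeat:
	xas_reset(&xas);
	folio = xas_load(&xas);
	if (!folio || xa_is_value(folio))
		goto out;

	if (!folio_try_get_rcu(folio))
		goto repeat;		/* refcount was 0: keep retrying */

	if (unlikely(folio != xas_reload(&xas))) {
		folio_put(folio);	/* folio was replaced under us */
		goto repeat;
	}
out:
	rcu_read_unlock();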