Re: [PATCH v5 3/4] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan()

From: Qi Zheng

Date: Thu Nov 06 2025 - 21:51:32 EST




On 11/6/25 10:52 PM, Wei Yang wrote:
On Wed, Oct 15, 2025 at 02:35:32PM +0800, Qi Zheng wrote:
From: Muchun Song <songmuchun@xxxxxxxxxxxxx>

The maintenance of the folio->_deferred_list is intricate because it's
reused in a local list.

Here are some peculiarities:

1) When a folio is removed from its split queue and added to a local
on-stack list in deferred_split_scan(), the ->split_queue_len isn't
updated, leading to an inconsistency between it and the actual
number of folios in the split queue.

2) When the folio is split via split_folio() later, it's removed from
the local list while holding the split queue lock. At this time,
the lock is not needed as it is not protecting anything.

3) To handle the race condition with a third-party freeing or migrating
the preceding folio, we must ensure there's always one safe (with
raised refcount) folio before by delaying its folio_put(). More
details can be found in commit e66f3185fa04 ("mm/thp: fix deferred
split queue not partially_mapped"). It's rather tricky.

We can use the folio_batch infrastructure to handle this clearly. In this
case, ->split_queue_len will be consistent with the real number of folios
in the split queue. If list_empty(&folio->_deferred_list) returns false,
it's clear the folio must be in its split queue (not in a local list
anymore).

In the future, we will reparent LRU folios during memcg offline to
eliminate dying memory cgroups, which requires reparenting the split queue
to its parent first. So this patch prepares for using
folio_split_queue_lock_irqsave() as the memcg may change then.

Signed-off-by: Muchun Song <songmuchun@xxxxxxxxxxxxx>
Signed-off-by: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>
Reviewed-by: Zi Yan <ziy@xxxxxxxxxx>
Acked-by: David Hildenbrand <david@xxxxxxxxxx>
Acked-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>
---
mm/huge_memory.c | 87 +++++++++++++++++++++++-------------------------
1 file changed, 41 insertions(+), 46 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a68f26547cd99..e850bc10da3e2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3782,21 +3782,22 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
struct lruvec *lruvec;
int expected_refs;

- if (folio_order(folio) > 1 &&
- !list_empty(&folio->_deferred_list)) {
- ds_queue->split_queue_len--;
+ if (folio_order(folio) > 1) {
+ if (!list_empty(&folio->_deferred_list)) {
+ ds_queue->split_queue_len--;
+ /*
+ * Reinitialize page_deferred_list after removing the
+ * page from the split_queue, otherwise a subsequent
+ * split will see list corruption when checking the
+ * page_deferred_list.
+ */
+ list_del_init(&folio->_deferred_list);
+ }
if (folio_test_partially_mapped(folio)) {
folio_clear_partially_mapped(folio);
mod_mthp_stat(folio_order(folio),
MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
}
- /*
- * Reinitialize page_deferred_list after removing the
- * page from the split_queue, otherwise a subsequent
- * split will see list corruption when checking the
- * page_deferred_list.
- */
- list_del_init(&folio->_deferred_list);

@Andrew

Current mm-new looks not merge the code correctly?

The above removed code is still there.

@Qi

After rescan this, I am confused about this code change.

The difference here is originally it would check/clear partially_mapped if
folio is on a list. But now we would do this even folio is not on a list.

If my understanding is correct, after this change, !list_empty() means folio
is on its ds_queue. And there are total three places to remove it from
ds_queue.

1) __folio_unqueue_deferred_split()
2) deferred_split_scan()
3) __folio_split()

In 1) and 2) we all clear partially_mapped bit before removing folio from
ds_queue, this means if the folio is not on ds_queue in __folio_split(), it is
not necessary to check/clear partially_mapped bit.

In deferred_split_scan(), if folio_try_get() succeeds, then only the
folio will be removed from ds_queue, but not clear partially_mapped.


Maybe I missed something, would you mind correct me on this?