Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
From: Baolin Wang
Date: Fri Jan 16 2026 - 10:49:22 EST
On 1/16/26 10:28 PM, Barry Song wrote:
On Fri, Jan 16, 2026 at 5:53 PM Dev Jain <dev.jain@xxxxxxx> wrote:
On 07/01/26 7:16 am, Wei Yang wrote:
On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@xxxxxxxxx> wrote:
Yes, we do it in page_vma_mapped_walk() now. Since they are pte_none(), they
On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
I don’t quite understand your question. For nr_pages > 1 but not equal
Hi, Baolin
Similar to folio_referenced_one(), we can apply batched unmapping for file
large folios to optimize the performance of file folios reclamation.
Barry previously implemented batched unmapping for lazyfree anonymous large
folios[1] and did not further optimize anonymous large folios or file-backed
large folios at that stage. As for file-backed large folios, the batched
unmapping support is relatively straightforward, as we only need to clear
the consecutive (present) PTE entries for file-backed large folios.
Performance testing:
Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
75% performance improvement on my Arm64 32-core server (and 50%+ improvement
on my X86 machine) with this patch.
W/o patch:
real 0m1.018s
user 0m0.000s
sys 0m1.018s
W/ patch:
real 0m0.249s
user 0m0.000s
sys 0m0.249s
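As a quick cross-check of the 75% figure, the arithmetic on the two sys times above can be sketched as follows. The helper names time_reduction() and speedup_factor() are made up for this illustration and are not part of the patch or the test setup.

```c
#include <assert.h>

/* Fraction of the original time that was saved: (before - after) / before. */
static double time_reduction(double before_s, double after_s)
{
	assert(before_s > 0.0);
	return (before_s - after_s) / before_s;
}

/* How many times faster the second run is: before / after. */
static double speedup_factor(double before_s, double after_s)
{
	assert(after_s > 0.0);
	return before_s / after_s;
}
```

With 1.018s and 0.249s as inputs, the reduction comes out at roughly 75% and the factor at roughly 4x, which matches the Arm64 number quoted in the description.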
[1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@xxxxxxxxx/T/#u
Reviewed-by: Ryan Roberts <ryan.roberts@xxxxxxx>
Acked-by: Barry Song <baohua@xxxxxxxxxx>
Signed-off-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
---
mm/rmap.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 985ab0b085ba..e1d16003c514 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
end_addr = pmd_addr_end(addr, vma->vm_end);
max_nr = (end_addr - addr) >> PAGE_SHIFT;
- /* We only support lazyfree batching for now ... */
- if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
+ /* We only support lazyfree or file folios batching for now ... */
+ if (folio_test_anon(folio) && folio_test_swapbacked(folio))
return 1;
+
if (pte_unused(pte))
return 1;
@@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*
* See Documentation/mm/mmu_notifier.rst
*/
- dec_mm_counter(mm, mm_counter_file(folio));
+ add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
}
discard:
if (unlikely(folio_test_hugetlb(folio))) {
--
2.47.3
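To make the effect of the changed condition in folio_unmap_pte_batch() easier to see, here is a minimal userspace sketch of just the flag check. struct folio_model and the two may_batch_*() helpers are hypothetical names for this illustration only. The real function also returns 1 for pte_unused() and otherwise defers to folio_pte_batch(), which this model leaves out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two folio flags the check reads. */
struct folio_model {
	bool anon;		/* models folio_test_anon() */
	bool swapbacked;	/* models folio_test_swapbacked() */
};

/* Check before the patch: bail out unless the folio is anon and not swap-backed. */
static bool may_batch_old(struct folio_model f)
{
	if (!f.anon || f.swapbacked)
		return false;
	return true;
}

/* Check after the patch: bail out only for anon swap-backed folios. */
static bool may_batch_new(struct folio_model f)
{
	if (f.anon && f.swapbacked)
		return false;
	return true;
}
```

In this model a lazyfree folio (anon, not swap-backed) passes both checks, a file folio (not anon) passes only the new one, and an anon swap-backed folio passes neither. A non-anon swap-backed folio also passes the new check, simply as a consequence of the boolean condition.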
When reading your patch, I came up with one small question.
The current try_to_unmap_one() has the following structure:
try_to_unmap_one()
    while (page_vma_mapped_walk(&pvmw)) {
        nr_pages = folio_unmap_pte_batch()
        if (nr_pages == folio_nr_pages(folio))
            goto walk_done;
    }
I am wondering what happens if nr_pages > 1 but nr_pages != folio_nr_pages().
If my understanding is correct, page_vma_mapped_walk() would start from
(pvmw->address + PAGE_SIZE) in the next iteration, but we have already cleared up to
(pvmw->address + nr_pages * PAGE_SIZE), right?
I am not sure my understanding is correct. If it is, do we have some reason not to
skip the cleared range?
to nr_pages, page_vma_mapped_walk will skip the nr_pages - 1 PTEs inside.
take a look:
next_pte:
    do {
        pvmw->address += PAGE_SIZE;
        if (pvmw->address >= end)
            return not_found(pvmw);
        /* Did we cross page table boundary? */
        if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
            if (pvmw->ptl) {
                spin_unlock(pvmw->ptl);
                pvmw->ptl = NULL;
            }
            pte_unmap(pvmw->pte);
            pvmw->pte = NULL;
            pvmw->flags |= PVMW_PGTABLE_CROSSED;
            goto restart;
        }
        pvmw->pte++;
    } while (pte_none(ptep_get(pvmw->pte)));
will be skipped.
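The skipping behaviour described here can be sketched in userspace as below. This is only a model under stated assumptions: pte_model_t, walk_next() and clear_batch() are hypothetical names, a zero value stands in for pte_none(), and the page table boundary handling and locking from the real loop are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: 0 stands for pte_none(), nonzero for any other PTE. */
typedef unsigned long pte_model_t;

/*
 * Model of the next_pte loop: step forward one entry at a time and stop at
 * the first entry that is not none. Returns n if the range is exhausted.
 */
static size_t walk_next(const pte_model_t *ptes, size_t n, size_t cur)
{
	do {
		cur++;
		if (cur >= n)
			return n;
	} while (ptes[cur] == 0);
	return cur;
}

/* Model of a batched unmap: clear nr consecutive entries starting at cur. */
static void clear_batch(pte_model_t *ptes, size_t cur, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		ptes[cur + i] = 0;
}
```

In this model, clearing a batch of 4 entries at index 0 in an 8-entry table and then calling walk_next() from index 0 lands on index 4, so the remaining 3 cleared entries are passed over without returning to the caller.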
I mean maybe we can skip it in try_to_unmap_one(), for example:
diff --git a/mm/rmap.c b/mm/rmap.c
index 9e5bd4834481..ea1afec7c802 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*/
if (nr_pages == folio_nr_pages(folio))
goto walk_done;
+ else {
+ pvmw.address += PAGE_SIZE * (nr_pages - 1);
+ pvmw.pte += nr_pages - 1;
+ }
continue;
walk_abort:
ret = false;
I am still not convinced that skipping PTEs in try_to_unmap_one()
is the right place. If we really want to skip certain PTEs early,
should we instead hint page_vma_mapped_walk()? That said, I don't
see much value in doing so, since in most cases nr is either 1 or
folio_nr_pages(folio).
I am of the opinion that we should do something like this. In the internal pvmw code,
we keep skipping ptes as long as they are none. With my proposed uffd-fix [1], if the old
ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert all ptes from none
to not none, and we will lose the batching effect. I also plan to extend support to
anonymous folios (therefore generalizing for all types of memory) which will set a
batch of ptes as swap, and the internal pvmw code won't be able to skip through the
batch.
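The concern above, that a batch whose slots end up non-none can no longer be skipped by the none-only loop, can be illustrated with a small userspace model. Everything here is an assumption made for illustration: enum pte_model, unmap_batch() and non_none_inside_batch() are hypothetical, and PTE_NON_NONE_MARKER merely stands in for a uffd-wp marker or a swap entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical PTE states, for this model only. */
enum pte_model { PTE_NONE, PTE_PRESENT, PTE_NON_NONE_MARKER };

/*
 * Model a batched unmap of nr entries starting at start. If leave_marker is
 * true, each cleared slot is refilled with a non-none entry; otherwise the
 * slot is left none.
 */
static void unmap_batch(enum pte_model *ptes, size_t start, size_t nr,
			bool leave_marker)
{
	for (size_t i = 0; i < nr; i++)
		ptes[start + i] = leave_marker ? PTE_NON_NONE_MARKER : PTE_NONE;
}

/*
 * Count the entries inside the batch, after its first slot, that a loop
 * skipping only none entries would not pass over.
 */
static size_t non_none_inside_batch(const enum pte_model *ptes, size_t start,
				    size_t nr)
{
	size_t count = 0;

	for (size_t i = 1; i < nr; i++)
		if (ptes[start + i] != PTE_NONE)
			count++;
	return count;
}
```

For a batch of 16 entries the count is 0 when the slots are left none and 15 when they are refilled, which is the case an explicit skip in the caller (or a hint to the walker) would address.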
Thanks for catching this, Dev. I already filter out some of the more
complex cases, for example:
if (pte_unused(pte))
return 1;
Hi Dev, thanks for the report[1], which also explains why the mm-selftests can pass.
[1] https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@xxxxxxx/
Since the userfaultfd write-protection case is also a corner case,
could we filter it out as well?
diff --git a/mm/rmap.c b/mm/rmap.c
index c86f1135222b..6bb8ba6f046e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1870,6 +1870,9 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
if (pte_unused(pte))
return 1;
+ if (userfaultfd_wp(vma))
+ return 1;
+
return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
}
That small fix makes sense to me. I think Dev can continue working on the UFFD batch
optimization, and we need more review and testing for the UFFD batched operations, as
David suggested[2].
[2] https://lore.kernel.org/all/9edeeef1-5553-406b-8e56-30b11809eec5@xxxxxxxxxx/