Re: [PATCH mm-new v7 4/5] mm: khugepaged: skip lazy-free folios

From: Lance Yang

Date: Sat Feb 07 2026 - 08:52:03 EST

On 2026/2/7 16:34, Barry Song wrote:

On Sat, Feb 7, 2026 at 4:16 PM Vernon Yang <vernon2gm@xxxxxxxxx> wrote:

From: Vernon Yang <yanglincheng@xxxxxxxxxx>

For example, create three tasks: hot1 -> cold -> hot2. After all three
tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
continuously access their 128 MB of memory, while the cold task only
accesses its memory briefly and then calls madvise(MADV_FREE). However,
khugepaged still prioritizes scanning the cold task and only scans the
hot2 task after completing the scan of the cold task.

Moreover, if we collapse a range that contains lazyfree pages, their
content will never be none, so the deferred shrinker cannot reclaim them.

So if the user has explicitly informed us via MADV_FREE that this memory
will be freed, it is appropriate for khugepaged to simply skip it, thereby
avoiding unnecessary scan and collapse operations and reducing CPU
waste.

Here are the performance test results:
(Throughput: bigger is better; all others: smaller is better)

Testing on x86_64 machine:

| task hot2 | without patch | with patch | delta |
|---------------------|---------------|---------------|---------|
| total accesses time | 3.14 sec | 2.93 sec | -6.69% |
| cycles per access | 4.96 | 2.21 | -55.44% |
| Throughput | 104.38 M/sec | 111.89 M/sec | +7.19% |
| dTLB-load-misses | 284814532 | 69597236 | -75.56% |

Testing on qemu-system-x86_64 -enable-kvm:

| task hot2 | without patch | with patch | delta |
|---------------------|---------------|---------------|---------|
| total accesses time | 3.35 sec | 2.96 sec | -11.64% |
| cycles per access | 7.29 | 2.07 | -71.60% |
| Throughput | 97.67 M/sec | 110.77 M/sec | +13.41% |
| dTLB-load-misses | 241600871 | 3216108 | -98.67% |

Signed-off-by: Vernon Yang <yanglincheng@xxxxxxxxxx>
Acked-by: David Hildenbrand (arm) <david@xxxxxxxxxx>
Reviewed-by: Lance Yang <lance.yang@xxxxxxxxx>
---
include/trace/events/huge_memory.h | 1 +
mm/khugepaged.c | 13 +++++++++++++
2 files changed, 14 insertions(+)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 384e29f6bef0..bcdc57eea270 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -25,6 +25,7 @@
EM( SCAN_PAGE_LRU, "page_not_in_lru") \
EM( SCAN_PAGE_LOCK, "page_locked") \
EM( SCAN_PAGE_ANON, "page_not_anon") \
+ EM( SCAN_PAGE_LAZYFREE, "page_lazyfree") \
EM( SCAN_PAGE_COMPOUND, "page_compound") \
EM( SCAN_ANY_PROCESS, "no_process_for_page") \
EM( SCAN_VMA_NULL, "vma_null") \
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8b68ae3bc2c5..0d160e612e16 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -46,6 +46,7 @@ enum scan_result {
SCAN_PAGE_LRU,
SCAN_PAGE_LOCK,
SCAN_PAGE_ANON,
+ SCAN_PAGE_LAZYFREE,
SCAN_PAGE_COMPOUND,
SCAN_ANY_PROCESS,
SCAN_VMA_NULL,
@@ -583,6 +584,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
folio = page_folio(page);
VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);

+ if (cc->is_khugepaged && !pte_dirty(pteval) &&
+ folio_test_lazyfree(folio)) {

We have two corner cases here:

Good catch!

1. If the VMA has the VM_DROPPABLE flag, a lazyfree folio may still be
dropped even when the folio or its PTE is dirty.

Right. When the VMA has VM_DROPPABLE, we would drop the lazyfree folio
regardless of whether it (or the PTE) is dirty in try_to_unmap_one().

So, IMHO, we could go with:

cc->is_khugepaged && folio_test_lazyfree(folio) &&
(!pte_dirty(pteval) || (vma->vm_flags & VM_DROPPABLE))
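
For illustration only, here is a minimal user-space sketch of that
condition. It is plain C, not kernel code; the struct and helper names are
hypothetical and only model the four inputs discussed above:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the inputs; none of these names exist in the kernel. */
struct lazyfree_inputs {
	bool is_khugepaged;	/* models cc->is_khugepaged */
	bool folio_lazyfree;	/* models folio_test_lazyfree(folio) */
	bool pte_dirty;		/* models pte_dirty(pteval) */
	bool vma_droppable;	/* models vma->vm_flags & VM_DROPPABLE */
};

/* Mirrors the proposed condition: skip only when reclaim could drop the folio. */
static bool should_skip_lazyfree(const struct lazyfree_inputs *in)
{
	return in->is_khugepaged && in->folio_lazyfree &&
	       (!in->pte_dirty || in->vma_droppable);
}

static void check_lazyfree_cases(void)
{
	/* Clean PTE, ordinary VMA: the case the patch already skips. */
	struct lazyfree_inputs clean = {
		.is_khugepaged = true, .folio_lazyfree = true,
	};
	/* Dirty PTE, ordinary VMA: reclaim would keep the folio, so no skip. */
	struct lazyfree_inputs dirty = {
		.is_khugepaged = true, .folio_lazyfree = true,
		.pte_dirty = true,
	};
	/* Dirty PTE, VM_DROPPABLE VMA: still droppable, so skip. */
	struct lazyfree_inputs droppable = {
		.is_khugepaged = true, .folio_lazyfree = true,
		.pte_dirty = true, .vma_droppable = true,
	};
	/* Not khugepaged (e.g. MADV_COLLAPSE): this check never skips. */
	struct lazyfree_inputs madv_collapse = {
		.folio_lazyfree = true,
	};

	assert(should_skip_lazyfree(&clean));
	assert(!should_skip_lazyfree(&dirty));
	assert(should_skip_lazyfree(&droppable));
	assert(!should_skip_lazyfree(&madv_collapse));
}
```

The only case that changes relative to the posted hunk is the third one
(dirty PTE in a VM_DROPPABLE VMA).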

2. GUP operation can cause a folio to become dirty.

Emm... I don't think we need to do anything special for GUP here :)

IIUC, if the range is pinned, MADV_COLLAPSE/khugepaged already fails:
we hit the refcount check in hpage_collapse_scan_pmd() (expected vs
actual refcount) and return -EAGAIN.

```
	/*
	 * Check if the page has any GUP (or other external) pins.
	 *
	 * Here the check may be racy:
	 * it may see folio_mapcount() > folio_ref_count().
	 * But such case is ephemeral we could always retry collapse
	 * later. However it may report false positive if the page
	 * has excessive GUP pins (i.e. 512). Anyway the same check
	 * will be done again later the risk seems low.
	 */
	if (folio_expected_ref_count(folio) != folio_ref_count(folio)) {
		result = SCAN_PAGE_COUNT;
		goto out_unmap;
	}
```
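
As a rough illustration of why a pinned folio never gets this far, here is
a simplified user-space model of that comparison. It is only a sketch: the
struct and the model_* helpers are hypothetical, and it assumes each mapping
accounts for exactly one expected reference and nothing else holds one:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a small anonymous folio. */
struct model_folio {
	int mapcount;	/* number of page-table mappings */
	int refcount;	/* total references currently held */
};

/*
 * Assumption for this model: every mapping contributes one expected
 * reference and there are no other legitimate holders.
 */
static int model_expected_ref_count(const struct model_folio *f)
{
	return f->mapcount;
}

/* True when the scan would bail out, as with SCAN_PAGE_COUNT above. */
static bool model_has_extra_refs(const struct model_folio *f)
{
	return model_expected_ref_count(f) != f->refcount;
}

static void check_refcount_cases(void)
{
	/* Mapped once, no extra references: the scan may continue. */
	struct model_folio unpinned = { .mapcount = 1, .refcount = 1 };
	/* Mapped once plus one extra (GUP-like) reference: the scan bails. */
	struct model_folio pinned = { .mapcount = 1, .refcount = 2 };

	assert(!model_has_extra_refs(&unpinned));
	assert(model_has_extra_refs(&pinned));
}
```
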
Cheers,
Lance

I see the corner cases from try_to_unmap_one():

	if (folio_test_dirty(folio) &&
	    !(vma->vm_flags & VM_DROPPABLE)) {
		/*
		 * redirtied either using the page table or a
		 * previously obtained GUP reference.
		 */
		set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
		folio_set_swapbacked(folio);
		goto walk_abort;
	}

Should we take these two corner cases into account?

+ result = SCAN_PAGE_LAZYFREE;
+ goto out;
+ }
+
/* See hpage_collapse_scan_pmd(). */
if (folio_maybe_mapped_shared(folio)) {
++shared;
@@ -1335,6 +1342,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
}
folio = page_folio(page);

+ if (cc->is_khugepaged && !pte_dirty(pteval) &&
+ folio_test_lazyfree(folio)) {
+ result = SCAN_PAGE_LAZYFREE;
+ goto out_unmap;
+ }
+
if (!folio_test_anon(folio)) {
result = SCAN_PAGE_ANON;
goto out_unmap;

Thanks
Barry