Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE

From: David Hildenbrand (Red Hat)

Date: Sun Dec 21 2025 - 04:24:16 EST


On 12/21/25 05:25, Vernon Yang wrote:
On Sun, Dec 21, 2025 at 02:10:44AM +0000, Wei Yang wrote:
On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
On 12/19/25 06:29, Vernon Yang wrote:
On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
On 12/15/25 10:04, Vernon Yang wrote:
For example, create three tasks: hot1 -> cold -> hot2. After all three
tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
continuously access their 128 MB of memory, while the cold task only accesses
its memory briefly and then calls madvise(MADV_COLD). However, khugepaged
still prioritizes scanning the cold task and only scans the hot2 task
after completing the scan of the cold task.
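
The "cold" task in this scenario can be sketched in user space roughly as
follows. This is only an illustration, not the test program used for the
numbers in this thread: the helper name cold_task_once() is hypothetical, and
the fallback definition of MADV_COLD assumes the Linux uapi value for kernels
that support it (5.4 and later).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLD
#define MADV_COLD 20 /* assumption: uapi value, for older libc headers */
#endif

/*
 * Hypothetical helper: map `len` bytes of anonymous memory, touch every
 * page once, then tell the kernel the range is cold.
 *
 * Returns the madvise() result (0, or -1 with errno set), or -1 if mmap()
 * fails. The mapping is released before returning.
 */
int cold_task_once(size_t len)
{
    unsigned char *buf;
    int ret, saved_errno;

    assert(len > 0);

    buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memset(buf, 0x5a, len);                 /* brief access: fault pages in */
    ret = madvise(buf, len, MADV_COLD);     /* hint: this range is cold */

    saved_errno = errno;                    /* keep madvise()'s errno */
    munmap(buf, len);
    errno = saved_errno;
    return ret;
}
```

A real cold task would keep the mapping alive after the madvise() call (for
example, cold_task_once(128UL << 20) followed by sleeping instead of
unmapping), so that khugepaged still finds the range when it reaches this mm.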

So if the user has explicitly informed us via MADV_COLD/FREE that this
memory is cold or will be freed, it is appropriate for khugepaged to
scan it only at the latest possible moment, thereby avoiding unnecessary
scan and collapse operations and reducing wasted CPU time.

Here are the performance test results:
(For throughput, higher is better; for the other metrics, lower is better)

Testing on x86_64 machine:

| task hot2         | without patch | with patch   | delta   |
|-------------------|---------------|--------------|---------|
| total access time | 3.14 sec      | 2.92 sec     | -7.01%  |
| cycles per access | 4.91          | 2.07         | -57.84% |
| Throughput        | 104.38 M/sec  | 112.12 M/sec | +7.42%  |
| dTLB-load-misses  | 288966432     | 1292908      | -99.55% |

Testing on qemu-system-x86_64 -enable-kvm:

| task hot2         | without patch | with patch   | delta   |
|-------------------|---------------|--------------|---------|
| total access time | 3.35 sec      | 2.96 sec     | -11.64% |
| cycles per access | 7.23          | 2.12         | -70.68% |
| Throughput        | 97.88 M/sec   | 110.76 M/sec | +13.16% |
| dTLB-load-misses  | 237406497     | 3189194      | -98.66% |

Again, I also don't like that because you make assumptions about a full process
based on some part of its address space.

E.g., if a library issues a MADV_COLD on some part of the memory the library
manages, why should the remaining part of the process suffer as well?

Yes, you make a good point, thanks!

This seems to be a heuristic focused on some specific workloads, no?

Right.

Could we use the VM_NOHUGEPAGE flag to indicate that this region should
not be collapsed, so that khugepaged can simply skip this VMA during
scanning? This way, it won't affect the remaining part of the task's
memory regions.

I thought we would skip these regions already properly in khugepaged, or
maybe I misunderstood your question.


I think we should, but it seems we don't do this for anonymous memory in
khugepaged.

We check the vma with thp_vma_allowable_order() during scan.

* For anonymous memory in khugepaged, if we always enable 2M collapse,
we will scan this vma even if VM_NOHUGEPAGE is set.

* For other cases, it looks good since __thp_vma_allowable_orders() will skip
this vma with vma_thp_disabled().

Hi David, Wei,

khugepaged already checks the VM_NOHUGEPAGE flag for anonymous
memory during the scan, as shown below:

khugepaged_scan_mm_slot()
    thp_vma_allowable_order()
        thp_vma_allowable_orders()
            __thp_vma_allowable_orders()
                vma_thp_disabled() {
                    if (vm_flags & VM_NOHUGEPAGE)
                        return true;
                }

REAL ISSUE: madvise(MADV_COLD) does not set the VM_NOHUGEPAGE flag on the
vma, so khugepaged will continue to scan this vma.

I set the VM_NOHUGEPAGE flag on the vma in madvise(MADV_COLD), and the test
succeeded. I will send this in the next version.

No, we must not do that. That's a user-space visible change. :/
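
One way such a flag change can be observed from user space is the VmFlags
line in /proc/<pid>/smaps, where VM_NOHUGEPAGE is reported as "nh". The
sketch below is only an illustration of that one channel, not a complete list
of the user-visible effects. The helper names smaps_vma_has_nh() and
nohugepage_becomes_visible() are hypothetical, and it uses the existing
MADV_NOHUGEPAGE advice (which sets the flag today) as a stand-in for what a
flag-setting MADV_COLD would look like to the caller.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: scan smaps-formatted text and report whether the VMA
 * containing `addr` lists "nh" in its VmFlags line.
 * Returns 1 if present, 0 if absent, -1 if no matching VMA/VmFlags is found.
 */
int smaps_vma_has_nh(FILE *smaps, unsigned long addr)
{
    char line[512];
    unsigned long start, end;
    int in_vma = 0;

    assert(smaps != NULL);

    while (fgets(line, sizeof(line), smaps)) {
        /* A VMA header line starts with "start-end" in hex. */
        if (sscanf(line, "%lx-%lx ", &start, &end) == 2)
            in_vma = (addr >= start && addr < end);
        else if (in_vma && strncmp(line, "VmFlags:", 8) == 0)
            return strstr(line + 8, " nh") != NULL;
    }
    return -1;
}

/* Look up `addr` in the live /proc/self/smaps; -1 if it cannot be read. */
static int self_has_nh(const void *addr)
{
    FILE *f = fopen("/proc/self/smaps", "r");
    int ret;

    if (!f)
        return -1;
    ret = smaps_vma_has_nh(f, (unsigned long)addr);
    fclose(f);
    return ret;
}

/*
 * Returns 1 if "nh" shows up only after madvise(MADV_NOHUGEPAGE),
 * 0 if it does not, -1 if any step fails (e.g. no THP support, no procfs).
 */
int nohugepage_becomes_visible(size_t len)
{
    int before, after, ret;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return -1;

    before = self_has_nh(p);
    ret = madvise(p, len, MADV_NOHUGEPAGE);
    after = self_has_nh(p);
    munmap(p, len);

    if (ret != 0 || before < 0 || after < 0)
        return -1;
    return before == 0 && after == 1;
}
```

If MADV_COLD set the same flag, a program that only ever issued MADV_COLD
would start seeing "nh" on those ranges, and the ranges would stay excluded
from THP until something cleared the flag again.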

--
Cheers

David