[PATCH 0/5] mm/khugepaged: add collapse hint mechanism for khugepaged and use in mglru
From: Luka Bai
Date: Sun May 31 2026 - 00:24:14 EST
Khugepaged is a background daemon that collapses eligible pages into
transparent hugepages of various orders up to PMD_ORDER. However, it has no
preference about what to collapse: it simply iterates through all the
qualified mm_structs and scans their page tables from beginning to end. This
is quite inefficient, especially for large address spaces, considering how
slow khugepaged can be, and it may waste hugepage resources on collapsing
memory areas that are seldom accessed.
We would like to give khugepaged preference hints when we find that certain
areas are good candidates for collapsing. For example, if some memory areas
are frequently accessed, it is valuable to merge them into a bigger folio,
since that reduces TLB misses.
MGLRU, for instance, has walk_mm() and lru_gen_look_around(), which scan
frequently accessed areas to save some work on rmap walking and generation
elevation. At the same time, they are able to find hot memory areas, and it
should be valuable to merge those areas into large folios.
MADV_COLLAPSE could be used, but it costs too much time: it would hurt
reclaim performance and slow down processes that enter the slow path of
memory allocation. So the better choice should be to tell khugepaged to do
the collapse asynchronously.
This patchset adds a khugepaged collapse hint framework. A caller can call
khugepaged_add_collapse_hint() to add hints so that khugepaged prioritizes
collapsing those specific addresses before doing its round-robin scanning.
Each mm_slot that belonged to an mm_struct in the previous mm_slots_hash is
now a khugepaged_mm_slot, which comprises the old mm_slot struct and
NR_KHUGEPAGED_PRIORITY_LEVEL struct khugepaged_collapse_requests, one per
priority level. When __khugepaged_enter() is called on an mm, each of its
request structs is put on the global struct khugepaged_priority_queue for its
priority, and all hints are stored in these request structs. (Each mm gets
its own request structs to allow hint dispersion and balancing across
mm_structs, which will be added in future patches.) Each hint records the
target address and the target vma struct. An example of the framework is
shown below:
global collapse hints queues:

prio 0 ------()----------------------------------()---------------
    mm_slot0(process A)                 mm_slot1(process B)
             |                                   |
    hint0---hint1---hint2---hint3       hint4---hint5---hint6

prio 1 ------()----------------------------------()---------------
    mm_slot0(process A)                 mm_slot1(process B)
             |                                   |
          -------                          hint7---hint8
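
A rough userspace model of this layout might look like the following. It is
only a sketch to illustrate the relationships: the type, field and function
names (struct slot, struct request, slot_enter(), slot_add_hint(), ...) and
the value of NR_PRIO are made up for illustration and are not the ones used
in the patches. The real hint also records the target vma, and the real code
uses kernel lists and locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_PRIO 2 /* stand-in for NR_KHUGEPAGED_PRIORITY_LEVEL (value assumed) */

/* One hint; the real hint also records the target VMA. */
struct hint {
	unsigned long addr;
	struct hint *next;
};

/* Per-mm, per-priority container of hints, linked into one global queue. */
struct request {
	struct hint *head, *tail;
	struct request *next;	/* next request in the same priority queue */
};

/* Models khugepaged_mm_slot: the old mm_slot plus one request per level. */
struct slot {
	int mm_id;		/* stand-in for the embedded mm_slot */
	struct request req[NR_PRIO];
};

/* Models the global khugepaged_priority_queue, one list per level. */
struct prio_queue {
	struct request *head, *tail;
};

/* Models __khugepaged_enter(): put every request of the slot on its queue. */
void slot_enter(struct prio_queue *q, struct slot *s, int mm_id)
{
	s->mm_id = mm_id;
	for (int p = 0; p < NR_PRIO; p++) {
		struct request *r = &s->req[p];

		r->head = r->tail = NULL;
		r->next = NULL;
		if (q[p].tail)
			q[p].tail->next = r;
		else
			q[p].head = r;
		q[p].tail = r;
	}
}

/* Append a hint to the slot's request at the given priority; 0 on success. */
int slot_add_hint(struct slot *s, int prio, unsigned long addr)
{
	struct hint *h;

	assert(prio >= 0 && prio < NR_PRIO);
	h = malloc(sizeof(*h));
	if (!h)
		return -1;
	h->addr = addr;
	h->next = NULL;
	if (s->req[prio].tail)
		s->req[prio].tail->next = h;
	else
		s->req[prio].head = h;
	s->req[prio].tail = h;
	return 0;
}

/* Count the hints pending in one request. */
size_t request_len(const struct request *r)
{
	size_t n = 0;

	for (const struct hint *h = r->head; h; h = h->next)
		n++;
	return n;
}
```

Calling slot_enter() for process A and then process B, followed by
slot_add_hint() for hint0 ~ hint8, reproduces the shape of the example: both
slots sit on both queues, and the prio 1 request of process A stays empty.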
Khugepaged scans the queues from the highest priority (prio 0 in the diagram
above) to the lowest (prio 1 in the diagram). Within a queue it goes through
the list and checks each struct khugepaged_mm_slot (mm_slot0 and mm_slot1 in
the diagram), so it starts from mm_slot0 in the prio 0 queue and scans all
the hints listed in that slot (hint0 ~ hint3 in the diagram). After a hint is
handled, it is deleted, no matter whether the collapse succeeded or failed.
If a khugepaged_mm_slot has no hints, khugepaged skips it and scans the next
mm_slot of the same priority; if no hints are left in the prio 0 queue,
khugepaged scans those of prio 1; if no hints are left in any priority queue,
it falls back to round-robin scanning as before.
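
The scan order described above can be sketched as a small userspace model. This
is an illustration under assumptions, not the code in the patches: the names
(struct slot, push_hint(), next_hint()), the fixed-size arrays and the value
of NR_PRIO are hypothetical, and restarting the search from prio 0 on every
call is a simplification.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_PRIO   2	/* stand-in for NR_KHUGEPAGED_PRIORITY_LEVEL (value assumed) */
#define MAX_HINTS 8	/* arbitrary capacity for this model */

/* Per-mm, per-priority FIFO of hinted addresses. */
struct request {
	unsigned long hint[MAX_HINTS];
	int head, tail;		/* pending hints are hint[head..tail) */
};

/* Models one khugepaged_mm_slot: one request per priority level. */
struct slot {
	struct request req[NR_PRIO];
};

/* Queue a hint on a slot at the given priority; 0 on success, -1 if full. */
int push_hint(struct slot *s, int prio, unsigned long addr)
{
	struct request *r;

	assert(prio >= 0 && prio < NR_PRIO);
	r = &s->req[prio];
	if (r->tail == MAX_HINTS)
		return -1;
	r->hint[r->tail++] = addr;
	return 0;
}

/*
 * Pick the next hint: highest priority first, slots in list order within a
 * priority, hints in FIFO order within a slot. The hint is removed whether
 * or not the later collapse succeeds. Returns false when no hint is left,
 * which is where the caller would fall back to round-robin scanning.
 */
bool next_hint(struct slot *slots, int nr_slots, unsigned long *addr)
{
	for (int p = 0; p < NR_PRIO; p++) {
		for (int i = 0; i < nr_slots; i++) {
			struct request *r = &slots[i].req[p];

			if (r->head == r->tail)
				continue;	/* no hints here, skip this slot */
			*addr = r->hint[r->head++];
			return true;
		}
	}
	return false;
}
```

With the hints from the example queued on two slots, repeated calls to
next_hint() return hint0 ~ hint3, then hint4 ~ hint6, then hint7 and hint8
(skipping the empty prio 1 request of process A), and finally report that
nothing is left.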
khugepaged_add_collapse_hint() is the interface for adding hints, and right
now it is only called by walk_mm() and lru_gen_look_around(). In the future
we may call it in more scenarios where hot memory areas are found, for
example in DAMON.
We tested the performance using valkey-server (based on redis) together with
memtier_benchmark, simulating a Gaussian distribution of get/set operations
on an x86 VM with 160G of memory and 64 cores. The dataset is about 3G. After
preloading the db, the test parameters were as below:
memtier_benchmark -s 127.0.0.1 -p 6379 \
--ratio=1:1 \
--key-pattern=G:G \
--key-minimum=1 --key-maximum=3000000 \
--key-median=2000000 \
--key-stddev=150000 \
-d 1024 \
-t 1 -c 10 \
-n 2500000 \
--pipeline=32 \
--hide-histogram
Since we wanted to see the effect of khugepaged collapse hints on TLB misses,
we made khugepaged scan every 1 second, and used the userspace interface to
run walk_mm() every 2 seconds for the cgroup that valkey-server was placed
in. We made sure that the server used only 4k pages before each test run, and
that only khugepaged could collapse them into large folios. We enabled
anonymous THP of order 9, which is PMD size in most setups. We used perf stat
to monitor the TLB miss statistics.
After repeated tests, we saw a 13.50% reduction in dTLB-load-misses and a 5%
reduction in dTLB-store-misses compared to the setup without any collapse
hints. The final memtier_benchmark throughput improved by about 2% to 5% on
average, which is less pronounced than the TLB miss reduction. We believe
this is because too many factors influence the final result of a random
redis test, so the effect of TLB misses on the final throughput is diluted by
other factors.
Patch Details:
========
* Patch 1 adds the basic khugepaged hint framework introduced above. Details
  can be found in the commit itself and in the code comments.
* Patch 2 adds a slab cache for khugepaged_collapse_hint, which improves the
  performance of allocating and freeing hints.
* Patch 3 adds a deduplication mechanism for hints, so that we do not add a
  hint for an address that already has one.
* Patch 4 adds accounting for successful collapses, split by whether they
  were initiated by a hint or not.
* Patch 5 adds collapse hints in lru_gen_look_around() and walk_mm() of
  MGLRU.
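
To illustrate the deduplication idea of patch 3, here is a minimal userspace
sketch. It assumes that two hints count as duplicates when they fall into the
same 2 MiB region (PMD size with 4k pages on x86-64); the patch itself may
compare addresses differently and use another data structure. The names
(struct hint_set, hint_add_dedup()) and constants are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define REGION_SHIFT 21	/* 2 MiB regions; an assumption for this sketch */
#define MAX_HINTS    16	/* arbitrary capacity for this model */

/* Set of regions that already have a pending hint. */
struct hint_set {
	unsigned long region[MAX_HINTS];
	size_t n;
};

/*
 * Add a hint for addr unless its region is already hinted.
 * Returns true if a new hint was recorded, false if it was a duplicate
 * or the set is full.
 */
bool hint_add_dedup(struct hint_set *s, unsigned long addr)
{
	unsigned long r = addr >> REGION_SHIFT;

	assert(s != NULL && s->n <= MAX_HINTS);
	for (size_t i = 0; i < s->n; i++)
		if (s->region[i] == r)
			return false;	/* already hinted, drop the new one */
	if (s->n == MAX_HINTS)
		return false;
	s->region[s->n++] = r;
	return true;
}
```

A linear search keeps the sketch short; with many pending hints per mm, a
hash or tree keyed by the aligned address would be the more natural choice.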
Thanks for reading. Comments and suggestions are very welcome!
Signed-off-by: Luka Bai <lukabai@xxxxxxxxxxx>
---
Luka Bai (5):
mm/khugepaged: add framework for khugepaged collapse hint
mm/khugepaged: use slab cache instead of normal kmalloc
mm/khugepaged: add deduplication when adding new collapse hint
mm/khugepaged: add accounting for successful hint or non-hint collapse
mm/khugepaged: add khugepaged collapse hint in mglru reference checking
include/linux/huge_mm.h | 2 +
include/linux/khugepaged.h | 20 ++
include/linux/mmzone.h | 17 +-
mm/huge_memory.c | 4 +
mm/khugepaged.c | 460 ++++++++++++++++++++++++++++++++++++++++++++-
mm/rmap.c | 27 ++-
mm/vmscan.c | 33 +++-
7 files changed, 549 insertions(+), 14 deletions(-)
---
base-commit: e1af79f3291a268adf4e149e1faba3052743e898
change-id: 20260530-thp_collapse_hint-ec92bd943797
Best regards,
--
Luka Bai <lukabai@xxxxxxxxxxx>