Re: [PATCH 0/5] mm/khugepaged: add collapse hint mechanism for khugepaged and use in mglru
From: Nico Pache
Date: Tue Jun 09 2026 - 06:36:23 EST
On Sat, May 30, 2026 at 10:33 PM Luka Bai <lukafocus@xxxxxxxxxx> wrote:
>
> Khugepaged is a background daemon that collapses eligible pages into
> transparent hugepages of various orders up to PMD_ORDER. However, it has
> no preference in what it collapses: it just iterates through all the
> qualified mm_structs and scans their page tables from beginning to end.
> This is quite inefficient, especially for large address spaces given how
> slow khugepaged can be, and it may waste hugepage resources collapsing
> memory areas that are seldom accessed.
>
> We would like to give khugepaged preference hints when we find that
> certain areas are good candidates for collapsing. For example, if some memory
> areas are frequently accessed, then we know that it's valuable to merge
> them into a bigger folio since doing so will reduce TLB misses.
>
> For example, MGLRU has walk_mm() and lru_gen_look_around(), which are used to
> scan frequently accessed areas to save some work on rmap walking and
> generation elevation. At the same time, they are able to find hot memory
> areas, and it should be valuable to merge these areas into larger folios.
> MADV_COLLAPSE can be used, but that costs too much time, harms the
> performance of reclamation, and slows down any process that may
> enter the slow path of memory allocation. So the better choice should be to
> tell khugepaged to do it asynchronously.
>
> We add a khugepaged collapse hint framework in this patchset. The caller can
> call khugepaged_add_collapse_hint() to add hints so that khugepaged
> prioritizes collapsing these specific addresses before doing round-robin
> scanning. Each mm_slot which belongs to an mm_struct in the previous
> mm_slots_hash is now a khugepaged_mm_slot; it comprises the old mm_slot
> struct and NR_KHUGEPAGED_PRIORITY_LEVEL instances of struct
> khugepaged_collapse_requests. The request structs for each mm_struct are
> put in the global struct khugepaged_priority_queue according to their
> priority when __khugepaged_enter() is called on this mm (we give each mm its
> own request structs to allow hint dispersion and balancing across all the
> mm_structs, which will be added in future patches), and all the hints are put
> in these request structs. Each hint holds the target address and the target
> vma struct. An example of the framework is shown below:
>
> global collapse hints queues:
> prio 0 ------()----------------------------------()---------------
>        mm_slot0(process A)                mm_slot1(process B)
>              |                                   |
>  hint0---hint1---hint2---hint3        hint4---hint5---hint6
>
> prio 1 ------()----------------------------------()---------------
>        mm_slot0(process A)                mm_slot1(process B)
>              |                                   |
>           -------                           hint7---hint8
>
> khugepaged scans the queues from the highest priority (prio 0 in the graph
> above) to the lowest priority (prio 1 in the graph). Within a queue it goes
> through the list and checks each struct khugepaged_mm_slot (mm_slot0 and
> mm_slot1 in the graph above), so it starts from mm_slot0 in the queue of
> priority 0. khugepaged then scans all the hints listed in the slot (hint0 ~
> hint3 in the above graph). After a hint is handled (no matter whether the
> collapse succeeds or fails), the hint is deleted. If a khugepaged_mm_slot has
> no hints in it, khugepaged skips it and scans the next mm_slot at the same
> priority; if there are no hints left in the prio 0 queue, khugepaged scans
> the ones of prio 1; if there are no hints in any prio queue, it falls back to
> round-robin scanning as before.
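
A minimal userspace sketch of the scan order described above, written in Python purely as a model. It is not the patch's kernel code: `MmSlot`, `next_hint`, `drain` and the value of `NR_PRIORITY_LEVELS` are hypothetical names chosen for illustration, and the model assumes that scanning restarts from the highest priority after every handled hint.

```python
from collections import deque
from dataclasses import dataclass, field

NR_PRIORITY_LEVELS = 2  # hypothetical value; the cover letter does not state it


@dataclass
class MmSlot:
    """Userspace stand-in for one khugepaged_mm_slot (illustrative only)."""
    name: str
    # One FIFO of hints per priority level; index 0 is the highest priority.
    hints: list = field(
        default_factory=lambda: [deque() for _ in range(NR_PRIORITY_LEVELS)]
    )


def next_hint(slots):
    """Pop the next hint as (slot name, prio, hint), or return None when no
    hints remain and the caller would fall back to round-robin scanning."""
    for prio in range(NR_PRIORITY_LEVELS):  # highest priority first
        for slot in slots:                  # walk the mm slots in queue order
            if slot.hints[prio]:            # slots without hints are skipped
                # The hint is consumed whether or not the collapse succeeds.
                return slot.name, prio, slot.hints[prio].popleft()
    return None


def drain(slots):
    """Handle hints until none are left and return the order they were seen."""
    order = []
    while (hint := next_hint(slots)) is not None:
        order.append(hint)
    return order


# Rebuild the state from the diagram: process A and process B.
slot_a, slot_b = MmSlot("mm_slot0"), MmSlot("mm_slot1")
slot_a.hints[0].extend(["hint0", "hint1", "hint2", "hint3"])
slot_b.hints[0].extend(["hint4", "hint5", "hint6"])
slot_b.hints[1].extend(["hint7", "hint8"])
slots = [slot_a, slot_b]
order = drain(slots)
```

With the diagram's state, the hints come out as hint0 through hint8 in sequence: mm_slot0 is drained first at prio 0, then mm_slot1, and at prio 1 the empty mm_slot0 is skipped.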
>
> khugepaged_add_collapse_hint() is for adding hints, and it is only called
> by walk_mm() and lru_gen_look_around() right now. In the future we may
> call it in more scenarios where we find hot memory areas, for example in DAMON.
>
> We tested the performance by using valkey-server (based on redis) together with
> memtier_benchmark to simulate a Gaussian distribution on the get/set operations on
> a 160G, 64-core x86 VM. The dataset is about 3G. After preloading the db, the test
> parameters were as follows:
> memtier_benchmark -s 127.0.0.1 -p 6379 \
> --ratio=1:1 \
> --key-pattern=G:G \
> --key-minimum=1 --key-maximum=3000000 \
> --key-median=2000000 \
> --key-stddev=150000 \
> -d 1024 \
> -t 1 -c 10 \
> -n 2500000 \
> --pipeline=32 \
> --hide-histogram
>
> Since we wanted to see the influence of khugepaged collapse hints on the reduction of
> TLB misses, we made khugepaged scan every 1 second, and used the userspace
> interface to run walk_mm() every 2 seconds for the cgroup containing valkey-server.
> We made sure the server used only 4k pages before we ran the test, and that only
> khugepaged could collapse them into large folios. We enabled anonymous THP of order 9,
> which is PMD size in most setups. We used perf stat to monitor the TLB miss statistics.
>
> After repeated tests, we saw a 13.50% reduction in dTLB-load-misses and a 5%
> reduction in dTLB-store-misses compared to the setup without any collapse
> hint. The final throughput of memtier_benchmark improved by about 2% to 5%
> on average, which is less pronounced than the TLB miss reduction. We believe
> that is because too many factors influence the final result of a random
> redis test, so the effect of TLB misses on the final throughput was masked by
> other factors.
>
> Patch Details:
> ========
> * Patch 1 adds the basic khugepaged hint framework introduced above.
>   Details can be found in the commit itself and in the code comments.
> * Patch 2 adds a slab cache for khugepaged_collapse_hint, which
>   improves the performance of allocating and freeing the hints.
> * Patch 3 adds a deduplication mechanism for the hints so that we do
>   not add a hint that points to a repeated address.
> * Patch 4 adds accounting for successful collapses initiated by
>   hint or non-hint.
> * Patch 5 adds the collapse hint in lru_gen_look_around() and walk_mm()
>   of mglru.
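
The deduplication idea from Patch 3 can be modelled in a few lines of userspace Python. This is only a sketch under stated assumptions, not the patch's implementation: `HintQueue` is a hypothetical name, and keying hints by a 2 MiB-aligned base address is an assumption (the cover letter only says a hint for a repeated address is not added).

```python
from collections import deque

# Assumption: PMD-sized THP of 2 MiB, as on x86-64 with 4 KiB base pages.
HPAGE_PMD_SIZE = 2 << 20


class HintQueue:
    """Hypothetical FIFO of pending collapse hints with duplicate filtering."""

    def __init__(self):
        self._pending = set()   # aligned bases that already have a queued hint
        self._queue = deque()   # hints in arrival order

    def add(self, addr):
        """Queue a hint for addr; return False if its range is already queued."""
        base = addr & ~(HPAGE_PMD_SIZE - 1)  # round down to the region start
        if base in self._pending:
            return False
        self._pending.add(base)
        self._queue.append(base)
        return True

    def pop(self):
        """Take the oldest hint; its range may be hinted again afterwards."""
        base = self._queue.popleft()
        self._pending.discard(base)
        return base

    def __len__(self):
        return len(self._queue)
```

In this model a second hint inside an already queued 2 MiB range is dropped, and the range becomes eligible again once its hint has been handled.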
>
> Thanks for reading. Comments and suggestions are very welcome!
Hi Luka,
I haven't reviewed the code yet, but the overall concept is
interesting (it should probably be an RFC first, but that's
fine).
I had future plans for something similar as part of the thp=auto work;
however that requires significant thought and investigation into how
we can properly gather hints for collapse/split THP candidates. From
my perspective we'd want a more global structure/system outside of
khugepaged, that would directly call khugepaged (and others like
split, etc). It would also tie into the allocator so that at fault
time it could leverage the hints to make better decisions. My fear
with this series is that making a decision now might complicate future
work by adding complexity we may eventually want to remove for a
better solution.
If you have the chance perhaps you can lead a discussion on your
proposal at the biweekly MM alignment session.
+David Rientjes as he leads those discussions. We could use that time
to lay out a plan for what needs to be done for this work, and for the
work surrounding thp=auto as I believe they will be interdependent :)
Cheers,
-- Nico
>
> Signed-off-by: Luka Bai <lukabai@xxxxxxxxxxx>
> ---
> Luka Bai (5):
> mm/khugepaged: add framework for khugepaged collapse hint
> mm/khugepaged: use slab cache instead of normal kmalloc
> mm/khugepaged: add deduplication when adding new collapse hint
> mm/khugepaged: add accounting for successful hint or non-hint collapse
> mm/khugepaged: add khugepaged collapse hint in mglru reference checking
>
> include/linux/huge_mm.h | 2 +
> include/linux/khugepaged.h | 20 ++
> include/linux/mmzone.h | 17 +-
> mm/huge_memory.c | 4 +
> mm/khugepaged.c | 460 ++++++++++++++++++++++++++++++++++++++++++++-
> mm/rmap.c | 27 ++-
> mm/vmscan.c | 33 +++-
> 7 files changed, 549 insertions(+), 14 deletions(-)
> ---
> base-commit: e1af79f3291a268adf4e149e1faba3052743e898
> change-id: 20260530-thp_collapse_hint-ec92bd943797
>
> Best regards,
> --
> Luka Bai <lukabai@xxxxxxxxxxx>
>