Re: [RFC PATCH 0/8] Introduce Reserved THP

From: Qi Zheng

Date: Mon Jun 29 2026 - 06:17:28 EST


Hi Matthew,

Thanks a lot for your feedback!

On 6/29/26 11:46 AM, Matthew Wilcox wrote:
On Sat, Jun 27, 2026 at 03:21:48PM +0800, Qi Zheng wrote:
This RFC patchset introduces a new feature called "Reserved THP", and I'd like
to open up a discussion on how to use this as a stepping stone toward unifying
HugeTLB and THP (Transparent Huge Page).

I'm really happy you're looking into this. I'm not terribly familiar
with the page allocator code, so I don't have any comments on the
patches themselves, but I do have a few on your approach.

This is also what I am hoping for. The current version of the code is
just a proof of concept (PoC) to facilitate discussion. The real goal is
to use reserved THP as a stepping stone to discuss the challenges of
unifying HugeTLB and THP, and the overall evolution path.

Of course, swap support is a key part too. ;)


Therefore, we are wondering if we can introduce "reserved THP", which is THP
that can be reserved. It can be consumed through methods like madvise(), while
normal memory allocation cannot consume it. This can achieve an effect similar
to hugetlb. And because it is THP, it can relatively easily support swap
features, which perfectly solves the above problem.

As I understand it, hugetlbfs reserves on mmap().

Exactly, hugetlbfs reserves HugeTLB pages at mmap() time:

hugetlbfs_file_mmap
--> hugetlb_reserve_pages

and it's the same without using hugetlbfs:

hugetlb_file_setup
--> hugetlb_reserve_pages

Using madvise() as the example is based on the following considerations:

1. It closely aligns with the existing usage patterns of THP madvise
mode.
2. To properly support swap, we actually need to allow overcommit before
actual page faults occur. This allows us to perform memory reclaim
during the page fault, swapping out cold reserved THP to satisfy the
memory demands of a new process. So we can't directly pre-reserve the
reserved THP at mmap/madvise time.

The second point seems to be a challenge that HugeTLB would also face if
it were to support swap. Perhaps reserved THP could be designed with two
modes:

1. with swap support: using the current madvise method.
2. without swap support: in this mode, we can directly let hugetlbfs
reserve the reserved THP at mmap() time. The behavior remains the
same, purely switching the underlying backend.

But this might muddy the semantics a bit...


This RFC wants to discuss another implementation:

1. Introduce a new migratetype: MIGRATE_RESERVED_THP.
2. Introduce two new hugetlb-like kernel boot parameters: `thp_reserved_size`
and `thp_reserved_nr`. When set, the required memory is marked as
MIGRATE_RESERVED_THP and put back into the buddy allocator.
3. Introduce a new madvise parameter: `MADV_RESERVED_THP`. Pages marked as
MIGRATE_RESERVED_THP can only be consumed via `madvise(MADV_RESERVED_THP)`.
Other normal memory allocations cannot consume MIGRATE_RESERVED_THP memory.
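
A minimal userspace sketch of the madvise-based usage pattern in item 3,
under stated assumptions: MADV_RESERVED_THP is not in mainline headers and
its value is not given here, so the sketch uses the existing MADV_HUGEPAGE
advice as a stand-in. The helper name map_with_thp_advice() is hypothetical.
With the RFC applied, the advice constant would presumably be swapped for
MADV_RESERVED_THP.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Stand-in advice: MADV_HUGEPAGE is the existing THP madvise mode.
 * The RFC's MADV_RESERVED_THP is not in mainline headers, so it is
 * deliberately not defined here.
 */
#define THP_ADVICE MADV_HUGEPAGE

/*
 * Hypothetical helper: map len bytes of anonymous memory and apply the
 * advice. Returns the mapping, or NULL if mmap() fails. *advised is set
 * to 1 if madvise() succeeded, 0 otherwise (e.g. a kernel without THP).
 */
void *map_with_thp_advice(size_t len, int *advised)
{
    assert(len > 0 && advised != NULL);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* No page is allocated yet; that only happens on the first fault. */
    *advised = (madvise(p, len, THP_ADVICE) == 0);
    return p;
}
```

Note that neither mmap() nor madvise() allocates or reserves a page in this
sketch; allocation is deferred to the first page fault.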

This can achieve a reservation effect similar to HugeTLB and guarantee
allocation success.

I think this is an interesting approach. I don't think it should be too
hard to migrate existing hugetlbfs users to it.

That is also what I hope to see.


3. Future Plans
===============

3.1 Enhance swap-out and swap-in for large folios
-------------------------------------------------

Currently, for swap-out, THP_SWAP is supported, but it only tries to swap out
the THP folio as a whole. It is still possible to be forced to split in some
situations (e.g., fragmented swap space, memory.swap.max limits, etc). For
swap-in, it is almost impossible to directly swap in the THP folio as a whole.

But for reserved THP, splitting is not allowed. We need to ensure that it
remains a whole huge page during swap-out and swap-in, to achieve a function
similar to hugetlb swap.

So I think the current restriction is something that needs to be fixed
anyway. It doesn't actually make sense that a folio must be written out
contiguously; filesystems do not have this restriction. I understand

Hopefully, there won't be too much pushback.

why swap currently has this limitation, but I'm hoping it gets removed
at some point. I'm not sure if the people working on swap right now
intend to fix this. They're already on the cc, so I hope they chime in.

+1.

Hi SWAP folks, how hard would it be to get this implemented? Are there
any current plans for this? ;)


3.2 Integrate reserved THP into the common reclaim path
-------------------------------------------------------

Once swap-in and swap-out of huge pages can be supported without splitting,
reserved THP can be integrated into the common reclaim path as a normal LRU
folio for memory reclamation. This fills the gap of the hugetlb swap function.

Hm. Then what does "reserved THP" mean if they can be swapped out?

Indeed, it is a bit weird.

In this version, what's actually reserved is essentially a memory pool.
After a reserved THP page is swapped out, the space in the pool might be
consumed by someone else. So, there's no guarantee that this reserved
THP page can be successfully swapped back in.
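
The pool semantics described above can be illustrated with a toy model. This
is only a sketch for discussion, not code from the patchset: struct thp_pool,
pool_take() and pool_swap_out() are hypothetical names, and the pool is
reduced to a counter of free huge-page slots.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: the reserved pool is just a count of free huge-page slots. */
struct thp_pool {
    unsigned int capacity;   /* slots reserved at boot */
    unsigned int free_slots; /* slots currently unused */
};

/* Take one slot for a page fault or a swap-in; fails if the pool is empty. */
bool pool_take(struct thp_pool *pool)
{
    if (pool->free_slots == 0)
        return false;
    pool->free_slots--;
    return true;
}

/* Swap-out returns the slot to the pool; it is not tied to its old owner. */
void pool_swap_out(struct thp_pool *pool)
{
    assert(pool->free_slots < pool->capacity);
    pool->free_slots++;
}
```

With a one-slot pool, task A faults in a page, is swapped out, and task B
takes the freed slot; A's later swap-in then finds the pool empty, which is
the "no guarantee" case described above.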

But if we don't want it swapped out, it can be guaranteed via mlock or
GUP.


3.4 Use reserved THP as a backend for hugetlbfs
-----------------------------------------------

This would allow existing hugetlb users or applications to seamlessly switch to
reserved THP.

If this is the end goal, then I think introducing new command line
options is probably the wrong approach right now. Instead, "reserved
THPs" should be allocated from the same pool as hugetlb reserve. That
way we're not jerking sysadmins around.

Do you mean reusing the existing HugeTLB boot parameters instead of
introducing new ones? That seems quite difficult to implement during the
transition. My idea is that we can eventually drop the HugeTLB boot
parameters entirely, so the system will still end up with only one set
of parameters. ;)


3.5 Add 1GB page support to reserved THP
----------------------------------------

Historically, there have been several attempts to add 1GB huge page support to
THP:

1. https://lore.kernel.org/linux-mm/20260202005451.774496-1-usamaarif642@xxxxxxxxx/
2. https://lore.kernel.org/linux-mm/20210224223536.803765-1-zi.yan@xxxxxxxx/

Adding 1GB huge page support to reserved THP would be simpler than adding
it to regular THP.

Well. Maybe? What happens if we mmap() 16GiB,

At least the side effects are limited strictly to reserved THPs, and
reserved THP is pre-reserved, ensuring a higher allocation success rate.

madvise(USE_RESERVED_THPS) and then munmap() the first 4KiB of it?

Since splitting is not allowed for reserved THPs, the entire huge page
will be freed at munmap time.
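
As a worked example of the munmap() case, assuming the "free the entire huge
page" behavior described above and a mapping that starts on a huge-page
boundary: the freed range is the unmapped range expanded outward to
huge-page boundaries. The helper freed_range() is hypothetical and only does
the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open byte range [start, end). */
struct range {
    uint64_t start;
    uint64_t end;
};

/*
 * Expand [start, start + len) outward to huge_size boundaries.
 * huge_size must be a power of two (e.g. 1 GiB).
 */
struct range freed_range(uint64_t start, uint64_t len, uint64_t huge_size)
{
    assert(len != 0);
    assert(huge_size != 0 && (huge_size & (huge_size - 1)) == 0);

    struct range r;
    r.start = start & ~(huge_size - 1);
    r.end = (start + len + huge_size - 1) & ~(huge_size - 1);
    return r;
}
```

For a 16GiB mapping backed by 1GiB pages, unmapping the first 4KiB gives the
range [0, 1GiB) relative to the start of the mapping, so under this
assumption a full 1GiB is freed and 15GiB stays mapped.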


3.6 Remove Hugetlb
------------------

Once reserved THP can completely replace the existing functions of hugetlb, we
can gradually remove Hugetlb, leaving only one huge page management system in
the kernel.

We also need mshare to land ... but yes, eventually removing hugetlbfs

mshare? Do you mean CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING?

is my hope.

+1.

Thanks,
Qi