Re: [RFC PATCH 0/8] Introduce Reserved THP
From: David Hildenbrand (Arm)
Date: Mon Jun 29 2026 - 08:25:06 EST
On 6/27/26 09:21, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>
>
> Hi all,
>
Hi,
> This RFC patchset introduces a new feature called "Reserved THP", and I'd like
> to open up a discussion on how to use this as a stepping stone toward unifying
> HugeTLB and THP (Transparent Huge Page).
>
> 1. Background
> =============
>
> Currently, two huge page solutions co-exist in the kernel:
>
> 1. HugeTLB: Supports reservation, guaranteeing successful allocation within the
> reserved pool. However, it does not support features like swap. And
> it is a relatively independent subsystem.
> 2. THP: Does not support reservation, and may fail to allocate and fall back to
> small pages when system memory is fragmented, but it is more tightly
> integrated with mm core and supports features like swap.
>
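For anyone following along who hasn't used both, this is roughly what the two
existing interfaces look like from userspace. A minimal sketch, not taken from
the patchset: the helper names map_hugetlb() and map_thp_hint() are made up
for illustration; only mmap(), madvise(), MAP_HUGETLB and MADV_HUGEPAGE are
real.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * HugeTLB: an explicit request. Returns NULL when the mapping cannot be
 * set up, e.g. the hugetlb pool has no free pages or len is not a
 * multiple of the huge page size.
 */
static void *map_hugetlb(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * THP: only a hint. The mapping itself normally succeeds, but the kernel
 * may still back it with small pages when memory is fragmented.
 */
static void *map_thp_hint(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	/* Best effort: ignore failure, e.g. when THP is not configured. */
	(void)madvise(p, len, MADV_HUGEPAGE);
	return p;
}
```

The asymmetry is the point: the first call can fail outright, the second
succeeds but gives no guarantee about what backs the mapping.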
> Both have their pros and cons. However, in one of our internal scenarios, it
> seems we need to combine the features of both to meet the requirements.
>
> In our internal scenario, a user process needs to reserve double the amount
> of Hugetlb memory due to hot-upgrade requirements. For example, if the
> process needs 16GB of Hugetlb, an additional 16GB is required during the
> hot-upgrade to satisfy memory allocations. After the upgrade, the old
> process exits and releases the 16GB of HugeTLB. Therefore, in most cases,
> the extra 16GB of HugeTLB is wasted.
>
> A straightforward idea is to use the Hugetlb CMA feature, reserving a total
> of 32GB of hugetlb_cma. During normal operation, 16GB is consumed, and the
> remaining 16GB can be used by other processes. During hot-upgrade, we could
> try to migrate the memory used by other processes to allocate the required
> extra 16GB of Hugetlb. This might work, but it still requires reserving 32GB
> of memory.
>
> We also found that during the hot upgrade, about 10GB of the old process's
> hugetlb is actually cold memory, which could theoretically be reclaimed. In
> extreme cases, we could reserve only 22GB of memory and reclaim the
> remaining 10GB during the hot upgrade. But unfortunately, hugetlb currently
> does not support swap, and supporting it seems quite difficult.
>
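Just to spell out the numbers above, as I read them: the peak is two working
sets minus whatever cold memory of the old process can be reclaimed. A
trivial sketch; the function name peak_reservation_gb() is hypothetical.

```c
#include <assert.h>

/*
 * Memory (in GB) to set aside for a hot upgrade: the old and the new
 * process each need working_set_gb, minus the cold memory of the old
 * process that could be reclaimed while the upgrade runs.
 */
static unsigned int peak_reservation_gb(unsigned int working_set_gb,
					unsigned int reclaimable_gb)
{
	/* Cannot reclaim more than the old process actually holds. */
	assert(reclaimable_gb <= working_set_gb);
	return 2 * working_set_gb - reclaimable_gb;
}
```

With the figures quoted above, peak_reservation_gb(16, 0) gives the 32GB case
and peak_reservation_gb(16, 10) gives the 22GB case.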
> Therefore, we are wondering if we can introduce "reserved THP", which is THP
> that can be reserved. It can be consumed through methods like madvise(), while
> normal memory allocation cannot consume it.
madvise(). Gah. No :)
> This can achieve an effect similar
> to hugetlb. And because it is THP, it can relatively easily support swap
> features, which perfectly solves the above problem.
No, this is the wrong approach. We really shouldn't repeat hugetlb's mistake of
supporting reservation of non-file-backed (IOW, anonymous) memory.
And even for files, the hugetlb mechanism is an absolute trainwreck, because
it's not NUMA aware.
This really needs some proper thought.
>
> Additionally, in 2024 (or possibly earlier), there have been discussions about
> the possibility of unifying Hugetlb and THP:
>
> Link: https://lwn.net/Articles/974491/
>
> After all, hugetlb's management is relatively independent and requires too
> much special handling in mm core. The introduction of reserved THP might be
> an opportunity. In the future, reserved THP could be enhanced to support
> various hugetlb features, such as acting as a backend for hugetlbfs. When
> reserved THP can completely replace HugeTLB, HugeTLB could be entirely
> removed, and reserved THP would just become a feature of THP.
>
> 2. Implementation
> =================
>
> In 2024, Yu Zhao proposed a similar idea:
>
> Link: https://lore.kernel.org/all/20240229183436.4110845-2-yuzhao@xxxxxxxxxx/
>
> The idea was to introduce two virt zones: ZONE_NOSPLIT and ZONE_NOMERGE to
> guarantee the allocation success rate of THP, achieving an effect similar to
> reservation. However, it seems there was no further progress, perhaps because of
> reluctance to introduce more virt zones like ZONE_MOVABLE.
>
> This RFC wants to discuss another implementation:
>
> 1. Introduce a new migratetype: MIGRATE_RESERVED_THP.
> 2. Introduce two new hugetlb-like kernel boot parameters: `thp_reserved_size`
> and `thp_reserved_nr`. When set, the required memory is marked as
> MIGRATE_RESERVED_THP and put back into the buddy allocator.
I'm all for some mechanism to make runtime allocation of large chunks of memory
easier, by adding a pool from where multiple consumers (THP, guest_memfd,
hugetlb, whatever) can allocate memory.
Call me very skeptical of getting the page allocator involved like this. (I hate it)
> 3. Introduce a new madvise parameter: `MADV_RESERVED_THP`. Pages marked as
> MIGRATE_RESERVED_THP can only be consumed via `madvise(MADV_RESERVED_THP)`.
> Other normal memory allocations cannot consume MIGRATE_RESERVED_THP memory.
Definitely no.
>
> This can achieve a reservation effect similar to HugeTLB and guarantee
> allocation success.
>
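To make sure we are talking about the same thing, here is the gating from the
three steps above as I understand it, written as a toy userspace model. This
is not kernel code and not from the patches: every name is invented, and I am
assuming that opted-in requests draw only from the reserved blocks and do not
fall back to the normal free lists.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for migratetypes; not the kernel's enum. */
enum toy_migratetype { TOY_MOVABLE, TOY_RESERVED_THP, TOY_NR_TYPES };

struct toy_free_lists {
	size_t free[TOY_NR_TYPES];	/* free huge-page-sized blocks per type */
};

/*
 * Take one block. Only requests that opted in (the madvise() path in the
 * proposal) may consume reserved blocks; everything else is limited to
 * the normal list.
 */
static bool toy_alloc(struct toy_free_lists *fl, bool reserved_request)
{
	enum toy_migratetype mt =
		reserved_request ? TOY_RESERVED_THP : TOY_MOVABLE;

	if (fl->free[mt] == 0)
		return false;
	fl->free[mt]--;
	return true;
}
```

Note what the model does not have: nothing in it knows which NUMA node a
block lives on or which memcg is asking.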
> 3. Future Plans
> ===============
>
> 3.1 Enhance swap-out and swap-in for large folios
> -------------------------------------------------
>
> Currently, for swap-out, THP_SWAP is supported, but it only tries to swap out
> the THP folio as a whole. It is still possible to be forced to split in some
> situations (e.g., fragmented swap space, memory.swap.max limits, etc). For
> swap-in, it is almost impossible to directly swap in the THP folio as a whole.
>
> But for reserved THP, splitting is not allowed. We need to ensure that it
> remains a whole huge page during swap-out and swap-in, to achieve a function
> similar to hugetlb swap.
>
>
> 3.2 Integrate reserved THP into the common reclaim path
> -------------------------------------------------------
>
> Once swap-in and swap-out of huge pages can be supported without splitting,
> reserved THP can be integrated into the common reclaim path as a normal LRU
> folio for memory reclamation. This fills the gap of the hugetlb swap function.
>
> 3.3 Use reserved THP as a backend for shmem/tmpfs
> -------------------------------------------------
>
> This would allow shared or file-like usage to utilize reserved THP.
>
Really, any kind of reservation should be file-centric and have some level of
control.
And soon the question would pop up "but how can we control this inside memcgs".
This all needs some thought.
> 3.4 Use reserved THP as a backend for hugetlbfs
> -----------------------------------------------
>
> This would allow existing hugetlb users or applications to seamlessly switch to
> reserved THP.
You are really talking about a memory pool that can be used by different consumers.
I raised that in the past in the context of guest_memfd, whereby the short-term
plan is to take pages from hugetlb's pool, when really there should be a global
pool that can be consumed by various consumers.
A lot of questions around that.
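Very roughly, and only to illustrate the shape of the idea, such a pool would
track free chunks per NUMA node and account usage per consumer. A toy
userspace sketch under my own assumptions (all names hypothetical, strict
node binding with no fallback, no memcg accounting):

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_NODES 4

enum toy_consumer { TOY_THP, TOY_GUEST_MEMFD, TOY_HUGETLB, TOY_NR_CONSUMERS };

struct toy_shared_pool {
	size_t free[TOY_MAX_NODES];	/* free chunks per NUMA node */
	size_t used[TOY_NR_CONSUMERS];	/* chunks currently held per consumer */
};

/* Take one chunk from the given node on behalf of a consumer. */
static bool toy_pool_get(struct toy_shared_pool *pool,
			 enum toy_consumer who, int node)
{
	if (node < 0 || node >= TOY_MAX_NODES || pool->free[node] == 0)
		return false;
	pool->free[node]--;
	pool->used[who]++;
	return true;
}

/* Return one chunk to the given node; ignored if the consumer holds none. */
static void toy_pool_put(struct toy_shared_pool *pool,
			 enum toy_consumer who, int node)
{
	if (node < 0 || node >= TOY_MAX_NODES || pool->used[who] == 0)
		return;
	pool->used[who]--;
	pool->free[node]++;
}
```

Even this trivial version raises the open questions: who sizes the per-node
pools, who may drain them, and how limits are enforced.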
>
> 3.5 Add 1GB page support to reserved THP
> ----------------------------------------
>
> Historically, there have been several attempts to add 1GB huge page support to
> THP:
>
> 1. https://lore.kernel.org/linux-mm/20260202005451.774496-1-usamaarif642@xxxxxxxxx/
> 2. https://lore.kernel.org/linux-mm/20210224223536.803765-1-zi.yan@xxxxxxxx/
>
> Adding 1GB huge page support for reserved THP would be relatively simpler
> compared to regular THP.
And that's what I told Usama: start with 1 GiB THP support for shmem/tmpfs, and
make it configurable.
How we would add a reservation mechanism is a good question, because hugetlb
reservation is a broken concept. And anything that's not NUMA or memcg aware
will be a broken concept, I'm afraid.
>
> 3.6 Remove Hugetlb
> ------------------
>
> Once reserved THP can completely replace the existing functions of hugetlb, we
> can gradually remove Hugetlb, leaving only one huge page management system in
> the kernel.
I'm sorry, but no way this will work in any reasonable timeframe unless you
mimic the exact user facing ABI -- and I don't think we'll gain a lot that way.
I know, we all like to dream, but this just isn't feasible.
--
Cheers,
David