Re: [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization
From: Muchun Song
Date: Fri Dec 12 2025 - 01:47:18 EST
> On Dec 11, 2025, at 23:08, Kiryl Shutsemau <kas@xxxxxxxxxx> wrote:
>
> On Wed, Dec 10, 2025 at 11:39:24AM +0800, Muchun Song wrote:
>>
>>
>>> On Dec 9, 2025, at 22:44, Kiryl Shutsemau <kas@xxxxxxxxxx> wrote:
>>>
>>> On Tue, Dec 09, 2025 at 02:22:28PM +0800, Muchun Song wrote:
>>>> The prerequisite is that the starting address of vmemmap must be aligned to
>>>> 16MB boundaries (for 1GB huge pages). Right? We should add some checks
>>>> somewhere to guarantee this (not compile time but at runtime like for KASLR).
>>>
>>> I have hard time finding the right spot to put the check.
>>>
>>> I considered something like the patch below, but it is probably too late
>>> if we boot preallocating huge pages.
>>>
>>> I will dig more later, but if you have any suggestions, I would
>>> appreciate.
>>
>> If you opt to record the mask information, then even when HVO is
>> disabled compound_head will still compute the head-page address
>> by means of the mask. Consequently this constraint must hold for
>> **every** compound page.
>>
>> Therefore adding your code in hugetlb_vmemmap.c is not appropriate:
>> that file only turns HVO off, yet the calculation remains broken
>> for all other large compound pages.
>>
>> From MAX_FOLIO_ORDER we know that folio_alloc_gigantic() can allocate
>> at most 16 GB of physically contiguous memory. We must therefore
>> guarantee that the vmemmap area starts on an address aligned to at
>> least 256 MB.
>>
>> When KASLR is disabled the vmemmap base is normally fixed by a
>> macro, so the check can be done at compile time; when KASLR is enabled
>> we have to ensure that the randomly chosen offset is a multiple
>> of 256 MB. These two spots are, in my view, the places that need
>> to be changed.
>>
>> Moreover, this approach requires the virtual addresses of struct
>> page (possibly spanning sections) to be contiguous, so the method is
>> valid **only** under CONFIG_SPARSEMEM_VMEMMAP.
>>
>> Also, when I skimmed through the overall patch yesterday, one detail
>> caught my eye: the shared tail page is **not** "per hstate"; it is
>> "per hstate, per zone, per node", because the zone and node
>> information is encoded in the tail page’s flags field. We should make
>> sure both page_to_nid() and page_zone() work properly.
>
> Right. Or we can slap compound_head() inside them.
At the same time, to keep users from accidentally passing a struct page
hand-crafted on the stack (like the one from snapshot_page()) to
compound_head(), shall we add a VM_BUG_ON() in compound_head() to validate
that the page address falls within the vmemmap range? Otherwise,
compound_head() would return an invalid head struct page (an address on
the stack holding arbitrary data).
>
> I stepped onto VM_BUG_ON_PAGE() in get_pfnblock_bitmap_bitidx().
> Workarounded with compound_head() for now.
I don’t see why you singled out get_pfnblock_bitmap_bitidx(); what is
special about that spot?
>
> I am not sure if we want to allocate them per-zone. Seems excessive.
Yes. If we can make page_to_nid() and page_zonenum() work, it does not
need to be per-zone.
> But per-node is reasonable.
Agree.
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov