Re: [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization

From: Kiryl Shutsemau

Date: Thu Dec 11 2025 - 10:08:15 EST

On Wed, Dec 10, 2025 at 11:39:24AM +0800, Muchun Song wrote:
>
>
> > On Dec 9, 2025, at 22:44, Kiryl Shutsemau <kas@xxxxxxxxxx> wrote:
> >
> > On Tue, Dec 09, 2025 at 02:22:28PM +0800, Muchun Song wrote:
> >> The prerequisite is that the starting address of vmemmap must be aligned to
> >> 16MB boundaries (for 1GB huge pages). Right? We should add some checks
> >> somewhere to guarantee this (not compile time but at runtime like for KASLR).
> >
> > I have a hard time finding the right spot to put the check.
> >
> > I considered something like the patch below, but it is probably too late
> > if we boot preallocating huge pages.
> >
> > I will dig more later, but if you have any suggestions, I would
> > appreciate them.
>
> If you opt to record the mask information, then even when HVO is
> disabled compound_head will still compute the head-page address
> by means of the mask. Consequently this constraint must hold for
> **every** compound page.
>
> Therefore adding your code in hugetlb_vmemmap.c is not appropriate:
> that file only turns HVO off, yet the calculation remains broken
> for all other large compound pages.
>
> From MAX_FOLIO_ORDER we know that folio_alloc_gigantic() can allocate
> at most 16 GB of physically contiguous memory. We must therefore
> guarantee that the vmemmap area starts on an address aligned to at
> least 256 MB.
>
> When KASLR is disabled the vmemmap base is normally fixed by a
> macro, so the check can be done at compile time; when KASLR is enabled
> we have to ensure that the randomly chosen offset is a multiple
> of 256 MB. These two spots are, in my view, the places that need
> to be changed.
>
> Moreover, this approach requires the virtual addresses of struct
> page (possibly spanning sections) to be contiguous, so the method is
> valid **only** under CONFIG_SPARSEMEM_VMEMMAP.
>
> Also, when I skimmed through the overall patch yesterday, one detail
> caught my eye: the shared tail page is **not** "per hstate"; it is
> "per hstate, per zone, per node", because the zone and node
> information is encoded in the tail page’s flags field. We should make
> sure both page_to_nid() and page_zone() work properly.

Right. Or we can slap compound_head() inside them.

I hit a VM_BUG_ON_PAGE() in get_pfnblock_bitmap_bitidx().
I worked around it with compound_head() for now.

I am not sure if we want to allocate them per-zone. Seems excessive.
But per-node is reasonable.
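
For reference, the 16MB and 256MB alignment figures quoted above can be
checked with a rough user-space sketch. It assumes 4 KiB base pages and a
64-byte struct page (typical for x86-64, but an assumption here). The helper
names vmemmap_span(), head_by_mask() and base_ok() are hypothetical and exist
only for this illustration; they are not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only; real kernels may differ. */
#define SKETCH_PAGE_SIZE        4096ULL	/* 4 KiB base page */
#define SKETCH_STRUCT_PAGE_SIZE 64ULL	/* assumed sizeof(struct page) */

/* Bytes of vmemmap needed to describe one compound page of folio_bytes. */
static uint64_t vmemmap_span(uint64_t folio_bytes)
{
	return folio_bytes / SKETCH_PAGE_SIZE * SKETCH_STRUCT_PAGE_SIZE;
}

/*
 * Hypothetical mask-based head lookup: round a struct page address down
 * to the span boundary. span must be a power of two.
 */
static uint64_t head_by_mask(uint64_t page_addr, uint64_t span)
{
	return page_addr & ~(span - 1);
}

/* The mask only lands on the real head if the vmemmap base is span-aligned. */
static bool base_ok(uint64_t vmemmap_base, uint64_t span)
{
	return (vmemmap_base & (span - 1)) == 0;
}
```

With these assumptions, a 1 GB page spans 16 MB of vmemmap and a 16 GB folio
spans 256 MB. If the base is shifted by, say, 1 MB, head_by_mask() rounds a
tail address down to an address below the real head, which is why the base
(including any KASLR offset) has to be a multiple of the largest span.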

--
Kiryl Shutsemau / Kirill A. Shutemov