Re: [PATCH 2/2] mm/vmalloc: Add attempt_larger_order_alloc parameter

From: David Hildenbrand (Red Hat)

Date: Fri Dec 19 2025 - 03:33:43 EST

On 12/18/25 12:56, Ryan Roberts wrote:

+ David, Lorenzo, Matthew

Hoping someone might be able to explain to me how this all really works! :-|

On 18/12/2025 11:53, Ryan Roberts wrote:

On 18/12/2025 04:55, Dev Jain wrote:

On 17/12/25 8:50 pm, Ryan Roberts wrote:

On 17/12/2025 12:02, Uladzislau Rezki wrote:

On 16/12/2025 21:19, Uladzislau Rezki (Sony) wrote:

Introduce a module parameter to enable or disable the large-order
allocation path in vmalloc. High-order allocations are disabled by
default so far, but users may explicitly enable them at runtime if
desired.

High-order pages allocated for vmalloc are immediately split into
order-0 pages and later freed as order-0, which means they do not
feed the per-CPU page caches. As a result, high-order attempts tend
to bypass the PCP fastpath and fall back to the buddy allocator, which
can affect performance.

However, when the PCP caches are empty, high-order allocations may
show better performance characteristics, especially for larger
allocation requests.

I wonder if a better solution would be "allocate order-0 if available in pcp,
else try large order, else fall back to order-0". Could that provide the best of
all worlds without needing a configuration knob?

I am not sure; to me it looks a bit odd.

Perhaps it would feel better if it was generalized to "first try allocation from
PCP list, highest to lowest order, then try allocation from the buddy, highest
to lowest order"?
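A rough userspace sketch of that generalized ordering, purely for illustration. The `toy_alloc()` function, the `toy_source` enum and the per-order free counters are hypothetical stand-ins, not kernel interfaces, and the real PCP lists only cache a limited set of orders:

```c
#include <assert.h>

#define TOY_NR_ORDERS 4

enum toy_source { TOY_NONE, TOY_PCP, TOY_BUDDY };

/*
 * Try the (hypothetical) per-CPU free counts from highest to lowest order,
 * then the buddy free counts in the same way. On success, take one block,
 * store where it came from in *src and return its order. Return -1 if
 * nothing is available.
 */
int toy_alloc(int pcp_free[TOY_NR_ORDERS], int buddy_free[TOY_NR_ORDERS],
	      int max_order, enum toy_source *src)
{
	int order;

	assert(max_order >= 0 && max_order < TOY_NR_ORDERS);

	/* First pass: per-CPU lists, highest order first. */
	for (order = max_order; order >= 0; order--) {
		if (pcp_free[order] > 0) {
			pcp_free[order]--;
			*src = TOY_PCP;
			return order;
		}
	}

	/* Second pass: buddy allocator, highest order first. */
	for (order = max_order; order >= 0; order--) {
		if (buddy_free[order] > 0) {
			buddy_free[order]--;
			*src = TOY_BUDDY;
			return order;
		}
	}

	*src = TOY_NONE;
	return -1;
}
```

With only an order-0 page cached per-CPU, this model hands out that page first, even if the buddy side has larger blocks, and only then falls through to the buddy counters.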

Ideally it would be
good to just free it as a high-order page and not as order-0 pieces.

Yeah perhaps that's better. How about something like this (very lightly tested
and no performance results yet):

(And I should admit I'm not 100% sure it is safe to call free_frozen_pages()
with a contiguous run of order-0 pages, but I'm not seeing any warnings or
memory leaks when running mm selftests...)

Wow I wasn't aware that we can do this. I see that free_hotplug_page_range() in
arm64/mmu.c already does this - it computes order from size and passes it to
__free_pages().

Hmm that looks dodgy to me. But I'm not sure I actually understand what is going
on...

Prior to looking at this yesterday, my understanding was this: At the struct
page level, you can either allocate compound or non-compound. order-0 is
non-compound by definition. A high-order non-compound page is just a contiguous
set of order-0 pages, each with individual reference counts and other metadata.

Not quite. A high-order non-compound allocation will only use the refcount of page[0].

When not returning that memory in the same order to the buddy, we first have to split that high-order allocation. That will initialize the refcounts and split page-owner data, alloc tag tracking etc.
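To make the refcount point concrete, here is a small userspace toy model (not kernel code). The `toy_page` struct and the `toy_alloc_pages()` and `toy_split_page()` helpers are hypothetical names invented for illustration. They only mimic the refcount behaviour described above and ignore page-owner data, alloc tags and everything else a real split has to handle:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only the refcount is modelled. */
struct toy_page {
	int refcount;
};

/*
 * Model of a non-compound high-order allocation: only page[0] carries a
 * reference, the remaining sub-pages stay at refcount 0.
 */
void toy_alloc_pages(struct toy_page *page, unsigned int order)
{
	size_t i, nr = (size_t)1 << order;

	for (i = 0; i < nr; i++)
		page[i].refcount = 0;
	page[0].refcount = 1;
}

/*
 * Model of splitting: afterwards every sub-page has its own reference and
 * can be freed individually as an order-0 page.
 */
void toy_split_page(struct toy_page *page, unsigned int order)
{
	size_t i, nr = (size_t)1 << order;

	/* In this model, only a freshly allocated block may be split. */
	assert(page[0].refcount == 1);
	for (i = 1; i < nr; i++)
		page[i].refcount = 1;
}
```

In this model, freeing the unsplit block means dropping the single reference on page[0] with the original order, while after the split each sub-page is on its own.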

A compound page is one where all the pages are tied together and managed as one
- the metadata is stored in the head page and all the tail pages point to the
head (this concept is wrapped by struct folio).

But after looking through the comments in page_alloc.c, it would seem that a
non-compound high-order page is NOT just a set of order-0 pages, but they still
share some metadata, including a shared refcount?? alloc_pages() will return
one of these things, and __free_pages() requires the exact same unit to be
provided to it.

Right.

vmalloc calls alloc_pages() to get a non-compound high-order page, then calls
split_page() to convert to a set of order-0 pages. See this comment:

/*
 * split_page takes a non-compound higher-order page, and splits it into
 * n (1<<order) sub-pages: page[0..n]
 * Each sub-page must be freed individually.
 *
 * Note: this is probably too low level an operation for use in drivers.
 * Please consult with lkml before using this in your driver.
 */
void split_page(struct page *page, unsigned int order)

So just passing all the order-0 pages directly to __free_pages() in one go is
definitely not the right thing to do ("Each sub-page must be freed
individually"). They may have different reference counts so you can only
actually free the ones that go to zero surely?

Yes.

But it looked to me like free_frozen_pages() just wants a naturally aligned
power-of-2 number of pages to free, so my patch below is decrementing the
refcount on each struct page and accumulating the ones where the refcounts go to
zero into suitable blocks for free_frozen_pages().
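The grouping step alone can be sketched in userspace as follows. This is not the patch under discussion, just a minimal illustration of cutting a contiguous pfn range into naturally aligned power-of-2 blocks. `toy_block` and `toy_split_range()` are hypothetical names, and the refcount handling is left out entirely:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one block to free: first pfn and its order. */
struct toy_block {
	unsigned long pfn;
	unsigned int order;
};

/*
 * Cut the contiguous range [pfn, pfn + nr) into naturally aligned
 * power-of-2 blocks, greedily taking the largest aligned block that fits
 * at each step. Returns the number of blocks written to out, which is
 * capped at max_out.
 */
size_t toy_split_range(unsigned long pfn, unsigned long nr,
		       unsigned int max_order,
		       struct toy_block *out, size_t max_out)
{
	size_t n = 0;

	assert(out != NULL);

	while (nr > 0 && n < max_out) {
		unsigned int order = max_order;

		/* Shrink until the block is aligned and fits in what is left. */
		while (order > 0 &&
		       ((pfn & ((1UL << order) - 1)) != 0 ||
			(1UL << order) > nr))
			order--;

		out[n].pfn = pfn;
		out[n].order = order;
		n++;

		pfn += 1UL << order;
		nr -= 1UL << order;
	}
	return n;
}
```

For example, ten pages starting at pfn 3 come out as an order-0 block at pfn 3, order-2 blocks at pfns 4 and 8, and an order-0 block at pfn 12.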

So I *think* my patch is correct, but I'm not totally sure.

Free in the granularity you allocated. :)

Then we have the ___free_pages(), which I find very difficult to understand:

static void ___free_pages(struct page *page, unsigned int order,
			  fpi_t fpi_flags)
{
	/* get PageHead before we drop reference */
	int head = PageHead(page);
	/* get alloc tag in case the page is released by others */
	struct alloc_tag *tag = pgalloc_tag_get(page);

	if (put_page_testzero(page))
		__free_frozen_pages(page, order, fpi_flags);

We only test the refcount for the first page, then free all the pages. So that
implies that non-compound high-order pages share a single refcount? Or we just
ignore the refcount of all the other pages in a non-compound high-order page?

	else if (!head) {

What? If the first page still has references but it's a non-compound
high-order page (i.e. no head page) then we free all the trailing sub-pages
without caring about their references?

Again, free in the granularity we allocated.

		pgalloc_tag_sub_pages(tag, (1 << order) - 1);
		while (order-- > 0) {
			/*
			 * The "tail" pages of this non-compound high-order
			 * page will have no code tags, so to avoid warnings
			 * mark them as empty.
			 */
			clear_page_tag_ref(page + (1 << order));
			__free_frozen_pages(page + (1 << order), order,
					    fpi_flags);
		}
	}
}
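The index arithmetic of that while loop can be checked in isolation with a small userspace sketch. `toy_chunk` and `toy_tail_chunks()` are hypothetical names for illustration. The function only records which (offset, order) pairs the loop would pass on and frees nothing:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one freed chunk: offset from page[0] and order. */
struct toy_chunk {
	unsigned long offset;
	unsigned int order;
};

/*
 * Mirror the index arithmetic of the quoted while loop: for a block of
 * the given order, record each (offset, order) pair the loop would hand
 * to the free routine. Returns the number of chunks recorded.
 */
size_t toy_tail_chunks(unsigned int order, struct toy_chunk *out)
{
	size_t n = 0;

	assert(out != NULL);

	while (order-- > 0) {
		out[n].offset = 1UL << order;
		out[n].order = order;
		n++;
	}
	return n;
}
```

For an order-3 block this yields chunks at offsets 4, 2 and 1 with orders 2, 1 and 0, which together cover the seven pages after page[0] and leave page[0] itself untouched.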

For the arm64 case that you point out, surely __free_pages() is the wrong thing
to call, because it's going to decrement the refcount. But we are freeing based
on their presence in the pagetable and we never took a reference in the first place.

HELP!

Hope my input helped, not sure if I answered the real question? :)

--
Cheers

David