Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator

From: Dev Jain

Date: Thu Dec 11 2025 - 22:56:03 EST

On 11/12/25 9:54 pm, Uladzislau Rezki wrote:

On Thu, Dec 11, 2025 at 09:13:28PM +0530, Dev Jain wrote:

On 11/12/25 9:09 pm, Uladzislau Rezki wrote:

On Thu, Dec 11, 2025 at 03:28:56PM +0000, Ryan Roberts wrote:

On 10/12/2025 22:28, Vishal Moola (Oracle) wrote:

On Wed, Dec 10, 2025 at 01:21:22PM +0000, Ryan Roberts wrote:

Hi Vishal,

On 21/10/2025 20:44, Vishal Moola (Oracle) wrote:

Sometimes, vm_area_alloc_pages() will want many pages from the buddy
allocator. Rather than making requests to the buddy allocator for at
most 100 pages at a time, we can eagerly request large order pages a
smaller number of times.

We still split the large order pages down to order-0 as the rest of the
vmalloc code (and some callers) depend on it. We still defer to the bulk
allocator and fallback path in case of order-0 pages or failure.
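
To illustrate the request-count argument above, here is a minimal userspace sketch in C. It is not the kernel code and not the patch itself. The helper names, the greedy largest-order-first policy, and the max_order cap are assumptions made for illustration. Only the batch size of 100 comes from the description above. The sketch also does not model the fallback to the bulk allocator for order-0 pages.

```c
#include <stddef.h>

/* Batch size taken from the "at most 100 pages at a time" wording above. */
#define BULK_BATCH_PAGES 100u

/*
 * Hypothetical helper: number of requests needed when every request
 * returns at most BULK_BATCH_PAGES order-0 pages.
 */
unsigned int requests_bulk(unsigned int nr_pages)
{
	return (nr_pages + BULK_BATCH_PAGES - 1) / BULK_BATCH_PAGES;
}

/*
 * Hypothetical helper: number of requests needed when every request
 * greedily takes the largest power-of-two block (capped at max_order)
 * that still fits into the remaining page count.
 * Assumes max_order is small (well below 31) so the shifts do not overflow.
 */
unsigned int requests_large_order(unsigned int nr_pages, unsigned int max_order)
{
	unsigned int requests = 0;

	while (nr_pages > 0) {
		unsigned int order = 0;

		/* Grow the order while the next larger block still fits. */
		while (order < max_order && (2u << order) <= nr_pages)
			order++;

		nr_pages -= 1u << order;
		requests++;
	}

	return requests;
}
```

Assuming a 4 KiB page size, a 2 MB allocation is 512 pages: requests_bulk(512) gives 6 batched requests, while requests_large_order(512, 10) gives a single order-9 request. For sizes that are not a power of two the greedy count can exceed the batched count, so the sketch only illustrates the large-allocation case.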

Running 1000 iterations of allocations on a small 4GB system finds:

1000 2mb allocations:
	[Baseline]		[This patch]
	real	46.310s		real	0m34.582
	user	0.001s		user	0.006s
	sys	46.058s		sys	0m34.365s

10000 200kb allocations:
	[Baseline]		[This patch]
	real	56.104s		real	0m43.696
	user	0.001s		user	0.003s
	sys	55.375s		sys	0m42.995s

I'm seeing some big vmalloc micro benchmark regressions on arm64, for which
bisect is pointing to this patch.

Ulad had similar findings/concerns[1]. Tldr: The numbers you are seeing
are expected for how the test module is currently written.

Hmm... simplistically, I'd say that either the tests are bad, in which case they
should be deleted, or they are good, in which case we shouldn't ignore the
regressions. Having tests that we learn to ignore is the worst of both worlds.

Uh.. The tests are for measuring vmalloc performance and for stress testing. They
cannot just be removed :) In some sense they are synthetic; on the other hand, they
allow us to find problems and bottlenecks and to measure perf. You have identified
a regression with them :)

I think the problem is in the

+ 14.05% 0.11% [kernel] [k] remove_vm_area
+ 11.85% 1.82% [kernel] [k] __alloc_frozen_pages_noprof
+ 10.91% 0.36% [kernel] [k] __get_vm_area_node
+ 10.60% 7.58% [kernel] [k] insert_vmap_area
+ 10.02% 4.67% [kernel] [k] get_page_from_freelist

get_page_from_freelist() call. With the patch it adds 10% of cycles on
top, whereas without the patch I do not see the symbol at all, i.e. pages
are obtained really fast from the pcp list, not from the buddy.

The question is: why do high-order pages not end up in the pcp cache?
I think it is because we split such pages and free them as order-0
pages.

Please take a look at my RFC:

https://lore.kernel.org/all/20251112110807.69958-1-dev.jain@xxxxxxx/

You are right, we allocate large folios but then split them up and free
them as basepages. In patch 2 I have proved (not rigorously) that pcp
draining is one of the issues.

You sent out the RFC on 12 Nov :-/ I missed those two patches from you,
even though you put me in "To".

I appreciate you pointing me to your work. Let me have a look at it.

Could you please resend the RFC based on the latest code base?

Yup, I'll do that. I was trying to get some perf numbers from LTP - fsstress,
but the variance seems to be high on the system I am testing. I would
appreciate it if you or someone else could run some benchmarks (filesystem
workloads are what I believe would benefit).

--
Uladzislau Rezki