Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator

From: Dev Jain

Date: Thu Dec 11 2025 - 10:35:58 EST



On 11/12/25 8:58 pm, Ryan Roberts wrote:
> On 10/12/2025 22:28, Vishal Moola (Oracle) wrote:
>> On Wed, Dec 10, 2025 at 01:21:22PM +0000, Ryan Roberts wrote:
>>> Hi Vishal,
>>>
>>> On 21/10/2025 20:44, Vishal Moola (Oracle) wrote:
>>>> Sometimes, vm_area_alloc_pages() will want many pages from the buddy
>>>> allocator. Rather than making requests to the buddy allocator for at
>>>> most 100 pages at a time, we can eagerly request large order pages a
>>>> smaller number of times.
>>>>
>>>> We still split the large order pages down to order-0 as the rest of the
>>>> vmalloc code (and some callers) depend on it. We still defer to the bulk
>>>> allocator and fallback path in case of order-0 pages or failure.
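To make the discussion below easier to follow, the shape of the change is
roughly this (my sketch, not the actual patch; the real vm_area_alloc_pages()
also handles NUMA nodes, the bulk allocator and partial failure):

static unsigned int alloc_high_order_and_split(gfp_t gfp, unsigned int order,
					       struct page **pages)
{
	struct page *page;
	unsigned int i;

	/* One buddy request for 2^order pages instead of up to 100 singles. */
	page = alloc_pages(gfp | __GFP_NOWARN, order);
	if (!page)
		return 0;	/* caller falls back to the bulk/order-0 path */

	/* The rest of vmalloc (and some callers) expects order-0 pages. */
	split_page(page, order);
	for (i = 0; i < (1U << order); i++)
		pages[i] = page + i;

	return 1U << order;
}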

>>>> Running 1000 iterations of allocations on a small 4GB system shows:
>>>>
>>>> 1000 2MB allocations:
>>>>           [Baseline]      [This patch]
>>>>   real    0m46.310s       0m34.582s
>>>>   user    0m0.001s        0m0.006s
>>>>   sys     0m46.058s       0m34.365s
>>>>
>>>> 10000 200KB allocations:
>>>>           [Baseline]      [This patch]
>>>>   real    0m56.104s       0m43.696s
>>>>   user    0m0.001s        0m0.003s
>>>>   sys     0m55.375s       0m42.995s
>>> I'm seeing some big vmalloc micro benchmark regressions on arm64, for which
>>> bisect is pointing to this patch.
>> Ulad had similar findings/concerns [1]. Tl;dr: the numbers you are seeing
>> are expected given how the test module is currently written.
> Hmm... simplistically, I'd say that either the tests are bad, in which case
> they should be deleted, or they are good, in which case we shouldn't ignore
> the regressions. Having tests that we learn to ignore is the worst of both
> worlds.

AFAICR the test does some million-odd iterations by default, which is the real
problem. On my RFC [1] I noticed that reducing the iteration count reduces the
regression: up to some multiple of ten thousand iterations, the regression is
zero. Doing this alloc->free dance a million freaking times messes up the
buddy badly.

[1] https://lore.kernel.org/all/20251112110807.69958-1-dev.jain@xxxxxxx/
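For concreteness, the problematic test shape is roughly the following (my
paraphrase, not the actual test_vmalloc code; nr_pages stands in for the
per-test allocation size):

	for (i = 0; i < 500000; i++) {
		void *p = vmalloc(nr_pages * PAGE_SIZE);

		if (!p)
			return -ENOMEM;
		/*
		 * Freed immediately, so the next vmalloc() picks up the
		 * same freshly freed pages rather than a realistic mix.
		 */
		vfree(p);
	}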


> But I see your point about the allocation pattern not being very realistic.

>>> The tests are all originally from the vmalloc_test module. Note that (R)
>>> indicates a statistically significant regression and (I) indicates a
>>> statistically significant improvement.
>>>
>>> p is the number of pages in the allocation, h indicates huge mappings. So it
>>> looks like the regressions are all coming from the non-huge case, where we
>>> want to split to order-0.

>>> +---------------------------------+----------------------------------------------------------+------------+------------------------+
>>> | Benchmark                       | Result Class                                             | 6-18-0     | 6-18-0-gc2f2b01b74be   |
>>> +=================================+==========================================================+============+========================+
>>> | micromm/vmalloc                 | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          |  514126.58 | (R) -42.20%            |
>>> |                                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |  320458.33 | -0.02%                 |
>>> |                                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |  399680.33 | (R) -23.43%            |
>>> |                                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          |  788723.25 | (R) -23.66%            |
>>> |                                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          |  979839.58 | -1.05%                 |
>>> |                                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          |  481454.58 | (R) -23.99%            |
>>> |                                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          |  615924.00 | (I) 2.56%              |
>>> |                                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         | 1799224.08 | (R) -23.28%            |
>>> |                                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         | 2313859.25 | (I) 3.43%              |
>>> |                                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         | 3541904.75 | (R) -23.86%            |
>>> |                                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         | 3597577.25 | (R) -2.97%             |
>>> |                                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |  487021.83 | (I) 4.95%              |
>>> |                                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |  344466.33 | -0.65%                 |
>>> |                                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |  342484.25 | -1.58%                 |
>>> |                                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     | 4034901.17 | (R) -25.35%            |
>>> |                                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |  195973.42 | 0.57%                  |
>>> |                                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |  643489.33 | (R) -47.63%            |
>>> |                                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        | 2029261.33 | (R) -27.88%            |
>>> |                                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               |   83557.08 | -0.22%                 |
>>> +---------------------------------+----------------------------------------------------------+------------+------------------------+

>>> I have a couple of thoughts from looking at the patch:
>>>
>>> - Perhaps split_page() is the bulk of the cost? Previously for this case we
>>>   were allocating order-0 so there was no split to do. For h=1, split would
>>>   have already been called, so that would explain why there is no regression
>>>   for that case?
>> For h=1, this patch shouldn't change anything (as long as nr_pages <
>> arch_vmap_{pte,pmd}_supported_shift). This is why you don't see regressions
>> in those cases.
> arm64 supports 64K contiguous mappings with vmalloc, so once nr_pages >= 16
> we can take the huge path.
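For reference, a simplified sketch of the arm64 hook (cf.
arch/arm64/include/asm/vmalloc.h; the in-tree version carries a couple of
extra guards):

#define arch_vmap_pte_supported_shift arch_vmap_pte_supported_shift
static inline int arch_vmap_pte_supported_shift(unsigned long size)
{
	/* With 4K base pages, CONT_PTE_SIZE is 16 pages == 64K. */
	if (size >= CONT_PTE_SIZE)
		return CONT_PTE_SHIFT;	/* 64K contiguous-PTE mapping */

	return PAGE_SHIFT;		/* plain 4K pages */
}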

>>> - I guess we are bypassing the pcpu cache? Could this be having an effect?
>>>   Dev (cc'ed) did some similar investigation a while back and saw increased
>>>   vmalloc latencies when bypassing the pcpu cache.
>> I'd say this is more a case of this test module targeting the pcpu
>> cache. The module allocates then frees one at a time, which promotes
>> reusing pcpu pages. [1] has some numbers after modifying the test such
>> that all the allocations are made before freeing any.
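The batched variant in [1] looks roughly like this (again my paraphrase;
NR_ALLOCS and ptrs[] are illustrative names):

	/*
	 * Finish all allocations before freeing any, so consecutive
	 * vmalloc() calls cannot simply recycle the pages the previous
	 * iteration just returned.
	 */
	for (i = 0; i < NR_ALLOCS; i++) {
		ptrs[i] = vmalloc(nr_pages * PAGE_SIZE);
		if (!ptrs[i])
			break;
	}

	while (i--)
		vfree(ptrs[i]);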
> OK fair enough.

> We are seeing a bunch of other regressions in higher-level benchmarks too,
> but we haven't yet concluded what's causing those. I'll report back if this
> patch looks connected.

> Thanks,
> Ryan


>>> - Philosophically, is allocating physically contiguous memory when it is
>>>   not strictly needed the right thing to do? Large physically contiguous
>>>   blocks are a scarce resource, so we don't want to waste them. Although I
>>>   guess it could be argued that this actually preserves the contiguous
>>>   blocks because the lifetime of all the pages is tied together. Anyway, I
>>>   doubt this is the reason for the slow down, since those benchmarks are
>>>   not under memory pressure.
>> This was the primary incentive for this patch :)
>>>
>>> Anyway, it would be good to resolve the performance regressions if we can.
>> Imo, the appropriate way to address these is to modify the test module
>> as seen in [1].
>>
>> [1] https://lore.kernel.org/linux-mm/aPJ6lLf24TfW_1n7@milan/