Re: [REGRESSION] slab: replace cpu (partial) slabs with sheaves

From: Vlastimil Babka (SUSE)

Date: Thu Mar 26 2026 - 10:49:04 EST

On 3/26/26 13:43, Aishwarya Rambhadran wrote:
> Hi Vlastimil, Harry,

Hi!

> We have observed a few kernel performance benchmark regressions,
> mainly in perf & vmalloc workloads, when comparing v6.19 mainline
> kernel results against later releases in the v7.0 cycle.
> Independent bisections on different machines consistently point
> to commits within the slab percpu sheaves series. However, towards
> the end of the bisection, the signal becomes less clear, so it's
> not yet certain which specific commit within the series is the
> root cause.
>
> The workloads were triggered on AWS Graviton3 (arm64) & AWS Intel
> Sapphire Rapids (x86_64) systems in which the regressions are
> reproducible across different kernel release candidates.
> (R)/(I) mean statistically significant regression/improvement,
> where "statistically significant" means the 95% confidence
> intervals do not overlap.
>
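
For reference, the non-overlap criterion quoted above can be sketched as
follows. This is only an illustration: how the Fastpath tool actually
computes its intervals is not stated in this thread, so the
normal-approximation interval, the function names and the sample numbers
are all assumptions.

```python
import math
import statistics

# Two-sided 95% critical value under a normal approximation (assumption;
# a t-distribution would be more appropriate for very small sample counts).
Z_95 = 1.96


def confidence_interval(samples, z=Z_95):
    """Return (low, high) of a normal-approximation CI for the mean."""
    mean = statistics.mean(samples)
    # Standard error of the mean; stdev() uses the n-1 (sample) estimator.
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - z * sem, mean + z * sem


def classify(base, new, higher_is_better=True):
    """Return '(R)', '(I)' or '' using the non-overlapping-CI rule."""
    base_lo, base_hi = confidence_interval(base)
    new_lo, new_hi = confidence_interval(new)
    if new_hi < base_lo:  # new interval lies entirely below the base one
        return "(R)" if higher_is_better else "(I)"
    if new_lo > base_hi:  # new interval lies entirely above the base one
        return "(I)" if higher_is_better else "(R)"
    return ""  # intervals overlap: not statistically significant


# Hypothetical ops/sec samples, not taken from the report.
base_ops = [1200, 1205, 1198, 1203, 1201]
new_ops = [1115, 1120, 1110, 1118, 1117]
print(classify(base_ops, new_ops))
```

For the usec-based vmalloc results, where lower is better, the same helper
would be called with higher_is_better=False.
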
> Given below are the performance benchmark results generated by the
> Fastpath tool for different kernel -rc versions relative to the
> base version v6.19, executed on the SUTs mentioned above. The perf/
> syscall benchmarks (execve/fork) regress consistently by ~6–11% on
> both arm64 and x86_64 across v7.0-rc1 to rc5, while vmalloc
> workloads show smaller but stable regressions (~2–10%), particularly
> in kvfree_rcu paths.
>
> Regressions on AWS Intel Sapphire Rapids (x86_64) :

The table formatting is broken for me, can you resend it please? Maybe a
.txt attachment would work better.

> +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
> | Benchmark       | Result Class                                             | 6-19-0 (base)   | 7-0-0-rc1   | 7-0-0-rc2   | 7-0-0-rc2-gaf4e9ef3d784   | 7-0-0-rc3   | 7-0-0-rc4   | 7-0-0-rc5   |
> +=================+==========================================================+=================+=============+=============+===========================+=============+=============+=============+
> | micromm/vmalloc | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |       262605.17 |      -4.94% |      -7.48% |                (R) -8.11% |      -4.51% |      -6.23% |      -3.47% |
> |                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |       253198.67 |      -7.56% | (R) -10.57% |               (R) -10.13% |  (R) -7.07% |      -6.37% |      -6.55% |
> |                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |       197904.67 |      -2.07% |      -3.38% |                    -2.07% |      -2.97% |  (R) -4.30% |      -3.39% |
> |                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |      1707089.83 |      -2.63% |  (R) -3.69% |                (R) -3.25% |  (R) -2.87% |      -2.22% |  (R) -3.63% |
> +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
> | perf/syscall    | execve (ops/sec)                                         |         1202.92 |  (R) -7.15% |  (R) -7.05% |                (R) -7.03% |  (R) -7.93% |  (R) -6.51% |  (R) -7.36% |
> |                 | fork (ops/sec)                                           |          996.00 |  (R) -9.00% | (R) -10.27% |                (R) -9.92% | (R) -11.19% | (R) -10.69% | (R) -10.28% |
> +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
>
> Regressions on AWS Graviton3 (arm64) :
> +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
> | Benchmark       | Result Class                                             | 6-19-0 (base)   | 7-0-0-rc1   | 7-0-0-rc2   | 7-0-0-rc2-gaf4e9ef3d784   | 7-0-0-rc3   | 7-0-0-rc4   | 7-0-0-rc5   |
> +=================+==========================================================+=================+=============+=============+===========================+=============+=============+=============+
> | micromm/vmalloc | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |       320101.50 |  (R) -4.72% |  (R) -3.81% |                (R) -5.05% |      -3.06% |      -3.16% |  (R) -3.91% |
> |                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |       522072.83 |  (R) -2.15% |      -1.25% |                (R) -2.16% |  (R) -2.13% |      -2.10% |      -1.82% |
> |                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          |      1041640.33 |      -0.50% |  (R) -2.04% |                    -1.43% |      -0.69% |      -1.78% |  (R) -2.03% |
> |                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         |      2255794.00 |      -1.51% |  (R) -2.24% |                (R) -2.33% |      -1.14% |      -0.94% |      -1.60% |
> |                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |       343543.83 |  (R) -4.50% |  (R) -3.54% |                (R) -5.00% |  (R) -4.88% |  (R) -4.01% |  (R) -5.54% |
> |                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |       342290.33 |  (R) -5.15% |  (R) -3.24% |                (R) -3.76% |  (R) -5.37% |  (R) -3.74% |  (R) -5.51% |
> |                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |      1209666.83 |      -2.43% |      -2.09% |                    -1.19% |  (R) -4.39% |      -1.81% |      -3.15% |
> +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
> | perf/syscall    | execve (ops/sec)                                         |         1219.58 |             |  (R) -8.12% |                (R) -7.37% |  (R) -7.60% |  (R) -7.86% |  (R) -7.71% |
> |                 | fork (ops/sec)                                           |          863.67 |             |  (R) -7.24% |                (R) -7.07% |  (R) -6.42% |  (R) -6.93% |  (R) -6.55% |
> +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
>
>
> The details of the latest bisections carried out for the regressions
> listed above are given below:
> -Graviton3 (arm64)
> good: v6.19 (05f7e89ab973)
> bad: v7.0-rc2 (11439c4635ed)
> workload: perf/syscall (execve)
> bisected to: f1427a1d6415 (“slab: make percpu sheaves compatible with
> kmalloc_nolock()/kfree_nolock()”)
>
> -Sapphire Rapids (x86_64)
> good: v6.19 (05f7e89ab973)
> bad: v7.0-rc3 (1f318b96cc84)
> workload: perf/syscall (fork)
> bisected to: f1427a1d6415 (“slab: make percpu sheaves compatible with
> kmalloc_nolock()/kfree_nolock()”)
>
> -Graviton3 (arm64)
> good: v6.19 (05f7e89ab973)
> bad: v7.0-rc3 (1f318b96cc84)
> workload: perf/syscall (execve)
> bisected to: f3421f8d154c (“slab: introduce percpu sheaves bootstrap”)

Yeah, none of these is likely to have introduced the regression.
We've seen other reports from e.g. lkp pointing to later commits that remove
the cpu (partial) slabs. The theory is that on benchmarks that stress vma
and maple node caches (fork and execve are likely those), the introduction
of sheaves in 6.18 (for those caches only) resulted in ~doubled percpu
caching capacity (and likely an associated performance increase), because
the sheaves were backed by cpu (partial) slabs. Removing the latter then
looks like a regression in isolation in the 7.0 series.

A regression of vmalloc related to kvfree_rcu might be new. Although if it's
kvfree_rcu() of vmalloc'd objects, it would be weird. More likely they are
kvmalloc'd but small enough to be actually kmalloc'd? What are the details
of that test?

> I'm aware that some fixes for the sheaves series have already been
> merged around v7.0-rc3; however, these do not appear to resolve the
> regressions described above completely. Are there additional fixes or
> follow-ups in progress that I should evaluate? I can investigate
> further and provide additional data, if that would be useful.

We have some followups planned for 7.1 that would make a difference for
systems with memoryless nodes. That would mean "numactl -H" shows nodes that
have cpus but no memory, or that memory is all ZONE_MOVABLE and not ZONE_NORMAL.
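
A quick way to check whether a machine falls into that category, without
eyeballing "numactl -H" output, is to compare the node masks the kernel
exports in sysfs. The sketch below assumes the has_cpu and
has_normal_memory files under /sys/devices/system/node/ are present on the
system; the helper names are made up for illustration.

```python
from pathlib import Path

NODE_DIR = Path("/sys/devices/system/node")


def parse_node_list(text):
    """Parse a kernel list string such as '0-1,3' into a set of ints."""
    nodes = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty mask, e.g. file missing or no nodes listed
        lo, _, hi = part.partition("-")
        nodes.update(range(int(lo), int(hi or lo) + 1))
    return nodes


def cpu_nodes_without_normal_memory(has_cpu, has_normal_memory):
    """Nodes that have cpus but no memory in ZONE_NORMAL (or below).

    This covers both fully memoryless nodes and nodes whose memory is
    all ZONE_MOVABLE.
    """
    return sorted(parse_node_list(has_cpu) - parse_node_list(has_normal_memory))


def read_mask(name):
    """Read one node mask file; return '' if it does not exist."""
    path = NODE_DIR / name
    return path.read_text() if path.exists() else ""


if __name__ == "__main__":
    affected = cpu_nodes_without_normal_memory(
        read_mask("has_cpu"), read_mask("has_normal_memory")
    )
    print("nodes with cpus but no normal memory:", affected or "none")
```

An empty result suggests the planned 7.1 followups would not change much
on that system.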

Thanks,
Vlastimil

> Thank you.
> Aishwarya Rambhadran
>
>
> On 23/01/26 12:22 PM, Vlastimil Babka wrote: