Re: [REGRESSION] slab: replace cpu (partial) slabs with sheaves

From: Vlastimil Babka (SUSE)

Date: Thu Mar 26 2026 - 14:37:24 EST

On 3/26/26 19:16, Uladzislau Rezki wrote:
> On Thu, Mar 26, 2026 at 03:42:02PM +0100, Vlastimil Babka (SUSE) wrote:
>> On 3/26/26 13:43, Aishwarya Rambhadran wrote:
>> > Hi Vlastimil, Harry,
>>
>> Hi!
>>
>> > We have observed a few kernel performance benchmark regressions,
>> > mainly in perf & vmalloc workloads, when comparing v6.19 mainline
>> > kernel results against later releases in the v7.0 cycle.
>> > Independent bisections on different machines consistently point
>> > to commits within the slab percpu sheaves series. However, towards
>> > the end of the bisection, the signal becomes less clear, so it's
>> > not yet certain which specific commit within the series is the
>> > root cause.
>> >
>> > The workloads were triggered on AWS Graviton3 (arm64) & AWS Intel
>> > Sapphire Rapids (x86_64) systems in which the regressions are
>> > reproducible across different kernel release candidates.
>> > (R)/(I) mean statistically significant regression/improvement,
>> > where "statistically significant" means the 95% confidence
>> > intervals do not overlap.
>> >
>> > Below are the performance benchmark results generated by Fastpath
>> > Tool for different kernel -rc versions, relative to the base
>> > version v6.19, on the SUTs mentioned above. The perf/syscall
>> > benchmarks (execve/fork) regress consistently by ~6–11% on both
>> > arm64 and x86_64 across v7.0-rc1 to rc5, while vmalloc workloads
>> > show smaller but stable regressions (~2–10%), particularly in
>> > kvfree_rcu paths.
>> >
>> > Regressions on AWS Intel Sapphire Rapids (x86_64):
>>
>> The table formatting is broken for me, can you resend it please? Maybe a
>> .txt attachment would work better.
>>
>> > +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
>> > | Benchmark       | Result Class                                             | 6-19-0 (base)   | 7-0-0-rc1   | 7-0-0-rc2   | 7-0-0-rc2-gaf4e9ef3d784   | 7-0-0-rc3   | 7-0-0-rc4   | 7-0-0-rc5   |
>> > +=================+==========================================================+=================+=============+=============+===========================+=============+=============+=============+
>> > | micromm/vmalloc | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 262605.17       | -4.94%      | -7.48%      | (R) -8.11%                | -4.51%      | -6.23%      | -3.47%      |
>> > |                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 253198.67       | -7.56%      | (R) -10.57% | (R) -10.13%               | (R) -7.07%  | -6.37%      | -6.55%      |
>> > |                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               | 197904.67       | -2.07%      | -3.38%      | -2.07%                    | -2.97%      | (R) -4.30%  | -3.39%      |
>> > |                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  | 1707089.83      | -2.63%      | (R) -3.69%  | (R) -3.25%                | (R) -2.87%  | -2.22%      | (R) -3.63%  |
>> > +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
>> > | perf/syscall    | execve (ops/sec)                                         | 1202.92         | (R) -7.15%  | (R) -7.05%  | (R) -7.03%                | (R) -7.93%  | (R) -6.51%  | (R) -7.36%  |
>> > |                 | fork (ops/sec)                                           | 996.00          | (R) -9.00%  | (R) -10.27% | (R) -9.92%                | (R) -11.19% | (R) -10.69% | (R) -10.28% |
>> > +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
>> >
>> > Regressions on AWS Graviton3 (arm64):
>> > +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
>> > | Benchmark       | Result Class                                             | 6-19-0 (base)   | 7-0-0-rc1   | 7-0-0-rc2   | 7-0-0-rc2-gaf4e9ef3d784   | 7-0-0-rc3   | 7-0-0-rc4   | 7-0-0-rc5   |
>> > +=================+==========================================================+=================+=============+=============+===========================+=============+=============+=============+
>> > | micromm/vmalloc | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           | 320101.50       | (R) -4.72%  | (R) -3.81%  | (R) -5.05%                | -3.06%      | -3.16%      | (R) -3.91%  |
>> > |                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           | 522072.83       | (R) -2.15%  | -1.25%      | (R) -2.16%                | (R) -2.13%  | -2.10%      | -1.82%      |
>> > |                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          | 1041640.33      | -0.50%      | (R) -2.04%  | -1.43%                    | -0.69%      | -1.78%      | (R) -2.03%  |
>> > |                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         | 2255794.00      | -1.51%      | (R) -2.24%  | (R) -2.33%                | -1.14%      | -0.94%      | -1.60%      |
>> > |                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 343543.83       | (R) -4.50%  | (R) -3.54%  | (R) -5.00%                | (R) -4.88%  | (R) -4.01%  | (R) -5.54%  |
>> > |                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 342290.33       | (R) -5.15%  | (R) -3.24%  | (R) -3.76%                | (R) -5.37%  | (R) -3.74%  | (R) -5.51%  |
>> > |                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  | 1209666.83      | -2.43%      | -2.09%      | -1.19%                    | (R) -4.39%  | -1.81%      | -3.15%      |
>> > +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
>> > | perf/syscall    | execve (ops/sec)                                         | 1219.58         |             | (R) -8.12%  | (R) -7.37%                | (R) -7.60%  | (R) -7.86%  | (R) -7.71%  |
>> > |                 | fork (ops/sec)                                           | 863.67          |             | (R) -7.24%  | (R) -7.07%                | (R) -6.42%  | (R) -6.93%  | (R) -6.55%  |
>> > +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
>> >
>> >
>> > The details of the latest bisections carried out for the regressions
>> > listed above are given below:
>> > -Graviton3 (arm64)
>> > good: v6.19 (05f7e89ab973)
>> > bad: v7.0-rc2 (11439c4635ed)
>> > workload: perf/syscall (execve)
>> > bisected to: f1427a1d6415 (“slab: make percpu sheaves compatible with
>> > kmalloc_nolock()/kfree_nolock()”)
>> >
>> > -Sapphire Rapids (x86_64)
>> > good: v6.19 (05f7e89ab973)
>> > bad: v7.0-rc3 (1f318b96cc84)
>> > workload: perf/syscall (fork)
>> > bisected to: f1427a1d6415 (“slab: make percpu sheaves compatible with
>> > kmalloc_nolock()/kfree_nolock()”)
>> >
>> > -Graviton3 (arm64)
>> > good: v6.19 (05f7e89ab973)
>> > bad: v7.0-rc3 (1f318b96cc84)
>> > workload: perf/syscall (execve)
>> > bisected to: f3421f8d154c (“slab: introduce percpu sheaves bootstrap”)
>>
>> Yeah, none of these are likely to have introduced the regression.
>> We've seen other reports from e.g. lkp pointing to later commits that remove
>> the cpu (partial) slabs. The theory is that on benchmarks that stress vma
>> and maple node caches (fork and execve are likely those), the introduction
>> of sheaves in 6.18 (for those caches only) resulted in ~doubled percpu
>> caching capacity (and likely associated performance increase) - by sheaves
>> backed by cpu (partial) slabs. Removing the latter then looks like a
>> regression in isolation in the 7.0 series.
>>
>> A regression of vmalloc related to kvfree_rcu might be new. Although if it's
>> kvfree_rcu() of vmalloc'd objects, it would be weird. More likely they are
>> kvmalloc'd but small enough to be actually kmalloc'd? What are the details
>> of that test?
>>
> static int
> kvfree_rcu_2_arg_vmalloc_test(void)

Oh so that's what the test is measuring? Thanks for clarifying.

> {
> 	struct test_kvfree_rcu *p;
> 	int i;
>
> 	for (i = 0; i < test_loop_count; i++) {
> 		p = vmalloc(1 * PAGE_SIZE);
> 		if (!p)
> 			return -1;
>
> 		p->array[0] = 'a';
> 		kvfree_rcu(p, rcu);
> 	}
>
> 	return 0;
> }
>
> static bool kfree_rcu_sheaf(void *obj)
> {
> 	struct kmem_cache *s;
> 	struct slab *slab;
>
> 	if (is_vmalloc_addr(obj))
> 		return false;
>
> 	slab = virt_to_slab(obj);
> 	if (unlikely(!slab))
> 		return false;
>
> 	s = slab->slab_cache;
> 	if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id()))
> 		return __kfree_rcu_sheaf(s, obj);
>
> 	return false;
> }
>
> It does not go via a sheaf since it is a vmalloc address.

Right, so there should be just the overhead of the extra is_vmalloc_addr()
test, and possibly also the call to kfree_rcu_sheaf() if it's not inlined.
I'd say it's something we can just accept? It seems this is a unit test
being used as a microbenchmark, so it can be very sensitive even to such
details, but it should be negligible in practice.
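
To illustrate why the extra check should be cheap, here is a minimal
user-space sketch of the early-return shape in the quoted
kfree_rcu_sheaf(). It is not the kernel code: the sketch_* names are
hypothetical, and the two address bounds are placeholders picked for the
example, since the real VMALLOC_START/VMALLOC_END are architecture-specific.
The only point is that a vmalloc address costs a range comparison (plus a
function call if nothing is inlined) before falling back to the regular path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical bounds, chosen only for this sketch. The real
 * VMALLOC_START/VMALLOC_END depend on the architecture and config.
 */
#define SKETCH_VMALLOC_START UINT64_C(0xffffc90000000000)
#define SKETCH_VMALLOC_END   UINT64_C(0xffffe90000000000)

/* A range check: two comparisons on the address, nothing else. */
static bool sketch_is_vmalloc_addr(uint64_t addr)
{
	return addr >= SKETCH_VMALLOC_START && addr < SKETCH_VMALLOC_END;
}

/*
 * Mirrors the early-return structure of the quoted kfree_rcu_sheaf():
 * vmalloc addresses bail out before any slab lookup would happen.
 */
static bool sketch_try_sheaf(uint64_t addr)
{
	if (sketch_is_vmalloc_addr(addr))
		return false;	/* caller falls back to the regular path */

	return true;		/* stand-in for the slab/sheaf handling */
}

/* Small self-check exercising both branches. */
static void sketch_demo(void)
{
	/* An address inside the assumed vmalloc range is rejected. */
	assert(!sketch_try_sheaf(SKETCH_VMALLOC_START + 4096));
	/* An address below the assumed range takes the sheaf path. */
	assert(sketch_try_sheaf(UINT64_C(0xffff888000001000)));
}
```

Calling sketch_demo() from a test harness exercises both branches. In a loop
like the quoted kvfree_rcu_2_arg_vmalloc_test(), every iteration would take
the first branch.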

>
> --
> Uladzislau Rezki