Re: [REGRESSION] slab: replace cpu (partial) slabs with sheaves

From: Vlastimil Babka (SUSE)

Date: Fri Mar 27 2026 - 07:24:44 EST


On 3/27/26 11:00, Harry Yoo (Oracle) wrote:
> On Fri, Mar 27, 2026 at 08:58:36AM +0000, Ryan Roberts wrote:
>> >>> On 3/26/26 13:43, Aishwarya Rambhadran wrote:
>> >>> Right so there should be just the overhead of the extra is_vmalloc_addr()
>> >>> test. Possibly also the call of kfree_rcu_sheaf() if it's not inlined.
>> >>> I'd say it's something we can just accept? It seems this is a unit test
>> >>> being used as a microbenchmark, so it can be very sensitive even to such
>> >>> details, but it should be negligible in practice.
>> >>
>> >> The perf/syscall cases might be a bit more concerning though? (those tests are
>> >> from "perf bench syscall fork|execve"). Yes they are microbenchmarks, but a 7%
>> >> increased cost for fork seems like something we'd want to avoid if we can.
>> >
>> > Sure, I tried to explain those in my first reply. Harry then linked to how
>> > that explanation can be verified. Hopefully it's really the same reason.
>>
>> Ahh sorry, I missed your first email. We only added that benchmark in 6.19
>> so we don't have results for earlier kernels, but I'll ask Aishu to run it
>> for 6.17 and 6.18 to see if the results correlate with your expectation.
>>
>> But from a high level perspective, a 7% regression on fork is not ideal even if
>> there was a 7% improvement in 6.18.

In retrospect it was an oversight not to disable the pre-existing cpu
caching layer immediately for sheaf-enabled caches in 6.18. Can't undo that
mistake now, unfortunately.

> If that improvement comes from the number of objects cached per CPU,
> I'm not sure if determining the default value (# of cached objs) based on
> "a point when microbenchmarks stop improving" is a reasonable measure
> because the default value affects all slab caches and will inevitably
> increase overall memory usage.

Yeah, that's the thing: some workloads might just keep improving as you throw
more caching at them, but there's a memory usage cost to that.
A stress test that does nothing but fork might also not be representative of
fork performance under a normal workload, where other operations happen in
between and return the related slab objects, so in the end it doesn't really
expose the effect of the batch size.

> Hopefully we could discuss what a reasonable heuristic that
> "works for most situations" looks like, and allow users to tune it further
> based on their needs.
>
> As a side note, changing the sheaf capacity at runtime is not supported yet
> (I'm working on it), aiming to land it before the next LTS at the latest.
>