Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation

From: Harry Yoo

Date: Thu Mar 12 2026 - 00:12:46 EST

On Wed, Mar 11, 2026 at 06:15:51PM +0800, Ming Lei wrote:
> On Wed, Mar 11, 2026 at 10:10:13AM +0900, Harry Yoo wrote:
> > Hi Ming, thanks a lot for helping with testing!
> >
> > The stats look quite fine to me, but we're still seeing suboptimal IOPS.
> >
> > > - slab stat on patched `815c8e35511d Merge branch 'slab/for-7.0/sheaves' into slab/for-next`
> >
> > Does that not include Vlastimil's (fb1091febd66 mm/slab: allow sheaf
> > refill if blocking is not allowed)?
>
> No, because fb1091febd66 isn't included in `815c8e35511d Merge branch
> 'slab/for-7.0/sheaves'`.

Ok. But the "mm/slab: allow sheaf refill if blocking is not allowed"
patch would affect performance, so let's not forget to include it.

> > Next time when testing it, could you please test on top of 7.0-rc3 w/
> > the memoryless node patch (w/ the delta above) applied?
>
> IOPS is same between `815c8e35511d Merge branch 'slab/for-7.0/sheaves' into slab/for-next`
> and 7.0-rc3 with the two patches.

Thanks!

> IMO, it should be easier to compare & investigate by focusing on
> 815c8e35511d, given there are only 41 patches between v6.19-rc5 and
> commit 815c8e35511d.

I was thinking that there might be another regression involved here
but yeah, apparently there isn't...

> > Also, let us check a few things...
> >
> > 1) Does bumping up sheaf capacity change the slab stats & IOPS?
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 0c906fefc31b..5207279417e2 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -7611,13 +7611,13 @@ static unsigned int calculate_sheaf_capacity(struct kmem_cache *s,
> >  	 * should result in similar lock contention (barn or list_lock)
> >  	 */
> >  	if (s->size >= PAGE_SIZE)
> > -		capacity = 4;
> > +		capacity = 6;
> >  	else if (s->size >= 1024)
> > -		capacity = 12;
> > +		capacity = 24;
> >  	else if (s->size >= 256)
> > -		capacity = 26;
> > +		capacity = 52;
> >  	else
> > -		capacity = 60;
> > +		capacity = 120;
> >
> >  	/* Increment capacity to make sheaf exactly a kmalloc size bucket */
> >  	size = struct_size_t(struct slab_sheaf, objects, capacity);
>
> IOPS increases from 24M to 29M with this patch, on top of 7.0-rc3 with
> the patchset Vlastimil posted today.

Oh, thanks!

Could you please try to keep increasing the numbers until the
performance stops improving?

It might or might not reach the original performance,
but that would be good to know.
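
To help pick the next values to try, here is a rough userspace sketch of
how a base capacity turns into an effective one after the "make sheaf
exactly a kmalloc size bucket" step. It is not the kernel code:
SHEAF_HEADER_BYTES and SLOT_BYTES are assumed values (the real
struct slab_sheaf layout depends on the config), bucket_roundup() only
approximates kmalloc size classes, and the `scale` parameter is a
hypothetical knob that multiplies every tier uniformly (the diff above
uses 6 rather than 8 for the PAGE_SIZE tier).

```c
#include <stddef.h>

/* ASSUMPTION: hypothetical sheaf header size, not the real struct layout. */
#define SHEAF_HEADER_BYTES 32u
/* ASSUMPTION: 8-byte object pointers (64-bit machine). */
#define SLOT_BYTES 8u

/* Base capacity tiers as they appear on the '-' lines of the diff. */
unsigned int base_capacity(size_t obj_size, size_t page_size)
{
	if (obj_size >= page_size)
		return 4;
	if (obj_size >= 1024)
		return 12;
	if (obj_size >= 256)
		return 26;
	return 60;
}

/* Approximate kmalloc size classes: powers of two from 8, plus 96 and 192. */
size_t bucket_roundup(size_t size)
{
	size_t bucket = 8;

	if (size > 64 && size <= 96)
		return 96;
	if (size > 128 && size <= 192)
		return 192;
	while (bucket < size)
		bucket <<= 1;
	return bucket;
}

/* Effective capacity once the sheaf is grown to fill its size bucket. */
unsigned int sheaf_capacity(size_t obj_size, size_t page_size,
			    unsigned int scale)
{
	size_t bytes = SHEAF_HEADER_BYTES +
		(size_t)base_capacity(obj_size, page_size) * scale * SLOT_BYTES;

	return (unsigned int)((bucket_roundup(bytes) - SHEAF_HEADER_BYTES) /
			      SLOT_BYTES);
}
```

Under these assumptions, sheaf_capacity(64, 4096, 2) gives 124 rather
than 120, so nearby base values can end up in the same bucket and
produce the same effective capacity.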

> > 2) Is there any change in NUMA locality between v6.19 vs. v7.0-rc3 (patched)?
> > (e.g., measured via
> > perf stat -e node-loads,node-load-misses,node-stores,node-store-misses)
>
> root@tomsrv:~/temp/mm/7.0-rc3/patched# perf stat -a -e node-loads,node-load-misses,node-stores,node-store-misses
> Error:
> No supported events found.
> The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (node-loads).
> "dmesg | grep -i perf" may provide additional information.
>
> Looks like these events are not supported on this AMD Zen4 machine.

Ouch.

> > 3) It's quite strange that blk_mq_sched_bio_merge() completely
> > disappeared in v7.0-rc2 profile [1]. Is there any change
> > in the read/write IO merge rate (/proc/diskstats) between v6.19 and
> > v7.0-rc3?
>
> It isn't strange.
>
> Because IOPS drops from 34M on v6.19-rc5 to 13M on v7.0-rc2, blk_mq_sched_bio_merge,
> which runs in the code path of each bio (IO), no longer stands out in the profile.
>
> It is totally random READ IO, so IO merges shouldn't happen.

I missed that point. Thanks!
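
For the record, if anyone wants to double-check the merge rate anyway,
a minimal sketch of reading it from two /proc/diskstats snapshots could
look like the following. The field order (major, minor, name, reads
completed, reads merged, sectors read, ms reading, writes completed,
writes merged, ...) follows the kernel's iostats documentation; the
struct and function names are made up for illustration.

```c
#include <stdio.h>

/* Hypothetical container for the counters we care about. */
struct io_merge_stats {
	unsigned long long reads;         /* reads completed */
	unsigned long long reads_merged;  /* reads merged */
	unsigned long long writes;        /* writes completed */
	unsigned long long writes_merged; /* writes merged */
};

/*
 * Parse one line of /proc/diskstats. Sectors and time fields are skipped.
 * Returns 0 on success, -1 if the line does not match the expected layout.
 */
int parse_diskstats_line(const char *line, char name[32],
			 struct io_merge_stats *st)
{
	unsigned int major, minor;
	int n;

	n = sscanf(line, "%u %u %31s %llu %llu %*llu %*llu %llu %llu",
		   &major, &minor, name,
		   &st->reads, &st->reads_merged,
		   &st->writes, &st->writes_merged);
	return n == 7 ? 0 : -1;
}

/* Merged reads per completed read between two snapshots (0 if idle). */
double read_merge_ratio(const struct io_merge_stats *before,
			const struct io_merge_stats *after)
{
	unsigned long long done = after->reads - before->reads;
	unsigned long long merged = after->reads_merged - before->reads_merged;

	return done ? (double)merged / (double)done : 0.0;
}
```

Taking one snapshot before and one after the fio run on each kernel and
comparing read_merge_ratio() should stay at or near zero for a purely
random READ workload, which would match Ming's explanation.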

--
Cheers,
Harry / Hyeonggon