Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
From: Ming Lei
Date: Wed Mar 11 2026 - 06:28:21 EST
On Wed, Mar 11, 2026 at 10:10:13AM +0900, Harry Yoo wrote:
> On Fri, Mar 06, 2026 at 06:22:37PM +0800, Ming Lei wrote:
> > On Fri, Mar 06, 2026 at 09:47:27AM +0100, Vlastimil Babka (SUSE) wrote:
> > > On 3/6/26 05:55, Harry Yoo wrote:
> > > > On Thu, Feb 26, 2026 at 07:02:11PM +0100, Vlastimil Babka (SUSE) wrote:
> > > >> On 2/25/26 10:31, Ming Lei wrote:
> > > >> > Hi Vlastimil,
> > > >> >
> > > >> > On Wed, Feb 25, 2026 at 09:45:03AM +0100, Vlastimil Babka (SUSE) wrote:
> > > >> >> On 2/24/26 21:27, Vlastimil Babka wrote:
> > > >> >> >
> > > >> >> > It made sense to me not to refill sheaves when we can't reclaim, but I
> > > >> >> > didn't anticipate this interaction with mempools. We could change them
> > > >> >> > but there might be others using a similar pattern. Maybe it would be for
> > > >> >> > the best to just drop that heuristic from __pcs_replace_empty_main()
> > > >> >> > (but carefully as some deadlock avoidance depends on it, we might need
> > > >> >> > to e.g. replace it with gfpflags_allow_spinning()). I'll send a patch
> > > >> >> > tomorrow to test this theory, unless someone beats me to it (feel free to).
> > > >> >> Could you try this then, please? Thanks!
> > > >> >
> > > >> > Thanks for working on this issue!
> > > >> >
> > > >> > Unfortunately the patch doesn't make a difference in IOPS in the perf test;
> > > >> > the collected perf profile on Linus' tree (basically 7.0-rc1 with your patch) follows:
> > > >>
> > > >> What about this patch in addition to the previous one? Thanks.
> > > >>
> > > >> ----8<----
> > > >> From d3e8118c078996d1372a9f89285179d93971fdb2 Mon Sep 17 00:00:00 2001
> > > >> From: "Vlastimil Babka (SUSE)" <vbabka@xxxxxxxxxx>
> > > >> Date: Thu, 26 Feb 2026 18:59:56 +0100
> > > >> Subject: [PATCH] mm/slab: put barn on every online node
> > > >>
> > > >> Including memoryless nodes.
> > > >>
> > > >> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@xxxxxxxxxx>
> > > >> ---
> > > >
> > > > Just taking a quick glance...
> > > >
> > > >> @@ -6121,7 +6122,8 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
> > > >> if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
> > > >> return;
> > > >>
> > > >> - if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id())
> > > >> + if (likely(!IS_ENABLED(CONFIG_NUMA) || (slab_nid(slab) == numa_mem_id())
> > > >> + || !node_isset(slab_nid(slab), slab_nodes))
> > > >
> > > > I think you intended !node_isset(numa_mem_id(), slab_nodes)?
> > > >
> > > > "Skip freeing to pcs if it's a remote free, but memoryless nodes are
> > > > an exception".
> > >
> > > Indeed, thanks! Ming, could you retry with that fixed up please?
> >
> > After applying the following change, IOPS is ~25M:
> >
> > - delta change on the two patches
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 085fe49eec68..56fe8bd956c0 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -6142,7 +6142,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
> > return;
> >
> > if (likely(!IS_ENABLED(CONFIG_NUMA) || (slab_nid(slab) == numa_mem_id())
> > - || !node_isset(slab_nid(slab), slab_nodes))
> > + || !node_isset(numa_mem_id(), slab_nodes))
> > && likely(!slab_test_pfmemalloc(slab))) {
> > if (likely(free_to_pcs(s, object, true)))
> > return;
> >
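The corrected condition in the delta above can be modeled outside the kernel. The following is a minimal stand-alone sketch (not kernel code): a plain bitmask stands in for the kernel's slab_nodes nodemask, node ids are assumed to be below 64, and the helper name `may_free_to_pcs` is hypothetical.

```c
#include <stdbool.h>

/*
 * Stand-alone model of the fixed fast-path check in slab_free():
 * the per-CPU sheaf may be used when the object is local, or when
 * the local memory node is not tracked in slab_nodes (the
 * memoryless-node exception).  A plain bitmask stands in for the
 * kernel's slab_nodes nodemask; node ids are assumed < 64.
 */
static bool may_free_to_pcs(int slab_nid, int local_mem_nid,
                            unsigned long slab_nodes_mask)
{
	if (slab_nid == local_mem_nid)
		return true;	/* local free: fast path is always fine */

	/* remote free: only take the fast path if the local node has no barn */
	return !(slab_nodes_mask & (1UL << local_mem_nid));
}
```

With nodes 0 and 1 having memory (mask 0x3), a local free on node 0 and a remote free observed from memoryless node 2 both take the fast path, while a remote free observed from node 1 does not.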
>
> Hi Ming, thanks a lot for helping with testing!
>
> The stats look quite fine to me, but we're still seeing suboptimal IOPS.
>
> > - slab stat on patched `815c8e35511d Merge branch 'slab/for-7.0/sheaves' into slab/for-next`
>
> Does that include Vlastimil's fb1091febd66 ("mm/slab: allow sheaf
> refill if blocking is not allowed")?
No, because fb1091febd66 isn't included in `815c8e35511d Merge branch
'slab/for-7.0/sheaves'`.
>
> Next time when testing it, could you please test on top of 7.0-rc3 w/
> the memoryless node patch (w/ the delta above) applied?
IOPS is the same between `815c8e35511d Merge branch 'slab/for-7.0/sheaves' into slab/for-next`
and 7.0-rc3 with the two patches.
IMO, it should be easier to compare & investigate by focusing on
815c8e35511d, given there are only 41 patches between v6.19-rc5 and
commit 815c8e35511d.
>
> Also, let us check a few things...
>
> 1) Does bumping up sheaf capacity change the slab stats & IOPS?
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 0c906fefc31b..5207279417e2 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -7611,13 +7611,13 @@ static unsigned int calculate_sheaf_capacity(struct kmem_cache *s,
> * should result in similar lock contention (barn or list_lock)
> */
> if (s->size >= PAGE_SIZE)
> - capacity = 4;
> + capacity = 6;
> else if (s->size >= 1024)
> - capacity = 12;
> + capacity = 24;
> else if (s->size >= 256)
> - capacity = 26;
> + capacity = 52;
> else
> - capacity = 60;
> + capacity = 120;
>
> /* Increment capacity to make sheaf exactly a kmalloc size bucket */
> size = struct_size_t(struct slab_sheaf, objects, capacity);
IOPS increases from 24M to 29M with this patch, on top of 7.0-rc3 with
Vlastimil's patchset from today.
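For reference, the "increment capacity to make sheaf exactly a kmalloc size bucket" step quoted above can be illustrated with a stand-alone sketch (not the kernel implementation): `SHEAF_HEADER` is a placeholder for offsetof(struct slab_sheaf, objects), and kmalloc buckets are simplified to powers of two.

```c
#include <stddef.h>

/*
 * Illustration of rounding a base sheaf capacity up so the sheaf
 * struct uses all the slack of the kmalloc bucket it would be
 * allocated from anyway.  SHEAF_HEADER is an assumed header size,
 * for illustration only.
 */
#define SHEAF_HEADER 32u

static size_t kmalloc_bucket(size_t size)
{
	size_t b = 8;

	/* simplified: power-of-two buckets only */
	while (b < size)
		b <<= 1;
	return b;
}

static unsigned int round_capacity(unsigned int capacity)
{
	size_t size = SHEAF_HEADER + capacity * sizeof(void *);
	size_t bucket = kmalloc_bucket(size);

	/* the extra bucket space holds more object pointers for free */
	return (bucket - SHEAF_HEADER) / sizeof(void *);
}
```

Under these assumptions, on a 64-bit machine a base capacity of 26 lands in a 256-byte bucket and gets bumped to 28, while 12 exactly fills a 128-byte bucket and stays as is.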
>
> 2) Is there any change in NUMA locality between v6.19 vs. v7.0-rc3 (patched)?
> (e.g., measured via
> perf stat -e node-loads,node-load-misses,node-stores,node-store-misses)
root@tomsrv:~/temp/mm/7.0-rc3/patched# perf stat -a -e node-loads,node-load-misses,node-stores,node-store-misses
Error:
No supported events found.
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (node-loads).
"dmesg | grep -i perf" may provide additional information.
Looks like these events are not supported on this AMD Zen4 machine.
>
> 3) It's quite strange that blk_mq_sched_bio_merge() completely
> disappeared in v7.0-rc2 profile [1] . Is there any change
> in read/write io merge rate? (/proc/diskstats) between v6.19 and
> v7.0-rc3?
It isn't strange: IOPS drops from 34M on v6.19-rc5 to 13M on v7.0-rc2,
so blk_mq_sched_bio_merge() no longer shows up prominently in the
profile, even though that code path is run for each bio (IO).
The workload is totally random READ IO, so IO merges shouldn't happen anyway.
Thanks,
Ming