Re: [PATCH v4 08/22] slab: make percpu sheaves compatible with kmalloc_nolock()/kfree_nolock()

From: D, Suneeth

Date: Mon Mar 02 2026 - 07:01:04 EST

Hi Vlastimil Babka,

On 1/23/2026 12:22 PM, Vlastimil Babka wrote:

Before we enable percpu sheaves for kmalloc caches, we need to make sure
kmalloc_nolock() and kfree_nolock() will continue working properly and
not spin when not allowed to.

Percpu sheaves themselves use local_trylock() so they are already
compatible. We just need to be careful with the barn->lock spin_lock.
Pass a new allow_spin parameter where necessary to use
spin_trylock_irqsave().

In kmalloc_nolock_noprof() we can now attempt alloc_from_pcs() safely,
for now it will always fail until we enable sheaves for kmalloc caches
next. Similarly in kfree_nolock() we can attempt free_to_pcs().
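
The allow_spin idea described above can be illustrated outside the kernel. The following is only a minimal userspace sketch, assuming a C11 atomic_flag as a stand-in for barn->lock; struct toy_barn and all toy_* helpers are hypothetical names and not part of mm/slub.c. It shows the shape of the change: spin when allowed, otherwise try once and let the caller fall back.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the barn; not the kernel structure. */
struct toy_barn {
	atomic_flag lock;	/* plays the role of barn->lock */
	int nr_empty;		/* number of "empty sheaves" available */
};

static void toy_barn_init(struct toy_barn *barn, int nr_empty)
{
	atomic_flag_clear(&barn->lock);
	barn->nr_empty = nr_empty;
}

/* Single attempt: returns true only if the lock was free. */
static bool toy_trylock(atomic_flag *lock)
{
	return !atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

/* Busy-wait until the lock is acquired. */
static void toy_lock(atomic_flag *lock)
{
	while (!toy_trylock(lock))
		;
}

static void toy_unlock(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}

/*
 * Take one empty sheaf from the barn. With allow_spin the caller may
 * busy-wait for the lock; without it a contended lock is reported as
 * failure so the caller can take a fallback path instead of spinning.
 */
static bool toy_barn_get_empty(struct toy_barn *barn, bool allow_spin)
{
	bool got = false;

	assert(barn != NULL);

	if (allow_spin)
		toy_lock(&barn->lock);
	else if (!toy_trylock(&barn->lock))
		return false;

	if (barn->nr_empty > 0) {
		barn->nr_empty--;
		got = true;
	}
	toy_unlock(&barn->lock);
	return got;
}
```

A caller that must not spin passes allow_spin=false and treats a false return the same way as an empty barn.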

We run the will-it-scale micro-benchmark as part of our weekly CI for
kernel performance regression testing, comparing a stable kernel against
an rc kernel. We observed that the will-it-scale-thread-page_fault3
variant regressed by 9-11% on AMD platforms (Turin and Bergamo) between
v6.19 and v7.0-rc1. Bisection landed on this commit,
f1427a1d64156bb88d84f364855c364af6f67a3b ("slab: make percpu sheaves
compatible with kmalloc_nolock()/kfree_nolock()"), as the first bad
commit. The machine configurations and test parameters used were:

Model name: AMD EPYC 128-Core Processor [Bergamo]
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 1
Total online memory: 256G

Model name: AMD EPYC 64-Core Processor [Turin]
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
Total online memory: 258G

Test params:
------------
nr_task: [1 8 64 128 192 256]
mode: thread
test: page_fault3
kpi: per_thread_ops
cpufreq_governor: performance

The following are the stats after bisection:-
(the KPI used here is per_thread_ops)

kernel_versions                                      per_thread_ops
---------------------------------------------------  --------------
v6.19.0 (baseline)                                   2410188
v7.0-rc1                                             2151474
v6.19-rc5-f1427a1d6415                               2263974
v6.19-rc5-f3421f8d154c (one commit before culprit)   2323263
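
For reference, the relative deltas implied by these numbers can be worked out as below. This is just a small sketch of the arithmetic, assuming each figure is a single averaged per_thread_ops value; pct_change() and print_deltas() are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change of `value` against `base`, in percent (negative = slower). */
static double pct_change(double base, double value)
{
	assert(base != 0.0);
	return (value - base) / base * 100.0;
}

/* Print the two comparisons that can be read directly off the table. */
static void print_deltas(void)
{
	/* v7.0-rc1 against the v6.19.0 baseline: about -10.73% */
	printf("v7.0-rc1 vs v6.19.0:          %+.2f%%\n",
	       pct_change(2410188.0, 2151474.0));
	/* culprit against the commit right before it: about -2.55% */
	printf("f1427a1d6415 vs f3421f8d154c: %+.2f%%\n",
	       pct_change(2323263.0, 2263974.0));
}
```

The first figure falls within the 9-11% range quoted above; the second is the step between the two adjacent commits from the bisection.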

Recreation steps:
-----------------
1) git clone https://github.com/antonblanchard/will-it-scale.git
2) git clone https://github.com/intel/lkp-tests.git
3) cd will-it-scale && git apply ../lkp-tests/programs/will-it-scale/pkg/will-it-scale.patch
4) make
5) python3 runtest.py page_fault3 25 thread 0 0 1 8 64 128 192 256

NOTE: Step 5 is specific to the machine's architecture. The arguments
from 1 onwards are the list of task counts to run the testcase with;
here they correspond to the number of cores per CCX, per NUMA node/per
socket, and nr_threads.

I also ran the micro-benchmark under ./tools/testing/perf record; the
following is the diff collected:

# ./perf diff perf.data.old perf.data
Warning:
4 out of order events recorded.
# Event 'cpu/cycles/P'
#
# Baseline Delta Abs  Shared Object        Symbol
# ........ .........  ...................  ...................................
#
             +11.95%  [kernel.kallsyms]    [k] folio_pte_batch
             +10.30%  [kernel.kallsyms]    [k] native_queued_spin_lock_slowpath
              +9.91%  [kernel.kallsyms]    [k] __block_write_begin_int
     0.00%    +8.56%  [kernel.kallsyms]    [k] clear_page_erms
     7.71%    -7.71%  [kernel.kallsyms]    [k] delay_halt
              +6.84%  [kernel.kallsyms]    [k] block_dirty_folio
     1.58%    +4.90%  [kernel.kallsyms]    [k] unmap_page_range
     0.00%    +4.78%  [kernel.kallsyms]    [k] folio_remove_rmap_ptes
     3.17%    -3.17%  [kernel.kallsyms]    [k] __vmf_anon_prepare
     0.00%    +3.09%  [kernel.kallsyms]    [k] ext4_page_mkwrite
              +2.32%  [kernel.kallsyms]    [k] ext4_dirty_folio
     0.00%    +2.01%  [kernel.kallsyms]    [k] vm_normal_page
     0.00%    +1.93%  [kernel.kallsyms]    [k] set_pte_range
              +1.84%  [kernel.kallsyms]    [k] block_commit_write
              +1.82%  [kernel.kallsyms]    [k] mod_node_page_state
              +1.68%  [kernel.kallsyms]    [k] lruvec_stat_mod_folio
              +1.56%  [kernel.kallsyms]    [k] mod_memcg_lruvec_state
     1.40%    -1.39%  [kernel.kallsyms]    [k] mod_memcg_state
              +1.38%  [kernel.kallsyms]    [k] folio_add_file_rmap_ptes
     5.01%    -0.87%  page_fault3_threads  [.] testcase
              +0.84%  [kernel.kallsyms]    [k] tlb_flush_rmap_batch
              +0.83%  [kernel.kallsyms]    [k] mark_buffer_dirty
     1.66%    -0.75%  [kernel.kallsyms]    [k] flush_tlb_mm_range
              +0.72%  [kernel.kallsyms]    [k] css_rstat_updated
     0.60%    -0.60%  [kernel.kallsyms]    [k] osq_unlock
              +0.57%  [kernel.kallsyms]    [k] _raw_spin_unlock
              +0.55%  [kernel.kallsyms]    [k] perf_iterate_ctx
              +0.54%  [kernel.kallsyms]    [k] __rcu_read_lock
     0.11%    +0.53%  [kernel.kallsyms]    [k] osq_lock
              +0.46%  [kernel.kallsyms]    [k] finish_fault
     0.46%    -0.46%  [kernel.kallsyms]    [k] do_wp_page
              +0.45%  [kernel.kallsyms]    [k] pte_val
     1.10%    -0.41%  [kernel.kallsyms]    [k] filemap_fault
              +0.39%  [kernel.kallsyms]    [k] native_set_pte
              +0.36%  [kernel.kallsyms]    [k] rwsem_spin_on_owner
     0.28%    -0.28%  [kernel.kallsyms]    [k] mas_topiary_replace
              +0.28%  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
              +0.27%  [kernel.kallsyms]    [k] percpu_counter_add_batch
              +0.27%  [kernel.kallsyms]    [k] memset
     0.00%    +0.24%  [kernel.kallsyms]    [k] mas_walk
     0.23%    -0.23%  [kernel.kallsyms]    [k] __pmd_alloc
     0.23%    -0.22%  [kernel.kallsyms]    [k] rcu_core
              +0.21%  [kernel.kallsyms]    [k] __rcu_read_unlock
     0.04%    +0.19%  [kernel.kallsyms]    [k] ext4_da_get_block_prep
              +0.19%  [kernel.kallsyms]    [k] lock_vma_under_rcu
     0.01%    +0.19%  [kernel.kallsyms]    [k] prep_compound_page
              +0.18%  [kernel.kallsyms]    [k] filemap_get_entry
              +0.17%  [kernel.kallsyms]    [k] folio_mark_dirty
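
For readers less familiar with the workload: symbols such as ext4_page_mkwrite, block_dirty_folio and folio_add_file_rmap_ptes in the profile are consistent with write faults on a shared, file-backed mapping. The following is a minimal sketch of such an access pattern, assuming that is what page_fault3 exercises. It is not the will-it-scale source; the helper name touch_shared_file_pages(), the /tmp scratch path and the mapping size are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map a temporary file MAP_SHARED, write one byte
 * per page, then unmap. Returns the number of pages written, or -1 if
 * the scratch file or the mapping could not be set up.
 */
static long touch_shared_file_pages(size_t len)
{
	char path[] = "/tmp/pf3-sketch.XXXXXX";	/* assumed scratch location */
	long pagesz = sysconf(_SC_PAGESIZE);
	long touched = 0;
	char *map;
	int fd;

	assert(len > 0 && pagesz > 0);

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);	/* file disappears once fd is closed */

	if (ftruncate(fd, (off_t)len) != 0) {
		close(fd);
		return -1;
	}

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}

	/* The first write to each page of the fresh mapping faults it in. */
	for (size_t off = 0; off < len; off += (size_t)pagesz) {
		map[off] = 0;
		touched++;
	}

	munmap(map, len);
	close(fd);
	return touched;
}
```

Calling this in a loop from many threads of one process would roughly mimic the "thread" mode used in the test parameters above.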

Would be happy to help with further testing and providing additional data if required.

Thanks,
Suneeth D

Reviewed-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
Reviewed-by: Harry Yoo <harry.yoo@xxxxxxxxxx>
Reviewed-by: Hao Li <hao.li@xxxxxxxxx>
Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
---
mm/slub.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 60 insertions(+), 22 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 41e1bf35707c..4ca6bd944854 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2889,7 +2889,8 @@ static void pcs_destroy(struct kmem_cache *s)
s->cpu_sheaves = NULL;
}
-static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
+static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn,
+ bool allow_spin)
{
struct slab_sheaf *empty = NULL;
unsigned long flags;
@@ -2897,7 +2898,10 @@ static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
if (!data_race(barn->nr_empty))
return NULL;
- spin_lock_irqsave(&barn->lock, flags);
+ if (likely(allow_spin))
+ spin_lock_irqsave(&barn->lock, flags);
+ else if (!spin_trylock_irqsave(&barn->lock, flags))
+ return NULL;
if (likely(barn->nr_empty)) {
empty = list_first_entry(&barn->sheaves_empty,
@@ -2974,7 +2978,8 @@ static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
* change.
*/
static struct slab_sheaf *
-barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
+barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty,
+ bool allow_spin)
{
struct slab_sheaf *full = NULL;
unsigned long flags;
@@ -2982,7 +2987,10 @@ barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
if (!data_race(barn->nr_full))
return NULL;
- spin_lock_irqsave(&barn->lock, flags);
+ if (likely(allow_spin))
+ spin_lock_irqsave(&barn->lock, flags);
+ else if (!spin_trylock_irqsave(&barn->lock, flags))
+ return NULL;
if (likely(barn->nr_full)) {
full = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
@@ -3003,7 +3011,8 @@ barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
* barn. But if there are too many full sheaves, reject this with -E2BIG.
*/
static struct slab_sheaf *
-barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
+barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full,
+ bool allow_spin)
{
struct slab_sheaf *empty;
unsigned long flags;
@@ -3014,7 +3023,10 @@ barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
if (!data_race(barn->nr_empty))
return ERR_PTR(-ENOMEM);
- spin_lock_irqsave(&barn->lock, flags);
+ if (likely(allow_spin))
+ spin_lock_irqsave(&barn->lock, flags);
+ else if (!spin_trylock_irqsave(&barn->lock, flags))
+ return ERR_PTR(-EBUSY);
if (likely(barn->nr_empty)) {
empty = list_first_entry(&barn->sheaves_empty, struct slab_sheaf,
@@ -5008,7 +5020,8 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
return NULL;
}
- full = barn_replace_empty_sheaf(barn, pcs->main);
+ full = barn_replace_empty_sheaf(barn, pcs->main,
+ gfpflags_allow_spinning(gfp));
if (full) {
stat(s, BARN_GET);
@@ -5025,7 +5038,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
empty = pcs->spare;
pcs->spare = NULL;
} else {
- empty = barn_get_empty_sheaf(barn);
+ empty = barn_get_empty_sheaf(barn, true);
}
}
@@ -5165,7 +5178,8 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
}
static __fastpath_inline
-unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
+unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, gfp_t gfp, size_t size,
+ void **p)
{
struct slub_percpu_sheaves *pcs;
struct slab_sheaf *main;
@@ -5199,7 +5213,8 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
return allocated;
}
- full = barn_replace_empty_sheaf(barn, pcs->main);
+ full = barn_replace_empty_sheaf(barn, pcs->main,
+ gfpflags_allow_spinning(gfp));
if (full) {
stat(s, BARN_GET);
@@ -5700,7 +5715,7 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
struct kmem_cache *s;
bool can_retry = true;
- void *ret = ERR_PTR(-EBUSY);
+ void *ret;
VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO |
__GFP_NO_OBJ_EXT));
@@ -5731,6 +5746,12 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
*/
return NULL;
+ ret = alloc_from_pcs(s, alloc_gfp, node);
+ if (ret)
+ goto success;
+
+ ret = ERR_PTR(-EBUSY);
+
/*
* Do not call slab_alloc_node(), since trylock mode isn't
* compatible with slab_pre_alloc_hook/should_failslab and
@@ -5767,6 +5788,7 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
ret = NULL;
}
+success:
maybe_wipe_obj_freeptr(s, ret);
slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret,
slab_want_init_on_alloc(alloc_gfp, s), size);
@@ -6087,7 +6109,8 @@ static void __pcs_install_empty_sheaf(struct kmem_cache *s,
* unlocked.
*/
static struct slub_percpu_sheaves *
-__pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
+__pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
+ bool allow_spin)
{
struct slab_sheaf *empty;
struct node_barn *barn;
@@ -6111,7 +6134,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
put_fail = false;
if (!pcs->spare) {
- empty = barn_get_empty_sheaf(barn);
+ empty = barn_get_empty_sheaf(barn, allow_spin);
if (empty) {
pcs->spare = pcs->main;
pcs->main = empty;
@@ -6125,7 +6148,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
return pcs;
}
- empty = barn_replace_full_sheaf(barn, pcs->main);
+ empty = barn_replace_full_sheaf(barn, pcs->main, allow_spin);
if (!IS_ERR(empty)) {
stat(s, BARN_PUT);
@@ -6133,7 +6156,8 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
return pcs;
}
- if (PTR_ERR(empty) == -E2BIG) {
+ /* sheaf_flush_unused() doesn't support !allow_spin */
+ if (PTR_ERR(empty) == -E2BIG && allow_spin) {
/* Since we got here, spare exists and is full */
struct slab_sheaf *to_flush = pcs->spare;
@@ -6158,6 +6182,14 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
alloc_empty:
local_unlock(&s->cpu_sheaves->lock);
+ /*
+ * alloc_empty_sheaf() doesn't support !allow_spin and it's
+ * easier to fall back to freeing directly without sheaves
+ * than add the support (and to sheaf_flush_unused() above)
+ */
+ if (!allow_spin)
+ return NULL;
+
empty = alloc_empty_sheaf(s, GFP_NOWAIT);
if (empty)
goto got_empty;
@@ -6200,7 +6232,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
* The object is expected to have passed slab_free_hook() already.
*/
static __fastpath_inline
-bool free_to_pcs(struct kmem_cache *s, void *object)
+bool free_to_pcs(struct kmem_cache *s, void *object, bool allow_spin)
{
struct slub_percpu_sheaves *pcs;
@@ -6211,7 +6243,7 @@ bool free_to_pcs(struct kmem_cache *s, void *object)
if (unlikely(pcs->main->size == s->sheaf_capacity)) {
- pcs = __pcs_replace_full_main(s, pcs);
+ pcs = __pcs_replace_full_main(s, pcs, allow_spin);
if (unlikely(!pcs))
return false;
}
@@ -6333,7 +6365,7 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
goto fail;
}
- empty = barn_get_empty_sheaf(barn);
+ empty = barn_get_empty_sheaf(barn, true);
if (empty) {
pcs->rcu_free = empty;
@@ -6453,7 +6485,7 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
goto no_empty;
if (!pcs->spare) {
- empty = barn_get_empty_sheaf(barn);
+ empty = barn_get_empty_sheaf(barn, true);
if (!empty)
goto no_empty;
@@ -6467,7 +6499,7 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
goto do_free;
}
- empty = barn_replace_full_sheaf(barn, pcs->main);
+ empty = barn_replace_full_sheaf(barn, pcs->main, true);
if (IS_ERR(empty)) {
stat(s, BARN_PUT_FAIL);
goto no_empty;
@@ -6719,7 +6751,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id())
&& likely(!slab_test_pfmemalloc(slab))) {
- if (likely(free_to_pcs(s, object)))
+ if (likely(free_to_pcs(s, object, true)))
return;
}
@@ -6980,6 +7012,12 @@ void kfree_nolock(const void *object)
* since kasan quarantine takes locks and not supported from NMI.
*/
kasan_slab_free(s, x, false, false, /* skip quarantine */true);
+
+ if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id())) {
+ if (likely(free_to_pcs(s, x, false)))
+ return;
+ }
+
do_slab_free(s, slab, x, x, 0, _RET_IP_);
}
EXPORT_SYMBOL_GPL(kfree_nolock);
@@ -7532,7 +7570,7 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
size--;
}
- i = alloc_from_pcs_bulk(s, size, p);
+ i = alloc_from_pcs_bulk(s, flags, size, p);
if (i < size) {
/*