Re: [PATCH v7] mm/slub: defer freelist construction until after bulk allocation from a new slab

From: Vlastimil Babka (SUSE)

Date: Fri Apr 17 2026 - 06:57:03 EST


On 4/15/26 10:52, hu.shengming@xxxxxxxxxx wrote:
> From: Shengming Hu <hu.shengming@xxxxxxxxxx>
>
> Allocations from a fresh slab can consume all of its objects, in which
> case the freelist built during slab allocation is immediately discarded
> and the work of building it is wasted.
>
> Instead of special-casing whole-slab bulk refills, defer freelist
> construction until after objects have been emitted from a fresh slab.
> new_slab() now only allocates the slab and initializes its metadata.
> refill_objects() then obtains a fresh slab and lets alloc_from_new_slab()
> emit objects directly, building a freelist only for the objects left
> unallocated; the same change is applied to alloc_single_from_new_slab().
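>
> As a minimal userspace sketch of the before/after difference (purely
> illustrative, not the mm/slub.c code: the "slab" here is just an array
> of fixed-size objects, with the freelist pointer stored at the start of
> each free object):
>
>   #include <stddef.h>
>
>   #define OBJ_SIZE 64
>   #define NR_OBJS  64
>
>   /* Old path: link every object into a freelist, then pop n of them. */
>   static void *bulk_old(char *slab, void **out, int n)
>   {
>           void *head = NULL;
>
>           for (int i = NR_OBJS - 1; i >= 0; i--) {
>                   void *obj = slab + i * OBJ_SIZE;
>                   *(void **)obj = head;
>                   head = obj;
>           }
>           for (int i = 0; i < n; i++) {
>                   out[i] = head;
>                   head = *(void **)head;
>           }
>           return head;    /* freelist of the remaining objects */
>   }
>
>   /* New path: emit the first n objects by address arithmetic alone,
>    * then link only the leftovers; no list is built when n == NR_OBJS. */
>   static void *bulk_new(char *slab, void **out, int n)
>   {
>           void *head = NULL;
>
>           for (int i = 0; i < n; i++)
>                   out[i] = slab + i * OBJ_SIZE;
>           for (int i = NR_OBJS - 1; i >= n; i--) {
>                   void *obj = slab + i * OBJ_SIZE;
>                   *(void **)obj = head;
>                   head = obj;
>           }
>           return head;
>   }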
>
> To keep the CONFIG_SLAB_FREELIST_RANDOM=y and =n cases on the same code
> path, introduce a small iterator abstraction for walking free objects in
> allocation order.
> The iterator is used both for filling the sheaf and for building the
> freelist of the remaining objects.
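>
> The shape of the iterator, roughly (names and layout here are
> illustrative, not necessarily those in the patch):
>
>   struct free_obj_iter {
>           char *base;              /* first object in the slab */
>           unsigned int size;       /* object stride */
>           unsigned int pos;        /* next position in allocation order */
>           const unsigned int *seq; /* random permutation, or NULL */
>   };
>
>   static void *free_obj_next(struct free_obj_iter *it)
>   {
>           unsigned int idx = it->seq ? it->seq[it->pos] : it->pos;
>
>           it->pos++;
>           return it->base + idx * it->size;
>   }
>
>   static void refill(struct free_obj_iter *it, void **p, int batch)
>   {
>           /* Fill the sheaf first ... */
>           for (int i = 0; i < batch; i++)
>                   p[i] = free_obj_next(it);
>           /* ... and the caller then links whatever the iterator still
>            * has into the slab's freelist, using the same iterator. */
>   }
>
> Only the index lookup differs between the two configurations, so the
> randomized and sequential cases share all of the surrounding code.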
>
> Also mark setup_object() inline. After this optimization, the compiler no
> longer consistently inlines this helper in the hot path, which can hurt
> performance. Explicitly marking it inline restores the expected code
> generation.
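>
> The change itself is only the storage-class hint, roughly (a sketch;
> the exact signature in mm/slub.c may differ):
>
>   -static void *setup_object(struct kmem_cache *s, void *object)
>   +static inline void *setup_object(struct kmem_cache *s, void *object)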
>
> This reduces per-object overhead when allocating from a fresh slab.
> The most direct benefit is in the paths that allocate objects first and
> only build a freelist for the remainder afterward: bulk allocation from
> a new slab in refill_objects(), single-object allocation from a new slab
> in ___slab_alloc(), and the corresponding early-boot paths that now use
> the same deferred-freelist scheme. Since refill_objects() is also used to
> refill sheaves, the optimization is not limited to the small set of
> kmem_cache_alloc_bulk()/kmem_cache_free_bulk() users; regular allocation
> workloads may benefit as well when they refill from a fresh slab.
>
> In slub_bulk_bench, the time per object drops by about 32% to 70% with
> CONFIG_SLAB_FREELIST_RANDOM=n, and by about 58% to 70% with
> CONFIG_SLAB_FREELIST_RANDOM=y. This benchmark is intended to isolate the
> cost removed by this change: each iteration allocates exactly
> slab->objects from a fresh slab. That makes it a near best-case scenario
> for deferred freelist construction: the old path always built a full
> freelist even though no objects would be left on it, while the new path
> skips that work entirely. Realistic workloads may see smaller end-to-end
> gains, depending on how often allocations reach this fresh-slab refill
> path.
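>
> Each measured round looks roughly like this (a sketch with illustrative
> variable names; the real module is in the repository linked below):
>
>   ktime_t t0, t1;
>   int n;
>
>   /* batch is sized so that batch == slab->objects, i.e. each round
>    * drains exactly one fresh slab. */
>   t0 = ktime_get();
>   n = kmem_cache_alloc_bulk(cache, GFP_KERNEL, batch, objs);
>   t1 = ktime_get();
>   ns_per_obj = ktime_to_ns(ktime_sub(t1, t0)) / n;
>   kmem_cache_free_bulk(cache, n, objs);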
>
> Benchmark results (slub_bulk_bench):
> Machine: qemu-system-x86 -m 1024M -smp 8 -enable-kvm -cpu host
> Kernel: Linux 7.0.0-rc7-next-20260407
> Config: x86_64_defconfig
> CPU: 0
> Rounds: 20
> Total: 256MB
>
> - CONFIG_SLAB_FREELIST_RANDOM=n -
>
> obj_size=16, batch=256:
> before: 4.85 +- 0.08 ns/object
> after: 3.30 +- 0.20 ns/object
> delta: -31.9%
>
> obj_size=32, batch=128:
> before: 6.89 +- 0.07 ns/object
> after: 3.74 +- 0.06 ns/object
> delta: -45.7%
>
> obj_size=64, batch=64:
> before: 10.70 +- 0.17 ns/object
> after: 4.60 +- 0.12 ns/object
> delta: -57.0%
>
> obj_size=128, batch=32:
> before: 18.69 +- 0.26 ns/object
> after: 6.54 +- 1.30 ns/object
> delta: -65.0%
>
> obj_size=256, batch=32:
> before: 22.36 +- 0.24 ns/object
> after: 6.61 +- 0.09 ns/object
> delta: -70.5%
>
> obj_size=512, batch=32:
> before: 20.59 +- 0.36 ns/object
> after: 6.90 +- 0.15 ns/object
> delta: -66.5%
>
> - CONFIG_SLAB_FREELIST_RANDOM=y -
>
> obj_size=16, batch=256:
> before: 8.77 +- 0.11 ns/object
> after: 3.63 +- 0.09 ns/object
> delta: -58.6%
>
> obj_size=32, batch=128:
> before: 11.59 +- 0.31 ns/object
> after: 4.24 +- 0.12 ns/object
> delta: -63.4%
>
> obj_size=64, batch=64:
> before: 15.58 +- 0.51 ns/object
> after: 5.32 +- 0.11 ns/object
> delta: -65.9%
>
> obj_size=128, batch=32:
> before: 22.13 +- 0.63 ns/object
> after: 7.39 +- 0.20 ns/object
> delta: -66.6%
>
> obj_size=256, batch=32:
> before: 27.12 +- 0.74 ns/object
> after: 7.92 +- 0.08 ns/object
> delta: -70.8%
>
> obj_size=512, batch=32:
> before: 26.92 +- 0.32 ns/object
> after: 8.28 +- 0.26 ns/object
> delta: -69.2%
>
> Link: https://github.com/HSM6236/slub_bulk_test.git
> Suggested-by: Harry Yoo (Oracle) <harry@xxxxxxxxxx>
> Reviewed-by: Harry Yoo (Oracle) <harry@xxxxxxxxxx>
> Reviewed-by: Hao Li <hao.li@xxxxxxxxx>
> Tested-by: Hao Li <hao.li@xxxxxxxxx>
> Signed-off-by: Shengming Hu <hu.shengming@xxxxxxxxxx>

Thanks, LGTM. Will pick this up to slab/for-next after 7.1-rc1 is released.