[PATCH v6 00/33] SLUB: reduce irq disabled scope and make it RT compatible

From: Vlastimil Babka
Date: Sat Sep 04 2021 - 06:50:18 EST


(Linus and Stephen only To:'d on the cover letter for the git part)

Hi,

this new version addresses feedback from Linus wrt use of conditional locking.
The flush_slab() one is fixed as was discussed, per Mike Galbraith's
suggestion [14]. More instances in the series occuring in cmpxchg_double_slab()
and slab_[un]lock() were removed too. As such, the cleanest way to do this
without churn was to rebase, drop some patches and adjust others (see changelog
below). As there was rebase anyway, bumped the base tag from v5.14-rc6 to
v5.14 as well. The resulting differences from v5 should be only how the code is
organized, not functional. I will of course continue testing, as I hope Mel and
the RT guys would soon too.

As Linus mentioned the possibility of taking this directly from git [15] I have
created a signed tag for a later (near end of merge window?) pull request
assuming nothing blows up:
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git tags/mm-slub-5.15-rc1

I assume that would mean dropping the series from mmotm now and instead adding
the tree to linux-next? AFAIK there are no further slub patches pending anywhere
so no conflicts.

Changes since v5 [11]
* Rebased from v5.14-rc6 to v5.14
* Removed conditional locking from flush_slab(), instead the non-locking variant
is moved to its only caller (patches 27-28)
* In the same vein, removed patches "mm, slub: unify cmpxchg_double_slab() and
__cmpxchg_double_slab()" [12] and "mm, slub: optionally save/restore irqs in
slab_[un]lock()" [13] that introduced conditional locking elsewhere. Patch 30
"mm, slub: make slab_lock() disable irqs with PREEMPT_RT" adjusted
accordingly

Links to rarlier versions (changelog trimmed):
v4 [10], v3 [8], v2 [5], RFC v1 [1]:

This series was initially inspired by Mel's pcplist local_lock rewrite, and
also interest to better understand SLUB's locking and the new primitives and RT
variants and implications. It makes SLUB compatible with PREEMPT_RT and
generally more preemption-friendly, apparently without significant regressions,
as the fast paths are not affected.

The main changes to SLUB by this series:

* irq disabling is now only done for minimum amount of time needed to protect
the strict kmem_cache_cpu fields, and as part of spin lock, local lock and
bit lock operations to make them irq-safe

* SLUB is fully PREEMPT_RT compatible

Series is based on 5.14 and also available as a git branch:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v6r2

And signed tag:
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git tags/mm-slub-5.15-rc1

The series should now be sufficiently tested in both RT and !RT configs, mainly
thanks to Mike.

The RFC/v1 version also got basic performance screening by Mel that didn't show
major regressions. Mike's testing with hackbench of v2 on !RT reported
negligible differences [6]:

virgin(ish) tip
5.13.0.g60ab3ed-tip
7,320.67 msec task-clock # 7.792 CPUs utilized ( +- 0.31% )
221,215 context-switches # 0.030 M/sec ( +- 3.97% )
16,234 cpu-migrations # 0.002 M/sec ( +- 4.07% )
13,233 page-faults # 0.002 M/sec ( +- 0.91% )
27,592,205,252 cycles # 3.769 GHz ( +- 0.32% )
8,309,495,040 instructions # 0.30 insn per cycle ( +- 0.37% )
1,555,210,607 branches # 212.441 M/sec ( +- 0.42% )
5,484,209 branch-misses # 0.35% of all branches ( +- 2.13% )

0.93949 +- 0.00423 seconds time elapsed ( +- 0.45% )
0.94608 +- 0.00384 seconds time elapsed ( +- 0.41% ) (repeat)
0.94422 +- 0.00410 seconds time elapsed ( +- 0.43% )

5.13.0.g60ab3ed-tip +slub-local-lock-v2r3
7,343.57 msec task-clock # 7.776 CPUs utilized ( +- 0.44% )
223,044 context-switches # 0.030 M/sec ( +- 3.02% )
16,057 cpu-migrations # 0.002 M/sec ( +- 4.03% )
13,164 page-faults # 0.002 M/sec ( +- 0.97% )
27,684,906,017 cycles # 3.770 GHz ( +- 0.45% )
8,323,273,871 instructions # 0.30 insn per cycle ( +- 0.28% )
1,556,106,680 branches # 211.901 M/sec ( +- 0.31% )
5,463,468 branch-misses # 0.35% of all branches ( +- 1.33% )

0.94440 +- 0.00352 seconds time elapsed ( +- 0.37% )
0.94830 +- 0.00228 seconds time elapsed ( +- 0.24% ) (repeat)
0.93813 +- 0.00440 seconds time elapsed ( +- 0.47% ) (repeat)

RT configs showed some throughput regressions, but that's expected tradeoff for
the preemption improvements through the RT mutex. It didn't prevent the v2 to
be incorporated to the 5.13 RT tree [7], leading to testing exposure and
bugfixes.

Before the series, SLUB is lockless in both allocation and free fast paths, but
elsewhere, it's disabling irqs for considerable periods of time - especially in
allocation slowpath and the bulk allocation, where IRQs are re-enabled only
when a new page from the page allocator is needed, and the context allows
blocking. The irq disabled sections can then include deactivate_slab() which
walks a full freelist and frees the slab back to page allocator or
unfreeze_partials() going through a list of percpu partial slabs. The RT tree
currently has some patches mitigating these, but we can do much better in
mainline too.

Patches 1-6 are straightforward improvements or cleanups that could exist
outside of this series too, but are prerequsities.

Patches 7-9 are also preparatory code changes without functional changes, but
not so useful without the rest of the series.

Patch 10 simplifies the fast paths on systems with preemption, based on
(hopefully correct) observation that the current loops to verify tid are
unnecessary.

Patches 11-20 focus on reducing irq disabled scope in the allocation slowpath.

Patch 11 moves disabling of irqs into ___slab_alloc() from its callers, which
are the allocation slowpath, and bulk allocation. Instead these callers only
disable preemption to stabilize the cpu. The following patches then gradually
reduce the scope of disabled irqs in ___slab_alloc() and the functions called
from there. As of patch 14, the re-enabling of irqs based on gfp flags before
calling the page allocator is removed from allocate_slab(). As of patch 17,
it's possible to reach the page allocator (in case of existing slabs depleted)
without disabling and re-enabling irqs a single time.

Pathces 21-26 reduce the scope of disabled irqs in functions related to
unfreezing percpu partial slab.

Patch 27 is preparatory. Patch 28 is adopted from the RT tree and converts the
flushing of percpu slabs on all cpus from using IPI to workqueue, so that the
processing isn't happening with irqs disabled in the IPI handler. The flushing
is not performance critical so it should be acceptable.

Patch 29 also comes from RT tree and makes object_map_lock RT compatible.

Patch 30 make slab_lock irq-safe on RT where we cannot rely on having
irq disabled from the list_lock spin lock usage.

Patch 31 changes kmem_cache_cpu->partial handling in put_cpu_partial() from
cmpxchg loop to a short irq disabled section, which is used by all other code
modifying the field. This addresses a theoretical race scenario pointed out by
Jann, and makes the critical section safe wrt with RT local_lock semantics
after the conversion in patch 35.

Patch 32 changes preempt disable to migrate disable, so that the nested
list_lock spinlock is safe to take on RT. Because migrate_disable() is a
function call even on !RT, a small set of private wrappers is introduced
to keep using the cheaper preempt_disable() on !PREEMPT_RT configurations.
As of this patch, SLUB should be already compatible with RT's lock semantics.

Finally, patch 33 changes irq disabled sections that protect kmem_cache_cpu
fields in the slow paths, with a local lock. However on PREEMPT_RT it means the
lockless fast paths can now preempt slow paths which don't expect that, so the
local lock has to be taken also in the fast paths and they are no longer
lockless. RT folks seem to not mind this tradeoff. The patch also updates the
locking documentation in the file's comment.

[1] https://lore.kernel.org/lkml/20210524233946.20352-1-vbabka@xxxxxxx/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0001-mm-sl-au-b-Change-list_lock-to-raw_spinlock_t.patch?h=linux-5.12.y-rt-patches
[3] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0004-mm-slub-Move-discard_slab-invocations-out-of-IRQ-off.patch?h=linux-5.12.y-rt-patches
[4] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0005-mm-slub-Move-flush_cpu_slab-invocations-__free_slab-.patch?h=linux-5.12.y-rt-patches
[5] https://lore.kernel.org/lkml/20210609113903.1421-1-vbabka@xxxxxxx/
[6] https://lore.kernel.org/lkml/891dc24e38106f8542f4c72831d52dc1a1863ae8.camel@xxxxxx
[7] https://lore.kernel.org/linux-rt-users/87tul5p2fa.ffs@xxxxxxxxxxxxxxxxxxxxxxx/
[8] https://lore.kernel.org/lkml/20210729132132.19691-1-vbabka@xxxxxxx/
[9] https://lore.kernel.org/lkml/20210804120522.GD6464@xxxxxxxxxxxxxxxxxxx/
[10] https://lore.kernel.org/lkml/20210805152000.12817-1-vbabka@xxxxxxx/
[11] https://lore.kernel.org/all/20210823145826.3857-1-vbabka@xxxxxxx/
[12] https://lore.kernel.org/all/20210823145826.3857-7-vbabka@xxxxxxx/
[13] https://lore.kernel.org/all/20210823145826.3857-32-vbabka@xxxxxxx/
[14] https://lore.kernel.org/linux-mm/1ae902f7-c500-f9e8-1b4f-077beade0f42@xxxxxxx/
[15] https://lore.kernel.org/linux-mm/CAHk-=wjRfFtnQ5p42s_5Uv8i0U5YKSBpTH++_ZMKZyyvYicYmQ@xxxxxxxxxxxxxx/
[16] https://lore.kernel.org/all/871r6j526m.ffs@tglx/

Sebastian Andrzej Siewior (2):
mm: slub: move flush_cpu_slab() invocations __free_slab() invocations
out of IRQ context
mm: slub: make object_map_lock a raw_spinlock_t

Vlastimil Babka (31):
mm, slub: don't call flush_all() from slab_debug_trace_open()
mm, slub: allocate private object map for debugfs listings
mm, slub: allocate private object map for validate_slab_cache()
mm, slub: don't disable irq for debug_check_no_locks_freed()
mm, slub: remove redundant unfreeze_partials() from put_cpu_partial()
mm, slub: extract get_partial() from new_slab_objects()
mm, slub: dissolve new_slab_objects() into ___slab_alloc()
mm, slub: return slab page from get_partial() and set c->page
afterwards
mm, slub: restructure new page checks in ___slab_alloc()
mm, slub: simplify kmem_cache_cpu and tid setup
mm, slub: move disabling/enabling irqs to ___slab_alloc()
mm, slub: do initial checks in ___slab_alloc() with irqs enabled
mm, slub: move disabling irqs closer to get_partial() in
___slab_alloc()
mm, slub: restore irqs around calling new_slab()
mm, slub: validate slab from partial list or page allocator before
making it cpu slab
mm, slub: check new pages with restored irqs
mm, slub: stop disabling irqs around get_partial()
mm, slub: move reset of c->page and freelist out of deactivate_slab()
mm, slub: make locking in deactivate_slab() irq-safe
mm, slub: call deactivate_slab() without disabling irqs
mm, slub: move irq control into unfreeze_partials()
mm, slub: discard slabs in unfreeze_partials() without irqs disabled
mm, slub: detach whole partial list at once in unfreeze_partials()
mm, slub: separate detaching of partial list in unfreeze_partials()
from unfreezing
mm, slub: only disable irq with spin_lock in __unfreeze_partials()
mm, slub: don't disable irqs in slub_cpu_dead()
mm, slab: split out the cpu offline variant of flush_slab()
mm, slub: make slab_lock() disable irqs with PREEMPT_RT
mm, slub: protect put_cpu_partial() with disabled irqs instead of
cmpxchg
mm, slub: use migrate_disable() on PREEMPT_RT
mm, slub: convert kmem_cpu_slab protection to local_lock

include/linux/page-flags.h | 9 +
include/linux/slub_def.h | 6 +
mm/slab_common.c | 2 +
mm/slub.c | 797 +++++++++++++++++++++++++------------
4 files changed, 563 insertions(+), 251 deletions(-)

--
2.33.0