[PATCH RFC 0/6] mm, slub: handle pending kfree_rcu() in kmem_cache_destroy()

From: Vlastimil Babka
Date: Mon Jul 15 2024 - 16:30:08 EST


First RFC, feel free to ignore for now if too busy with merge window.
Also in git:
https://git.kernel.org/vbabka/l/slab-kfree_rcu-destroy-v1r0

Based on slab/for-next:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-next

Since SLOB was removed, we have allowed kfree_rcu() for objects
allocated from any kmem_cache in addition to kmalloc().

Recently we have attempted to replace existing call_rcu() usage with
kfree_rcu() where the callback is a plain kmem_cache_free(), in a series
by Julia Lawall [1].

Jakub Kicinski pointed out [2] this was tried already in batman-adv but
had to be reverted due to kmem_cache_destroy() failing due to objects
remaining in the cache, despite rcu_barrier() being used.

Jason Donenfeld found the culprit [3] being a35d16905efc ("rcu: Add
basic support for kfree_rcu() batching") causing rcu_barrier() to be
insufficient.

This was never a problem for kfree_rcu() usage on kmalloc() objects as
the kmalloc caches are never destroyed, but arbitrary caches can be,
e.g. due to module unload.

Out of the possible solutions collected by Paul McKenney [4] the most
appealing to me is "kmem_cache_destroy() lingers for kfree_rcu()" as
it adds no additional concerns to kfree_rcu() users. We already have
the precedence in some parts of the kmem_cache cleanup being done
asynchronously for SLAB_TYPESAFE_BY_RCU caches.

This series implements the necessary changes to the slab allocator,
mainly that if there are objects remaining in the cache when
kmem_cache_destroy() is called it is assumed that it's due to pending
kfree_rcu(), and a asynchronous work is scheduled that performs the
necessary barrier and then tries again. If objects remain after the
barrier, the usual warning is reported and the cache remains
undestroyed.

Notably the proper barrier doesn't yet exist so hopefully Paul or the
RCU team can help here :) it should be able to make it pessimistic as it
won't hold up anything but the work item.

Some downsides of this approach for debugging exist but should be small
enough:

- we can no longer report the stack trace leading to a premature
kmem_cache_destroy(), but arguably that's not that interesting as the
allocation traces and other details about the remaining objects, which
are still reported, just a bit later.
- objects that are freed after kmem_cache_destroy() but before the work
item proceeds with the destroy after the barrier, are technically
bugs, but we won't be able to catch them unless we add some checks into
the freeing hotpaths. It's not worth it, IMHO.

There is also a bunch of preliminary steps. The potentially visible one
is that sysfs and debugfs directories are now removed immediately during
kmem_cache_destroy() - previously this would be delayed for
SLAB_TYPESAFE_BY_RCU caches or left around forever if remaining objects
were detected. The extra delay by asynchronous destroy due to
kfree_rcu() could mean that a module unload/load cycle could create a
new instance of the cache which would fail to create these directories -
a concern raised by Paul. The immediate removal is the simplest solution
(compared to e.g. renaming the directories) and should not make debugging
harder - while it won't be possible to check debugfs for allocation
traces of leaked objects, they are listed with more detail in dmesg
anyway.

[1] https://lore.kernel.org/all/20240609082726.32742-1-Julia.Lawall@xxxxxxxx/
[2] https://lore.kernel.org/all/20240612143305.451abf58@xxxxxxxxxx/
[3] https://lore.kernel.org/all/Zmo9-YGraiCj5-MI@xxxxxxxxx/
[4] https://docs.google.com/document/d/1v0rcZLvvjVGejT3523W0rDy_sLFu2LWc_NR3fQItZaA/edit

To: Paul E. McKenney <paulmck@xxxxxxxxxx>
To: Joel Fernandes <joel@xxxxxxxxxxxxxxxxx>
To: Josh Triplett <josh@xxxxxxxxxxxxxxxx>
To: Boqun Feng <boqun.feng@xxxxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
CC: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Cc: Lai Jiangshan <jiangshanlai@xxxxxxxxx>
Cc: Zqiang <qiang.zhang1211@xxxxxxxxx>
Cc: Julia Lawall <Julia.Lawall@xxxxxxxx>
Cc: Jakub Kicinski <kuba@xxxxxxxxxx>
Cc: Jason A. Donenfeld <Jason@xxxxxxxxx>
Cc: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx>
To: Christoph Lameter <cl@xxxxxxxxx>
To: David Rientjes <rientjes@xxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Roman Gushchin <roman.gushchin@xxxxxxxxx>
Cc: Hyeonggon Yoo <42.hyeyoo@xxxxxxxxx>
Cc: linux-mm@xxxxxxxxx
Cc: linux-kernel@xxxxxxxxxxxxxxx
Cc: rcu@xxxxxxxxxxxxxxx

Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
---
Vlastimil Babka (6):
mm, slab: make caches with refcount of 0 unmergeable
mm, slab: always maintain per-node slab and object count
mm, slab: unlink sysfs and debugfs immediately
mm, slab: simplify kmem_cache_release()
mm, slab: asynchronously destroy caches with outstanding objects
kunit, slub: add test_kfree_rcu()

lib/slub_kunit.c | 22 ++++++++++
mm/slab.h | 4 +-
mm/slab_common.c | 121 +++++++++++++++++++++++++++++++++++--------------------
mm/slub.c | 58 ++++++++++++--------------
4 files changed, 129 insertions(+), 76 deletions(-)
---
base-commit: 436381eaf2a423e60fc8340399f7d2458091b383
change-id: 20240715-b4-slab-kfree_rcu-destroy-85dd2b2ded92

Best regards,
--
Vlastimil Babka <vbabka@xxxxxxx>