Re: [PATCH] kmemleak: Avoid RCU stalls when freeing metadata for per-CPU pointers

From: Waiman Long
Date: Fri Dec 01 2023 - 17:22:36 EST


On 12/1/23 14:08, Catalin Marinas wrote:
On systems with large number of CPUs, the following soft lockup splat
might sometimes happen:

[ 2656.001617] watchdog: BUG: soft lockup - CPU#364 stuck for 21s! [ksoftirqd/364:2206]
:
[ 2656.141194] RIP: 0010:_raw_spin_unlock_irqrestore+0x3d/0x70
:
2656.241214] Call Trace:
[ 2656.243971] <IRQ>
[ 2656.246237] ? show_trace_log_lvl+0x1c4/0x2df
[ 2656.251152] ? show_trace_log_lvl+0x1c4/0x2df
[ 2656.256066] ? kmemleak_free_percpu+0x11f/0x1f0
[ 2656.261173] ? watchdog_timer_fn+0x379/0x470
[ 2656.265984] ? __pfx_watchdog_timer_fn+0x10/0x10
[ 2656.271179] ? __hrtimer_run_queues+0x5f3/0xd00
[ 2656.276283] ? __pfx___hrtimer_run_queues+0x10/0x10
[ 2656.281783] ? ktime_get_update_offsets_now+0x95/0x2c0
[ 2656.287573] ? ktime_get_update_offsets_now+0xdd/0x2c0
[ 2656.293380] ? hrtimer_interrupt+0x2e9/0x780
[ 2656.298221] ? __sysvec_apic_timer_interrupt+0x184/0x640
[ 2656.304211] ? sysvec_apic_timer_interrupt+0x8e/0xc0
[ 2656.309807] </IRQ>
[ 2656.312169] <TASK>
[ 2656.326110] kmemleak_free_percpu+0x11f/0x1f0
[ 2656.331015] free_percpu.part.0+0x1b/0xe70
[ 2656.335635] free_vfsmnt+0xb9/0x100
[ 2656.339567] rcu_do_batch+0x3c8/0xe30
[ 2656.363693] rcu_core+0x3de/0x5a0
[ 2656.367433] __do_softirq+0x2d0/0x9a8
[ 2656.381119] run_ksoftirqd+0x36/0x60
[ 2656.385145] smpboot_thread_fn+0x556/0x910
[ 2656.394971] kthread+0x2a4/0x350
[ 2656.402826] ret_from_fork+0x29/0x50
[ 2656.406861] </TASK>

The issue is caused by kmemleak registering each per_cpu_ptr()
corresponding to the __percpu pointer. This is unnecessary since such
individual per-CPU pointers are not tracked anyway. Create a new
object_percpu_tree_root rbtree that stores a single __percpu pointer
together with an OBJECT_PERCPU flag for the kmemleak metadata. Scanning
needs to be done for all per_cpu_ptr() pointers with a cond_resched()
between each CPU iteration to avoid RCU stalls.

Signed-off-by: Catalin Marinas <catalin.marinas@xxxxxxx>
Reported-by: Waiman Long <longman@xxxxxxxxxx>
Link: https://lore.kernel.org/r/20231127194153.289626-1-longman@xxxxxxxxxx
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

The only difference from the inlined patch I posted previously is some updated
comments to include the new object tree.

mm/kmemleak.c | 178 +++++++++++++++++++++++++++-----------------------
1 file changed, 97 insertions(+), 81 deletions(-)

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 1eacca03bedd..eb6cdc3e9af2 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -14,17 +14,15 @@
* The following locks and mutexes are used by kmemleak:
*
* - kmemleak_lock (raw_spinlock_t): protects the object_list as well as
- * del_state modifications and accesses to the object_tree_root (or
- * object_phys_tree_root). The object_list is the main list holding the
- * metadata (struct kmemleak_object) for the allocated memory blocks.
- * The object_tree_root and object_phys_tree_root are red
- * black trees used to look-up metadata based on a pointer to the
- * corresponding memory block. The object_phys_tree_root is for objects
- * allocated with physical address. The kmemleak_object structures are
- * added to the object_list and object_tree_root (or object_phys_tree_root)
- * in the create_object() function called from the kmemleak_alloc() (or
- * kmemleak_alloc_phys()) callback and removed in delete_object() called from
- * the kmemleak_free() callback
+ * del_state modifications and accesses to the object trees
+ * (object_tree_root, object_phys_tree_root, object_percpu_tree_root). The
+ * object_list is the main list holding the metadata (struct
+ * kmemleak_object) for the allocated memory blocks. The object trees are
+ * red black trees used to look-up metadata based on a pointer to the
+ * corresponding memory block. The kmemleak_object structures are added to
+ * the object_list and the object tree root in the create_object() function
+ * called from the kmemleak_alloc() (or kmemleak_alloc_phys()) callback and
+ * removed in delete_object() called from the kmemleak_free() callback

Just a minor nit. For completeness, should we mention kmemleak_alloc_percpu() and kmemleak_free_percpu() here?

Anyway, I won't mind if you want to keep it as it is.

Reviewed-by: Waiman Long <longman@xxxxxxxxxx>