Re: [RESEND PATCH] smp: Avoid invalid per-CPU CSD lookup with CSD lock debug

From: Chuyi Zhou

Date: Wed Jun 03 2026 - 09:30:12 EST

Hi Paul,

On 2026-05-23 12:27 p.m., Chuyi Zhou wrote:
> Commit b0473dcd4b1d ("smp: Improve smp_call_function_single()
> CSD-lock diagnostics") made smp_call_function_single() use the destination
> CPU's csd_data when CSD lock debugging is enabled. That lets the debug code
> associate a stuck CSD lock with the target CPU, but it also means the CPU
> argument is used in per_cpu_ptr() before generic_exec_single() has a chance
> to validate it.
>
> This becomes unsafe when smp_call_function_any() cannot find an online CPU
> in the supplied mask. In that case the selected CPU can be nr_cpu_ids, and
> the !wait path calls get_single_csd_data(cpu) before generic_exec_single()
> returns -ENXIO. With csdlock_debug_enabled set, that indexes the per-CPU
> offset array with an invalid CPU number.
>
> Use the destination CPU's csd_data only when the CPU number is within
> nr_cpu_ids. For invalid CPU numbers, fall back to the local CPU's csd_data
> and let generic_exec_single() perform the existing validation and return
> -ENXIO.
>
> Fixes: b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock diagnostics")
> Signed-off-by: Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx>
> Reviewed-by: Paul E. McKenney <paulmck@xxxxxxxxxx>

I hit a soft-lockup panic while running scftorture on a large x86_64 SMP
machine with a kernel based on origin/sched/core. The issue appears to
be a regression from:

b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock
diagnostics")

The reproduction environment was:

base: origin/sched/core
machine: large x86_64 SMP system;
CONFIG_CSD_LOCK_WAIT_DEBUG=y

The workload was:

modprobe scftorture nthreads=1024 onoff_holdoff=50 onoff_interval=100

The visible panic was a secondary soft lockup. CPU 78 was exiting a
process and got stuck waiting for CPU 81 to run a synchronous TLB
shootdown callback:

watchdog: BUG: soft lockup - CPU#78 stuck for 30s! [nc:13732]
Kernel panic - not syncing: softlockup: hung tasks

smp: csd: Detected non-responsive CSD lock (#118) on CPU#78,
waiting 5000000032 ns for CPU#81 flush_tlb_func+0x0/0x290(...)
smp: csd: Re-sending CSD lock (#118) IPI from CPU#78 to CPU#81

CPU 78's stack was:

csd_lock_wait_toolong
smp_call_function_many_cond
on_each_cpu_cond_mask
flush_tlb_mm_range
tlb_finish_mmu
exit_mmap
__mmput
do_exit
do_group_exit
__x64_sys_exit_group

CPU 81 was running a scftorture invoker and was already inside the
call-function-single interrupt path:

CPU: 81 PID: 13823 Comm: scftorture_invo
RIP: __flush_smp_call_function_queue+0x70/0x4d0
__sysvec_call_function_single
sysvec_call_function_single
asm_sysvec_call_function_single
ktime_get_mono_fast_ns
csd_lock_wait_toolong
smp_call_function_single
scftorture_invoker

The dump showed CPU 81 scanning a corrupted async scftorture CSD:

ff157dfa7fe72e40: ff157dfa7fe72e40 0051007000000001
func = scf_handler_1 [scftorture]

Decoding that CSD:

u_flags = 0x1 -> CSD_FLAG_LOCK | CSD_TYPE_ASYNC
src = 112
dst = 81

The first word is the llist next pointer, and it points back to the CSD
itself:

csd->node.llist.next == csd

That self-loop explains why CPU 81 made no progress in
__flush_smp_call_function_queue().

There was later work still pending on CPU 81's live call_single_queue:

call_single_queue[81].first = ff157dfa7fe79c00

That queued entry was CPU 78's synchronous TLB callback:

u_flags = 0x11 -> CSD_FLAG_LOCK | CSD_TYPE_SYNC
src = 78
dst = 81
func = flush_tlb_func

So CPU 78's TLB shootdown CSD was queued, but CPU 81 was already stuck
in an earlier invocation of __flush_smp_call_function_queue().

The root cause seems to be that b0473dcd4b1d changes the !wait path of
smp_call_function_single() to use the destination CPU's csd_data when
csdlock_debug_enabled is set:

if (static_branch_unlikely(&csdlock_debug_enabled))
return per_cpu_ptr(&csd_data, cpu);
return this_cpu_ptr(&csd_data);

The important invariant is that the internal smp-call CSDs are normally
single-writer objects. With the old this_cpu_ptr(&csd_data) allocation,
smp_call_function_single(..., wait = 0) used the current CPU's csd_data,
so each sending CPU had its own CSD. smp_call_function() goes through
smp_call_function_many_cond(), and that path also starts from the
current CPU's cfd_data:

cfd = this_cpu_ptr(&cfd_data);
csd = per_cpu_ptr(cfd->csd, cpu);

get_cpu() keeps preemption disabled while the CSD is prepared and
queued, so another task on the same CPU cannot concurrently reuse that
sender's cfd/csd storage either.

This is why the non-atomic csd_lock() worked before: the target CPU
could still observe and unlock the CSD from the previous request, but
there was only one CPU that could prepare and enqueue that particular
CSD object. With the debug-mode destination-CPU allocation, all CPUs
sending async single calls to CPU 81 contend for the same object:

csd_data[81] == ff157dfa7fe72e40

However, csd_lock() is not an atomic test-and-set lock:

csd_lock_wait(csd);
csd->node.u_flags |= CSD_FLAG_LOCK;

That is safe under the previous single-writer rule, but it is not safe
once many remote CPUs can acquire the same destination CPU CSD. Two
senders can both observe the lock clear, both set CSD_FLAG_LOCK,
overwrite func/info/src/dst and then both enqueue the same llist node:

llist_add(&csd->node.llist, &per_cpu(call_single_queue, 81))

If the same node is added again while it is already the queue head,
llist_add() writes the old head into new->next. Since old head == new,
this produces:

csd->node.llist.next = csd

That exactly matches the dump.

I see two possible fixes:

1. Revert b0473dcd4b1d. This restores the old source-CPU csd_data
allocation and therefore restores the single-writer property, at the
cost of losing the improved destination-CPU CSD lock diagnostics.

2. Keep the destination-CPU csd_data behavior for CSD lock debugging,
but make csd_lock() an atomic test-and-set lock when
csdlock_debug_enabled is set.

One possible implementation is:

kernel/smp.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)

diff --git a/kernel/smp.c b/kernel/smp.c
index 15799f842746..e1d2fd3c9d9f 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -362,6 +362,34 @@ static __always_inline void
csd_lock_wait(call_single_data_t *csd)
}
#endif

+#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
+static __always_inline void csd_lock(call_single_data_t *csd)
+{
+ if (static_branch_unlikely(&csdlock_debug_enabled)) {
+ unsigned int flags;
+
+ for (;;) {
+ __csd_lock_wait(csd);
+ flags = READ_ONCE(csd->node.u_flags);
+ if (!(flags & CSD_FLAG_LOCK) &&
+ try_cmpxchg_acquire(&csd->node.u_flags, &flags,
+ flags | CSD_FLAG_LOCK))
+ break;
+ cpu_relax();
+ }
+ } else {
+ csd_lock_wait(csd);
+ csd->node.u_flags |= CSD_FLAG_LOCK;
+ }
+
+ /*
+ * prevent CPU from reordering the above assignment
+ * to ->flags with any subsequent assignments to other
+ * fields of the specified call_single_data_t structure:
+ */
+ smp_wmb();
+}
+#else
static __always_inline void csd_lock(call_single_data_t *csd)
{
csd_lock_wait(csd);
@@ -374,6 +402,7 @@ static __always_inline void
csd_lock(call_single_data_t *csd)
*/
smp_wmb();
}
+#endif

static __always_inline void csd_unlock(call_single_data_t *csd)
{
--