Re: [RESEND PATCH] smp: Avoid invalid per-CPU CSD lookup with CSD lock debug

From: Paul E. McKenney

Date: Thu Jun 04 2026 - 13:52:13 EST

On Wed, Jun 03, 2026 at 09:21:10PM +0800, Chuyi Zhou wrote:
> Hi Paul,
>
> On 2026-05-23 12:27 p.m., Chuyi Zhou wrote:
> > Commit b0473dcd4b1d ("smp: Improve smp_call_function_single()
> > CSD-lock diagnostics") made smp_call_function_single() use the destination
> > CPU's csd_data when CSD lock debugging is enabled. That lets the debug code
> > associate a stuck CSD lock with the target CPU, but it also means the CPU
> > argument is used in per_cpu_ptr() before generic_exec_single() has a chance
> > to validate it.
> >
> > This becomes unsafe when smp_call_function_any() cannot find an online CPU
> > in the supplied mask. In that case the selected CPU can be nr_cpu_ids, and
> > the !wait path calls get_single_csd_data(cpu) before generic_exec_single()
> > returns -ENXIO. With csdlock_debug_enabled set, that indexes the per-CPU
> > offset array with an invalid CPU number.
> >
> > Use the destination CPU's csd_data only when the CPU number is within
> > nr_cpu_ids. For invalid CPU numbers, fall back to the local CPU's csd_data
> > and let generic_exec_single() perform the existing validation and return
> > -ENXIO.
> >
> > Fixes: b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock diagnostics")
> > Signed-off-by: Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx>
> > Reviewed-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
>
> I hit a soft-lockup panic while running scftorture on a large x86_64 SMP
> machine with a kernel based on origin/sched/core. The issue appears to
> be a regression from:
>
> b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock
> diagnostics")
>
> The reproduction environment was:
>
> base: origin/sched/core
> machine: large x86_64 SMP system;
> CONFIG_CSD_LOCK_WAIT_DEBUG=y
>
> The workload was:
>
> modprobe scftorture nthreads=1024 onoff_holdoff=50 onoff_interval=100
>
> The visible panic was a secondary soft lockup. CPU 78 was exiting a
> process and got stuck waiting for CPU 81 to run a synchronous TLB
> shootdown callback:
>
> watchdog: BUG: soft lockup - CPU#78 stuck for 30s! [nc:13732]
> Kernel panic - not syncing: softlockup: hung tasks
>
> smp: csd: Detected non-responsive CSD lock (#118) on CPU#78,
> waiting 5000000032 ns for CPU#81 flush_tlb_func+0x0/0x290(...)
> smp: csd: Re-sending CSD lock (#118) IPI from CPU#78 to CPU#81
>
> CPU 78's stack was:
>
> csd_lock_wait_toolong
> smp_call_function_many_cond
> on_each_cpu_cond_mask
> flush_tlb_mm_range
> tlb_finish_mmu
> exit_mmap
> __mmput
> do_exit
> do_group_exit
> __x64_sys_exit_group
>
> CPU 81 was running a scftorture invoker and was already inside the
> call-function-single interrupt path:
>
> CPU: 81 PID: 13823 Comm: scftorture_invo
> RIP: __flush_smp_call_function_queue+0x70/0x4d0
> __sysvec_call_function_single
> sysvec_call_function_single
> asm_sysvec_call_function_single
> ktime_get_mono_fast_ns
> csd_lock_wait_toolong
> smp_call_function_single
> scftorture_invoker
>
> The dump showed CPU 81 scanning a corrupted async scftorture CSD:
>
> ff157dfa7fe72e40: ff157dfa7fe72e40 0051007000000001
> func = scf_handler_1 [scftorture]
>
> Decoding that CSD:
>
> u_flags = 0x1 -> CSD_FLAG_LOCK | CSD_TYPE_ASYNC
> src = 112
> dst = 81
>
> The first word is the llist next pointer, and it points back to the CSD
> itself:
>
> csd->node.llist.next == csd
>
> That self-loop explains why CPU 81 made no progress in
> __flush_smp_call_function_queue().
>
> There was later work still pending on CPU 81's live call_single_queue:
>
> call_single_queue[81].first = ff157dfa7fe79c00
>
> That queued entry was CPU 78's synchronous TLB callback:
>
> u_flags = 0x11 -> CSD_FLAG_LOCK | CSD_TYPE_SYNC
> src = 78
> dst = 81
> func = flush_tlb_func
>
> So CPU 78's TLB shootdown CSD was queued, but CPU 81 was already stuck
> in an earlier invocation of __flush_smp_call_function_queue().
>
> The root cause seems to be that b0473dcd4b1d changes the !wait path of
> smp_call_function_single() to use the destination CPU's csd_data when
> csdlock_debug_enabled is set:
>
> if (static_branch_unlikely(&csdlock_debug_enabled))
> return per_cpu_ptr(&csd_data, cpu);
> return this_cpu_ptr(&csd_data);
>
> The important invariant is that the internal smp-call CSDs are normally
> single-writer objects. With the old this_cpu_ptr(&csd_data) allocation,
> smp_call_function_single(..., wait = 0) used the current CPU's csd_data,
> so each sending CPU had its own CSD. smp_call_function() goes through
> smp_call_function_many_cond(), and that path also starts from the
> current CPU's cfd_data:
>
> cfd = this_cpu_ptr(&cfd_data);
> csd = per_cpu_ptr(cfd->csd, cpu);
>
> get_cpu() keeps preemption disabled while the CSD is prepared and
> queued, so another task on the same CPU cannot concurrently reuse that
> sender's cfd/csd storage either.
>
> This is why the non-atomic csd_lock() worked before: the target CPU
> could still observe and unlock the CSD from the previous request, but
> there was only one CPU that could prepare and enqueue that particular
> CSD object. With the debug-mode destination-CPU allocation, all CPUs
> sending async single calls to CPU 81 contend for the same object:
>
> csd_data[81] == ff157dfa7fe72e40
>
> However, csd_lock() is not an atomic test-and-set lock:
>
> csd_lock_wait(csd);
> csd->node.u_flags |= CSD_FLAG_LOCK;
>
> That is safe under the previous single-writer rule, but it is not safe
> once many remote CPUs can acquire the same destination CPU CSD. Two
> senders can both observe the lock clear, both set CSD_FLAG_LOCK,
> overwrite func/info/src/dst and then both enqueue the same llist node:
>
> llist_add(&csd->node.llist, &per_cpu(call_single_queue, 81))
>
> If the same node is added again while it is already the queue head,
> llist_add() writes the old head into new->next. Since old head == new,
> this produces:
>
> csd->node.llist.next = csd
>
> That exactly matches the dump.
>
> I see two possible fixes:
>
> 1. Revert b0473dcd4b1d. This restores the old source-CPU csd_data
> allocation and therefore restores the single-writer property, at the
> cost of losing the improved destination-CPU CSD lock diagnostics.
>
> 2. Keep the destination-CPU csd_data behavior for CSD lock debugging,
> but make csd_lock() an atomic test-and-set lock when
> csdlock_debug_enabled is set.

Thank you for the analysis and patch, and apologies for my confusion
with b0473dcd4b1d!

It looks plausible, but unfortunately, I was not able to find a version
to which this patch applies, even after fixing the whitespace breakage
noted below. Which might be another broken line or other whitespace
error that my eyes missed. But even with --ignore-whitespace, it still
does not apply.

Please also see below for a way to shrink this a bit.
If you update the patch and tell me what commit to apply it to, I will
continue testing and review.

> One possible implementation is:
>
> kernel/smp.c | 29 +++++++++++++++++++++++++++++
> 1 file changed, 29 insertions(+)
>
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 15799f842746..e1d2fd3c9d9f 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -362,6 +362,34 @@ static __always_inline void
> csd_lock_wait(call_single_data_t *csd)

The above two lines need to be one line.

> }
> #endif
>
> +#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
> +static __always_inline void csd_lock(call_single_data_t *csd)
> +{
> + if (static_branch_unlikely(&csdlock_debug_enabled)) {

If you make the above line instead read as follows:

if (IS_ENABLED(CONFIG_CSD_LOCK_WAIT_DEBUG) &&
static_branch_unlikely(&csdlock_debug_enabled)) {

Then you can get rid of the #ifdef and the other version of the
csd_lock() function.

> + unsigned int flags;
> +
> + for (;;) {
> + __csd_lock_wait(csd);
> + flags = READ_ONCE(csd->node.u_flags);
> + if (!(flags & CSD_FLAG_LOCK) &&
> + try_cmpxchg_acquire(&csd->node.u_flags, &flags,
> + flags | CSD_FLAG_LOCK))
> + break;
> + cpu_relax();
> + }
> + } else {
> + csd_lock_wait(csd);
> + csd->node.u_flags |= CSD_FLAG_LOCK;
> + }
> +
> + /*
> + * prevent CPU from reordering the above assignment
> + * to ->flags with any subsequent assignments to other
> + * fields of the specified call_single_data_t structure:
> + */
> + smp_wmb();
> +}
> +#else
> static __always_inline void csd_lock(call_single_data_t *csd)
> {
> csd_lock_wait(csd);
> @@ -374,6 +402,7 @@ static __always_inline void
> csd_lock(call_single_data_t *csd)

And again, the above two lines need to be one line.

Thanx, Paul

> */
> smp_wmb();
> }
> +#endif
>
> static __always_inline void csd_unlock(call_single_data_t *csd)
> {
> --