Re: [RESEND PATCH] smp: Avoid invalid per-CPU CSD lookup with CSD lock debug

From: Chuyi Zhou

Date: Thu Jun 04 2026 - 20:26:14 EST

On 2026-06-05 1:46 a.m., Paul E. McKenney wrote:
> On Wed, Jun 03, 2026 at 09:21:10PM +0800, Chuyi Zhou wrote:
>> Hi Paul,
>>
>> On 2026-05-23 12:27 p.m., Chuyi Zhou wrote:
>>> Commit b0473dcd4b1d ("smp: Improve smp_call_function_single()
>>> CSD-lock diagnostics") made smp_call_function_single() use the destination
>>> CPU's csd_data when CSD lock debugging is enabled. That lets the debug code
>>> associate a stuck CSD lock with the target CPU, but it also means the CPU
>>> argument is used in per_cpu_ptr() before generic_exec_single() has a chance
>>> to validate it.
>>>
>>> This becomes unsafe when smp_call_function_any() cannot find an online CPU
>>> in the supplied mask. In that case the selected CPU can be nr_cpu_ids, and
>>> the !wait path calls get_single_csd_data(cpu) before generic_exec_single()
>>> returns -ENXIO. With csdlock_debug_enabled set, that indexes the per-CPU
>>> offset array with an invalid CPU number.
>>>
>>> Use the destination CPU's csd_data only when the CPU number is within
>>> nr_cpu_ids. For invalid CPU numbers, fall back to the local CPU's csd_data
>>> and let generic_exec_single() perform the existing validation and return
>>> -ENXIO.
>>>
>>> Fixes: b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock diagnostics")
>>> Signed-off-by: Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx>
>>> Reviewed-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
>>
>> I hit a soft-lockup panic while running scftorture on a large x86_64 SMP
>> machine with a kernel based on origin/sched/core. The issue appears to
>> be a regression from:
>>
>> b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock
>> diagnostics")
>>
>> The reproduction environment was:
>>
>> base: origin/sched/core
>> machine: large x86_64 SMP system;
>> CONFIG_CSD_LOCK_WAIT_DEBUG=y
>>
>> The workload was:
>>
>> modprobe scftorture nthreads=1024 onoff_holdoff=50 onoff_interval=100
>>
>> The visible panic was a secondary soft lockup. CPU 78 was exiting a
>> process and got stuck waiting for CPU 81 to run a synchronous TLB
>> shootdown callback:
>>
>> watchdog: BUG: soft lockup - CPU#78 stuck for 30s! [nc:13732]
>> Kernel panic - not syncing: softlockup: hung tasks
>>
>> smp: csd: Detected non-responsive CSD lock (#118) on CPU#78,
>> waiting 5000000032 ns for CPU#81 flush_tlb_func+0x0/0x290(...)
>> smp: csd: Re-sending CSD lock (#118) IPI from CPU#78 to CPU#81
>>
>> CPU 78's stack was:
>>
>> csd_lock_wait_toolong
>> smp_call_function_many_cond
>> on_each_cpu_cond_mask
>> flush_tlb_mm_range
>> tlb_finish_mmu
>> exit_mmap
>> __mmput
>> do_exit
>> do_group_exit
>> __x64_sys_exit_group
>>
>> CPU 81 was running a scftorture invoker and was already inside the
>> call-function-single interrupt path:
>>
>> CPU: 81 PID: 13823 Comm: scftorture_invo
>> RIP: __flush_smp_call_function_queue+0x70/0x4d0
>> __sysvec_call_function_single
>> sysvec_call_function_single
>> asm_sysvec_call_function_single
>> ktime_get_mono_fast_ns
>> csd_lock_wait_toolong
>> smp_call_function_single
>> scftorture_invoker
>>
>> The dump showed CPU 81 scanning a corrupted async scftorture CSD:
>>
>> ff157dfa7fe72e40: ff157dfa7fe72e40 0051007000000001
>> func = scf_handler_1 [scftorture]
>>
>> Decoding that CSD:
>>
>> u_flags = 0x1 -> CSD_FLAG_LOCK | CSD_TYPE_ASYNC
>> src = 112
>> dst = 81
>>
>> The first word is the llist next pointer, and it points back to the CSD
>> itself:
>>
>> csd->node.llist.next == csd
>>
>> That self-loop explains why CPU 81 made no progress in
>> __flush_smp_call_function_queue().
>>
>> There was later work still pending on CPU 81's live call_single_queue:
>>
>> call_single_queue[81].first = ff157dfa7fe79c00
>>
>> That queued entry was CPU 78's synchronous TLB callback:
>>
>> u_flags = 0x11 -> CSD_FLAG_LOCK | CSD_TYPE_SYNC
>> src = 78
>> dst = 81
>> func = flush_tlb_func
>>
>> So CPU 78's TLB shootdown CSD was queued, but CPU 81 was already stuck
>> in an earlier invocation of __flush_smp_call_function_queue().
>>
>> The root cause seems to be that b0473dcd4b1d changes the !wait path of
>> smp_call_function_single() to use the destination CPU's csd_data when
>> csdlock_debug_enabled is set:
>>
>> if (static_branch_unlikely(&csdlock_debug_enabled))
>> return per_cpu_ptr(&csd_data, cpu);
>> return this_cpu_ptr(&csd_data);
>>
>> The important invariant is that the internal smp-call CSDs are normally
>> single-writer objects. With the old this_cpu_ptr(&csd_data) allocation,
>> smp_call_function_single(..., wait = 0) used the current CPU's csd_data,
>> so each sending CPU had its own CSD. smp_call_function() goes through
>> smp_call_function_many_cond(), and that path also starts from the
>> current CPU's cfd_data:
>>
>> cfd = this_cpu_ptr(&cfd_data);
>> csd = per_cpu_ptr(cfd->csd, cpu);
>>
>> get_cpu() keeps preemption disabled while the CSD is prepared and
>> queued, so another task on the same CPU cannot concurrently reuse that
>> sender's cfd/csd storage either.
>>
>> This is why the non-atomic csd_lock() worked before: the target CPU
>> could still observe and unlock the CSD from the previous request, but
>> there was only one CPU that could prepare and enqueue that particular
>> CSD object. With the debug-mode destination-CPU allocation, all CPUs
>> sending async single calls to CPU 81 contend for the same object:
>>
>> csd_data[81] == ff157dfa7fe72e40
>>
>> However, csd_lock() is not an atomic test-and-set lock:
>>
>> csd_lock_wait(csd);
>> csd->node.u_flags |= CSD_FLAG_LOCK;
>>
>> That is safe under the previous single-writer rule, but it is not safe
>> once many remote CPUs can acquire the same destination CPU CSD. Two
>> senders can both observe the lock clear, both set CSD_FLAG_LOCK,
>> overwrite func/info/src/dst and then both enqueue the same llist node:
>>
>> llist_add(&csd->node.llist, &per_cpu(call_single_queue, 81))
>>
>> If the same node is added again while it is already the queue head,
>> llist_add() writes the old head into new->next. Since old head == new,
>> this produces:
>>
>> csd->node.llist.next = csd
>>
>> That exactly matches the dump.
>>
>> I see two possible fixes:
>>
>> 1. Revert b0473dcd4b1d. This restores the old source-CPU csd_data
>> allocation and therefore restores the single-writer property, at the
>> cost of losing the improved destination-CPU CSD lock diagnostics.
>>
>> 2. Keep the destination-CPU csd_data behavior for CSD lock debugging,
>> but make csd_lock() an atomic test-and-set lock when
>> csdlock_debug_enabled is set.
>
> Thank you for the analysis and patch, and apologies for my confusion
> with b0473dcd4b1d!
>
> It looks plausible, but unfortunately, I was not able to find a version
> to which this patch applies, even after fixing the whitespace breakage
> noted below. Which might be another broken line or other whitespace
> error that my eyes missed. But even with --ignore-whitespace, it still
> does not apply.
>
> Please also see below for a way to shrink this a bit.
> If you update the patch and tell me what commit to apply it to, I will
> continue testing and review.
>
>

Thank you, Paul.

Sorry for the broken inline patch. It was meant mostly to confirm the
root cause and the possible fix direction, but I should have made that
clearer.

I will send a proper patch series based on tip/sched/core next week, and
I will include the exact base commit in the cover letter. The series
will have two patches:

1. smp: Avoid invalid per-CPU CSD lookup with CSD lock debug

2. smp: Make csd_lock() atomic for CSD lock debug destination CSDs

This fixes the multi-writer case when CSD lock debugging makes
smp_call_function_single(..., wait = 0) use the destination CPU's
csd_data.

Thanks for looking at this and for the suggestion!