Re: [PATCH v6 00/12] Allow preemption during IPI completion waiting to improve real-time performance

From: Paul E. McKenney

Date: Thu May 28 2026 - 15:50:46 EST

On Thu, May 28, 2026 at 11:13:26PM +0800, Chuyi Zhou wrote:
> Changes in v6:
> - Make the task-local cpumask selection explicit and drop preemptible()
> check in smp_call_function_many_cond(). The early put_cpu() decision now
> depends only on whether a task-local cpumask is available.
> - Keep smp_task_ipi_mask() private to kernel/smp.c in [PATCH v6 4/12].
> - Add #include <linux/slab.h> to kernel/smp.c in [PATCH v6 4/12] for
> kmalloc()/kfree(), fixing the kernel test robot build failure reported
> at: https://lore.kernel.org/oe-kbuild-all/202605241101.w6T2LApw-lkp@xxxxxxxxx/
> - Update the csd_lock_wait() comment in [PATCH v6 6/12].
> - Add Sebastian's Reviewed-by tags to the reviewed patches.

For the series:

Tested-by: Paul E. McKenney <paulmck@xxxxxxxxxx>

> Changes in v5:
> - Replace "smp: Remove get_cpu from smp_call_function_any" with a new
> approach that extracts a common __smp_call_function_single() to safely
> keep the remote CPU selection and IPI dispatch process within a single
> preemption-disabled region in [PATCH v5 3/12].
> - Fix a typo in comments (s/cpumask_stack/task_mask/) and remove the
> obsolete "Preemption must be disabled" constraint from the kernel-doc
> in [PATCH v5 6/12].
> - Adjust the WARN_ON_ONCE() validation condition to avoid a false positive
> warning caused by CPU hotplug races when use_cpus_read_lock is false in
> [PATCH v5 9/12].
> - Move the preemptible() check in smp_call_function_many_cond() from
> [PATCH v5 4/12] to [PATCH v5 6/12].
> - Rebase to commit 4ac4d6549a65 ("sched: Use trace_call__<tp>() to save a
> static branch").
>
> Changes in v4:
> - Use task-local IPI cpumask rather than on-stack cpumask in
> [PATCH v4 4/12] (suggested by sebastian).
> - Skip freeing csd memory in smpcfd_dead_cpu() to guarantee safe access
> to csd memory, instead of using an RCU mechanism, in [PATCH v4 5/12]
> (suggested by sebastian).
> - Align flush_tlb_info with SMP_CACHE_BYTES to avoid performance
> degradation caused by unnecessary cache line movements in
> [PATCH v4 10/12] (suggested by sebastian and Nadav).
> - Collect Acked-bys and Reviewed-bys.
>
> Changes in v3:
> - Add benchmarks to measure the performance impact of changing
> flush_tlb_info to stack variable in [PATCH v3 10/12] (suggested by
> peter)
> - Adjust the rcu_read_unlock() location in [PATCH v3 5/12] (suggested
> by muchun)
> - Use raw_smp_processor_id() to prevent warning[1] from
> check_preemption_disabled() in [PATCH v3 12/12].
> - Collect Acked-bys and Reviewed-by.
>
> [1]: https://lore.kernel.org/lkml/20260302075216.2170675-1-zhouchuyi@xxxxxxxxxxxxx/T/#mc39999cbeb3f50be176f0903d0fa4075688b073d
>
> Changes in v2:
> - Simplify the code comments in [PATCH v2 2/12] (pointed out by peter
> and muchun)
> - Adjust the preemption disabling logic in smp_call_function_any() in
> [PATCH v2 3/12] (suggested by peter).
> - Use on-stack cpumask only when !CONFIG_CPUMASK_OFFSTACK in
> [PATCH v2 4/12] (pointed out by peter)
> - Add [PATCH v2 5/12] to replace migrate_disable with the rcu mechanism
> - Adjust the preemption disabling logic to allow flush_tlb_multi() to be
> preemptible and migratable in [PATCH v2 11/12]
> - Collect Acked-bys and Reviewed-bys
>
> Introduction
> ============
>
> The vast majority of smp_call_function*() callers block until remote CPUs
> complete the IPI function execution. As smp_call_function*() runs with
> preemption disabled throughout, scheduling latency increases dramatically
> with the number of remote CPUs and other factors (such as interrupts being
> disabled).
>
> On x86-64 architectures, TLB flushes are performed via IPIs; thus, during
> process exit or when process-mapped pages are reclaimed, numerous IPI
> operations must be awaited, leading to increased scheduling latency for
> other threads on the current CPU. In our production environment, we
> observed IPI wait-induced scheduling latency reaching up to 16ms on a
> 16-core machine. Our goal is to allow preemption during IPI completion
> waiting to improve real-time performance.
>
> Background
> ==========
>
> In our production environments, latency-sensitive workloads (DPDK) are
> configured with the highest priority so that they can preempt
> lower-priority tasks at any time. We found that DPDK's wake-up latency is
> primarily caused by the current CPU having preemption disabled. We
> therefore recorded the longest preemption-disabled period in each
> 30-second interval and then calculated the P50/P99 of these maxima:
>
>
> CPU      p50(ns)    p99(ns)
> cpu0      254956    5465050
> cpu1      115801     120782
> cpu2       43324      72957
> cpu3      256637   16723307
> cpu4       58979      87237
> cpu5       47464      79815
> cpu6       48881      81371
> cpu7       52263      82294
> cpu8      263555    4657713
> cpu9       44935      73962
> cpu10      37659      65026
> cpu11     257008    2706878
> cpu12      49669      90006
> cpu13      45186      74666
> cpu14      60705      83866
> cpu15      51311      86885
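
For readers who want to reproduce this kind of summary: the cover letter does
not say which percentile method was used, so the following is only a minimal
sketch assuming the nearest-rank definition. The name percentile_ns() is
hypothetical and does not come from the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort() comparator for uint64_t samples, ascending order. */
static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Nearest-rank percentile of per-interval maxima, in nanoseconds.
 * Hypothetical helper: sorts samples[] in place; pct must be in (0, 100].
 */
static uint64_t percentile_ns(uint64_t *samples, size_t n, unsigned int pct)
{
	size_t rank;

	assert(n > 0 && pct > 0 && pct <= 100);
	qsort(samples, n, sizeof(*samples), cmp_u64);

	/* 1-based rank = ceil(pct * n / 100) */
	rank = (pct * n + 99) / 100;
	return samples[rank - 1];
}
```

Each element of samples[] would be the longest preemption-disabled period seen
on one CPU during one 30-second interval; calling percentile_ns() with pct set
to 50 and 99 yields the two columns of the table.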
>
> We also collected the distribution of preemption-disabled events
> exceeding 1ms across different CPUs over several hours (CPUs whose
> counts were all zero are omitted):
>
> CPU      1~10ms   10~50ms   50~100ms
> cpu0         29         5          0
> cpu3         38        13          0
> cpu8         34         6          0
> cpu11        24        10          0
>
> Preemption-disabled periods lasting several milliseconds, or even more
> than 10ms, mostly originate from TLB flushes:
>
> @stack[
> trace_preempt_on+143
> trace_preempt_on+143
> preempt_count_sub+67
> arch_tlbbatch_flush/flush_tlb_mm_range
> task_exit/page_reclaim/...
> ]
>
> Further analysis confirms that the majority of the time is consumed in
> csd_lock_wait().
>
> Currently, smp_call*() always disables preemption, mainly to protect its
> internal per-CPU data structures and to synchronize with CPU offline
> operations. This patchset attempts to make csd_lock_wait() preemptible,
> thereby shortening the preemption-disabled critical section and improving
> kernel real-time performance.
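
The difference between the two orderings can be illustrated with a small
userspace model. This is a sketch only: every model_*() name is hypothetical,
the counter merely stands in for the preempt count, and the model ignores
exactly what the series has to solve in kernel/smp.c (csd memory lifetime,
the per-CPU cpumask, CPU hotplug). It is not the kernel implementation.

```c
#include <assert.h>

/* Userspace stand-ins; NOT the kernel's preempt_disable()/csd API. */
static int model_preempt_count;
static int model_wait_depth = -1;	/* depth recorded while "waiting" */

static void model_preempt_disable(void)
{
	model_preempt_count++;
}

static void model_preempt_enable(void)
{
	assert(model_preempt_count > 0);
	model_preempt_count--;
}

/* Stand-in for queueing the csd and sending the IPI (uses "per-CPU" data). */
static void model_send_ipi(void)
{
	assert(model_preempt_count > 0);
}

/* Stand-in for csd_lock_wait(): only records the depth at which we wait. */
static void model_csd_lock_wait(void)
{
	model_wait_depth = model_preempt_count;
}

/* Preemption stays disabled for the whole call, including the wait. */
static void model_call_wait_nonpreemptible(void)
{
	model_preempt_disable();
	model_send_ipi();
	model_csd_lock_wait();
	model_preempt_enable();
}

/* Preemption is re-enabled once the IPI is sent, before the wait. */
static void model_call_wait_preemptible(void)
{
	model_preempt_disable();
	model_send_ipi();
	model_preempt_enable();
	model_csd_lock_wait();
}
```

In the first variant the wait is observed at depth 1, so a higher-priority
task on the same CPU cannot run until every remote CPU has responded; in the
second it is observed at depth 0.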
>
> Effect
> ======
>
> After applying this patchset, we no longer observe preemption disabled for
> more than 1ms on the arch_tlbbatch_flush/flush_tlb_mm_range path. The
> overall P99 of the maximum preemption-disabled period in each 30-second
> interval is reduced to around 1.5ms (the remaining latency is primarily
> due to lock contention).
>
>            before patch   after patch   reduced by
>            ------------   -----------   ----------
> p99(ns)        16723307       1556034      ~90.70%
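
The "reduced by" figure follows directly from the two p99 values in the table.
A minimal sketch of the arithmetic, with reduction_pct() being a hypothetical
helper name used only for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Relative reduction in percent: (before - after) / before * 100. */
static double reduction_pct(uint64_t before_ns, uint64_t after_ns)
{
	assert(before_ns > 0 && after_ns <= before_ns);
	return (double)(before_ns - after_ns) * 100.0 / (double)before_ns;
}
```

With before_ns = 16723307 and after_ns = 1556034 this gives
15167273 / 16723307, which is about 90.70%.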
>
> Chuyi Zhou (12):
> smp: Disable preemption explicitly in __csd_lock_wait
> smp: Enable preemption early in smp_call_function_single
> smp: Refactor remote CPU selection in smp_call_function_any()
> smp: Use task-local IPI cpumask in smp_call_function_many_cond()
> smp: Alloc percpu csd data in smpcfd_prepare_cpu() only once
> smp: Enable preemption early in smp_call_function_many_cond
> smp: Remove preempt_disable from smp_call_function
> smp: Remove preempt_disable from on_each_cpu_cond_mask
> scftorture: Remove preempt_disable in scftorture_invoke_one
> x86/mm: Move flush_tlb_info back to the stack
> x86/mm: Enable preemption during native_flush_tlb_multi
> x86/mm: Enable preemption during flush_tlb_kernel_range
>
> arch/x86/include/asm/tlbflush.h | 8 +-
> arch/x86/kernel/kvm.c | 4 +-
> arch/x86/mm/tlb.c | 86 ++++++-----------
> include/linux/sched.h | 6 ++
> include/linux/smp.h | 15 +++
> kernel/fork.c | 9 +-
> kernel/scftorture.c | 13 +--
> kernel/smp.c | 161 ++++++++++++++++++++++++--------
> 8 files changed, 194 insertions(+), 108 deletions(-)
>
> --
> 2.20.1