Re: [PATCH v8 04/14] smp: Use task-local IPI cpumask in smp_call_function_many_cond()

From: Chuyi Zhou

Date: Fri Jun 26 2026 - 11:52:48 EST

On 2026-06-26 10:29 p.m., Thomas Gleixner wrote:
> On Tue, Jun 16 2026 at 19:11, Chuyi Zhou wrote:
>> This patch prepares the task-local IPI cpumask during thread creation, and
>> uses the local cpumask to replace the percpu cfd cpumask in
>> smp_call_function_many_cond(). We will enable preemption during
>> csd_lock_wait() later, and this can prevent concurrent access to the
>> cfd->cpumask from other tasks on the current CPU. For cases where
>> cpumask_size() is smaller than or equal to the pointer size, it tries to
>> stash the cpumask in the pointer itself to avoid extra memory allocations.
>
> This one fails the comprehensible test and also does not match the rules of
> how change logs should be written.
>
>> +#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPTION)
>> + union {
>> + cpumask_t *ipi_mask_ptr;
>> + unsigned long ipi_mask_val;
>
> Indentation of the variable name wants TABs not spaces
>
>> @@ -933,10 +934,14 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>> #endif
>> account_kernel_stack(tsk, 1);
>>
>> - err = scs_prepare(tsk, node);
>> + err = smp_task_ipi_mask_alloc(tsk);
>
> Hrm. So we unconditionally allocate another per task CPU mask. How many
> task actually utilize it?
>
> We keep making task_struct and the related things larger every other
> release without actually looking at the resulting overall memory
> consumption.
>

Thanks, this is a fair concern.

The task-local cpumask approach came from the earlier discussion with
Sebastian and Nadav. The problem we tried to solve there was the
lifetime of the wait mask once the later patch re-enables preemption
before csd_lock_wait(). At that point the wait mask can no longer be the
per-CPU cfd->cpumask: the task may be preempted or migrate while it is
still iterating the mask, and another task running on the original CPU
could enter smp_call_function_many_cond() and reuse that per-CPU mask.
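
As a rough userspace illustration of the reuse problem (a model, not the
kernel code): a single 64-bit word stands in for a cpumask, and
begin_call(), cpus_to_wait_for() and the two scenario functions are
hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model only: one 64-bit word stands in for a cpumask. */
typedef uint64_t mask_t;

/* Models the per-CPU cfd->cpumask shared by every task on one CPU. */
static mask_t percpu_mask;

/* Models a task that carries its own IPI mask. */
struct task {
	mask_t local_mask;
};

/* First half of a call: record which CPUs must be waited for. */
static void begin_call(mask_t *wait_mask, mask_t targets)
{
	*wait_mask = targets;
}

/* Second half, possibly after preemption: CPUs the caller waits for. */
static mask_t cpus_to_wait_for(const mask_t *wait_mask)
{
	return *wait_mask;
}

/* Task A starts a call, is preempted, and task B reuses the shared mask. */
static mask_t shared_mask_scenario(void)
{
	begin_call(&percpu_mask, 0x0f);	/* A targets CPUs 0-3 */
	begin_call(&percpu_mask, 0x30);	/* B, same CPU, targets CPUs 4-5 */
	return cpus_to_wait_for(&percpu_mask);	/* A resumes, sees B's mask */
}

/* Same interleaving, but each task owns its wait mask. */
static mask_t task_local_scenario(void)
{
	struct task a = { 0 }, b = { 0 };

	begin_call(&a.local_mask, 0x0f);
	begin_call(&b.local_mask, 0x30);
	return cpus_to_wait_for(&a.local_mask);	/* A still sees its own mask */
}
```

In the shared case task A ends up waiting on task B's targets; with a
task-local mask the interleaving is harmless.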

I agree that the memory cost needs to be called out explicitly. The
current implementation costs one cpumask per task, with no separate
allocation when cpumask_size() fits in a pointer. In exchange it gets a
stable mask lifetime and avoids adding allocation and failure handling
to the generic IPI path.
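
For reference, the pointer-stash idea can be modelled in userspace
roughly as below. This is a sketch under assumptions, not the patch
itself: struct task_model and the three helpers are hypothetical, the
mask_bytes parameter stands in for cpumask_size(), and calloc()/free()
stand in for the kernel allocator. Only the ipi_mask_ptr/ipi_mask_val
union members come from the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the relevant part of task_struct. */
struct task_model {
	union {
		unsigned long	*ipi_mask_ptr;	/* used when the mask is large */
		unsigned long	ipi_mask_val;	/* used when the mask fits */
	};
};

/* Returns 0 on success, -1 on allocation failure. */
static int task_ipi_mask_alloc(struct task_model *t, size_t mask_bytes)
{
	if (mask_bytes <= sizeof(t->ipi_mask_ptr)) {
		/* Small mask: keep the bits in the pointer slot itself. */
		t->ipi_mask_val = 0;
		return 0;
	}
	/* Large mask: a separate zeroed allocation per task. */
	t->ipi_mask_ptr = calloc(1, mask_bytes);
	return t->ipi_mask_ptr ? 0 : -1;
}

/* Returns the storage that holds the task's mask bits. */
static unsigned long *task_ipi_mask(struct task_model *t, size_t mask_bytes)
{
	if (mask_bytes <= sizeof(t->ipi_mask_ptr))
		return &t->ipi_mask_val;
	return t->ipi_mask_ptr;
}

/* Releases the separate allocation, if there was one. */
static void task_ipi_mask_free(struct task_model *t, size_t mask_bytes)
{
	if (mask_bytes > sizeof(t->ipi_mask_ptr))
		free(t->ipi_mask_ptr);
}
```

In this model the extra per-task allocation only exists when the mask is
wider than a pointer; otherwise the cost is the one pointer-sized field.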

I considered avoiding the fork-time allocation, but the alternatives do
not look straightforward:

- stack storage is not suitable for large NR_CPUS/CPUMASK_OFFSTACK
configurations;

- per-CPU storage is exactly what becomes unsafe once the wait is made
preemptible;

- allocating the mask in smp_call_function_many_cond() would put an
allocation in the generic IPI path. It also cannot rely on a sleeping
allocation because this function is entered from contexts which have
historically only required preemption to be disabled. Using GFP_ATOMIC
would need a failure/fallback path, in which case the latency
improvement becomes opportunistic rather than guaranteed.
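
The last point can be sketched as a userspace model. Everything here is
hypothetical and only illustrates the shape of the fallback: struct
wait_plan, plan_wait() and the two allocator stubs are made-up names,
and the try_alloc callback stands in for a non-sleeping allocation that
may fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Models the per-CPU cfd->cpumask, safe only with preemption disabled. */
static unsigned long percpu_mask_storage;

/* Outcome of choosing where the wait mask lives for one call. */
struct wait_plan {
	unsigned long	*mask;
	bool		preemptible_wait;	/* may preemption be enabled? */
	bool		mask_allocated;		/* must the caller free mask? */
};

/* try_alloc models an atomic allocation: it never sleeps, may return NULL. */
static struct wait_plan plan_wait(void *(*try_alloc)(size_t), size_t mask_bytes)
{
	struct wait_plan plan;
	unsigned long *m = try_alloc(mask_bytes);

	if (m) {
		/* Private mask: the wait can run with preemption enabled. */
		plan.mask = m;
		plan.preemptible_wait = true;
		plan.mask_allocated = true;
	} else {
		/* Fallback: shared per-CPU mask, wait stays non-preemptible. */
		plan.mask = &percpu_mask_storage;
		plan.preemptible_wait = false;
		plan.mask_allocated = false;
	}
	return plan;
}

/* Allocator stub that succeeds. */
static void *alloc_ok(size_t n)
{
	return calloc(1, n);
}

/* Allocator stub that always fails, as under memory pressure. */
static void *alloc_fail(size_t n)
{
	(void)n;
	return NULL;
}
```

Whenever the allocation fails, the call falls back to the existing
non-preemptible wait, which is why the improvement would only be
opportunistic.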

For the motivating x86 TLB flush paths, the users are also not a small
static set of tasks. Ordinary tasks can hit this through exit, unmap,
reclaim, etc., so I do not see a clean way to allocate this only for a
pre-identifiable subset of tasks.