Re: [PATCH v8 04/14] smp: Use task-local IPI cpumask in smp_call_function_many_cond()

From: Chuyi Zhou

Date: Fri Jun 26 2026 - 12:08:15 EST

On 2026-06-26 11:47 p.m., Chuyi Zhou wrote:
> On 2026-06-26 10:29 p.m., Thomas Gleixner wrote:
>> On Tue, Jun 16 2026 at 19:11, Chuyi Zhou wrote:
>>> This patch prepares the task-local IPI cpumask during thread creation, and
>>> uses the local cpumask to replace the percpu cfd cpumask in
>>> smp_call_function_many_cond(). We will enable preemption during
>>> csd_lock_wait() later, and this can prevent concurrent access to the
>>> cfd->cpumask from other tasks on the current CPU. For cases where
>>> cpumask_size() is smaller than or equal to the pointer size, it tries to
>>> stash the cpumask in the pointer itself to avoid extra memory allocations.
>>
>> This one fails the comprehensible test and also does not match the rules of
>> how change logs should be written.
>>
>>> +#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPTION)
>>> + union {
>>> + cpumask_t *ipi_mask_ptr;
>>> + unsigned long ipi_mask_val;
>>
>> Indentation of the variable name wants TABs not spaces
>>
>>> @@ -933,10 +934,14 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>>> #endif
>>> account_kernel_stack(tsk, 1);
>>>
>>> - err = scs_prepare(tsk, node);
>>> + err = smp_task_ipi_mask_alloc(tsk);
>>
>> Hrm. So we unconditionally allocate another per task CPU mask. How many
>> task actually utilize it?
>>
>> We keep making task_struct and the related things larger every other
>> release without actually looking at the resulting overall memory
>> consumption.
>>
>
> Thanks, this is a fair concern.
>
> The task-local cpumask approach came from the earlier discussion with
> Sebastian and Nadav. The problem we tried to solve there was the
> lifetime of the wait mask once the later patch re-enables preemption
> before csd_lock_wait(). At that point the wait mask can no longer be the
> per-CPU cfd->cpumask: the task may be preempted or migrate while it is
> still iterating the mask, and another task running on the original CPU
> could enter smp_call_function_many_cond() and reuse that per-CPU mask.
>
> I agree that the memory cost needs to be called out explicitly. The
> current implementation trades one task-local cpumask for a stable mask
> lifetime and avoids adding allocation/failure handling to the generic
> IPI path.
>
> I considered avoiding the fork-time allocation, but the alternatives do
> not look straightforward:
>
> - stack storage is not suitable for large NR_CPUS/CPUMASK_OFFSTACK
> configurations;
>
> - per-CPU storage is exactly what becomes unsafe once the wait is made
> preemptible;
>
> - allocating the mask in smp_call_function_many_cond() would put an
> allocation in the generic IPI path. It also cannot rely on a sleeping
> allocation because this function is entered from contexts which have
> historically only required preemption to be disabled. Using GFP_ATOMIC
> would need a failure/fallback path, in which case the latency
> improvement becomes opportunistic rather than guaranteed.
>
> For the motivating x86 TLB flush paths, the users are also not a small
> static set of tasks. Ordinary tasks can hit this through exit, unmap,
> reclaim, etc., so I do not see a clean way to allocate this only for a
> pre-identifiable subset of tasks.

To put some numbers around the memory side:

On my current x86-64 build, task_struct is 3264 bytes. The patch adds
one word to task_struct. With NR_CPUS <= 64 on 64-bit, cpumask_size()
fits in that word, so there is no separate allocation.

The worst case is the large CPU configuration. With NR_CPUS=8192,
cpumask_size() is 1024 bytes, so this becomes one extra 1KiB allocation
per task, plus the word in task_struct. That is a real cost, especially
on systems with many tasks, and I should document it explicitly.

This was also part of the earlier discussion with Sebastian and Nadav.
Embedding a plain cpumask_t would have made task_struct grow by 1KiB in
that configuration, so the approach here was to keep only one word in
task_struct, store the mask inline when it fits, and otherwise allocate
only cpumask_size() while creating the task.

For comparison, x86 already has several KiB of per-task arch/FPU state.
On my current build, struct fpu is 4224 bytes. That does not make the
extra cpumask free, but it puts the 8192-CPU worst case in context.