Re: [PATCH v8 04/14] smp: Use task-local IPI cpumask in smp_call_function_many_cond()

From: Chuyi Zhou

Date: Fri Jun 26 2026 - 20:52:57 EST

On 2026-06-27 3:07 a.m., Thomas Gleixner wrote:
> On Fri, Jun 26 2026 at 23:47, Chuyi Zhou wrote:
>> On 2026-06-26 10:29 p.m., Thomas Gleixner wrote:
>>>> - err = scs_prepare(tsk, node);
>>>> + err = smp_task_ipi_mask_alloc(tsk);
>>>
>>> Hrm. So we unconditionally allocate another per task CPU mask. How many
>>> tasks actually utilize it?
>>>
>>> We keep making task_struct and the related things larger every other
>>> release without actually looking at the resulting overall memory
>>> consumption.
>>>
>>
>> Thanks, this is a fair concern.
>>
>> The task-local cpumask approach came from the earlier discussion with
>> Sebastian and Nadav. The problem we tried to solve there was the
>> lifetime of the wait mask once the later patch re-enables preemption
>> before csd_lock_wait(). At that point the wait mask can no longer be the
>> per-CPU cfd->cpumask: the task may be preempted or migrate while it is
>> still iterating the mask, and another task running on the original CPU
>> could enter smp_call_function_many_cond() and reuse that per-CPU mask.
>>
>> I agree that the memory cost needs to be called out explicitly. The
>> current implementation trades one task-local cpumask for a stable mask
>> lifetime and avoids adding allocation/failure handling to the generic
>> IPI path.
>>
>> I considered avoiding the fork-time allocation, but the alternatives do
>> not look straightforward:
>>
>> - stack storage is not suitable for large NR_CPUS/CPUMASK_OFFSTACK
>> configurations;
>>
>> - per-CPU storage is exactly what becomes unsafe once the wait is made
>> preemptible;
>>
>> - allocating the mask in smp_call_function_many_cond() would put an
>> allocation in the generic IPI path. It also cannot rely on a sleeping
>> allocation because this function is entered from contexts which have
>> historically only required preemption to be disabled. Using GFP_ATOMIC
>> would need a failure/fallback path, in which case the latency
>> improvement becomes opportunistic rather than guaranteed.
>>
>> For the motivating x86 TLB flush paths, the users are also not a small
>> static set of tasks. Ordinary tasks can hit this through exit, unmap,
>> reclaim, etc., so I do not see a clean way to allocate this only for a
>> pre-identifiable subset of tasks.
>
> I understand that, but this all wants to be spelled out in the change
> log and explained.

Understood. Thanks for going through the series and for the detailed review.

I will fold this into the changelog and spell out:

- why the wait mask needs task-local lifetime once csd_lock_wait()
becomes preemptible;

- why per-CPU, stack, and in-call allocation are not good fits here;

- why this is not limited to a small, pre-identifiable set of tasks:
on x86, ordinary tasks can hit smp_call_function_many_cond() through
TLB flush paths such as exit, unmap and reclaim;

- the memory cost, including the inline case for small CPU counts and
the cpumask_size() allocation on larger systems.
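
For a rough sense of the last point, here is a userspace sketch of the
arithmetic (not kernel code). It assumes the mask is sized as a bitmap
rounded up to whole longs, which is my understanding of what
cpumask_size() amounts to, and it ignores allocator overhead. The
sketch_* names are made up for this illustration:

```c
#include <limits.h>
#include <stddef.h>

/* Bits per long on the build host; stands in for the kernel's BITS_PER_LONG. */
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Bytes needed for a bitmap of nr_cpus bits, rounded up to whole longs.
 * Assumption: this mirrors how a cpumask is sized; it is not the kernel helper.
 */
static size_t sketch_cpumask_bytes(unsigned int nr_cpus)
{
	size_t longs = (nr_cpus + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;

	return longs * sizeof(unsigned long);
}

/* Aggregate cost when each of nr_tasks tasks carries one such mask. */
static size_t sketch_total_bytes(unsigned int nr_cpus, size_t nr_tasks)
{
	return sketch_cpumask_bytes(nr_cpus) * nr_tasks;
}
```

Under those assumptions, a mask covering 8192 CPUs is 1024 bytes per
task, so 10000 tasks would add roughly 10 MB, while a mask covering a
single long's worth of CPUs costs only sizeof(unsigned long) per task.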

I will also address your comments on the other patches in the next version.

Thanks