Re: [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes

From: Nadav Amit
Date: Mon Jan 20 2025 - 15:05:07 EST




> On 20 Jan 2025, at 18:09, Rik van Riel <riel@xxxxxxxxxxx> wrote:
>
> On Mon, 2025-01-20 at 16:02 +0200, Nadav Amit wrote:
>>
>>
>> On 20/01/2025 4:40, Rik van Riel wrote:
>>>
>>> +static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
>>> +{
>>> + VM_WARN_ON_ONCE(1);
>>
>> Not sure why not use VM_WARN_ONCE() instead, with a more informative
>> message (a string is allocated for it anyhow).
>>
> VM_WARN_ON_ONCE only has a condition, not a message.

Right, my bad.
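
For the record, the distinction (roughly, from include/linux/mmdebug.h with
CONFIG_DEBUG_VM; exact definitions vary by config):

        VM_WARN_ON_ONCE(cond);                  /* condition only, no custom message */
        VM_WARN_ONCE(cond, "fmt", ...);         /* condition plus a printf-style message */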

>
>>> + /* Slower check to make sure. */
>>> + for_each_cpu(cpu, mm_cpumask(mm)) {
>>> + /* Skip the CPUs that aren't really running this process. */
>>> + if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
>>> + continue;
>>
>> Then perhaps at least add a comment next to loaded_mm, saying that it's
>> not private per se, but is rarely accessed by other cores?
>>
> I don't see any comment in struct tlb_state that
> suggests it was ever private to begin with.
>
> Which comment are you referring to that should
> be edited?

You can see there is a tlb_state_shared, so one assumes tlb_state is
private... (at least that was my intention when separating them).
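
Something along these lines next to loaded_mm would be enough in my view
(wording is just a suggestion):

        /*
         * loaded_mm is normally only accessed by the owning CPU, but the
         * broadcast-ASID code may peek at it (and at loaded_mm_asid) from
         * other CPUs, so it is not strictly CPU-private.
         */
        struct mm_struct *loaded_mm;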

>
>>>
>>> + WRITE_ONCE(mm->context.asid_transition, true);
>>> + WRITE_ONCE(mm->context.global_asid, get_global_asid());
>>
>> I know it is likely correct in practice (due to the TSO memory model),
>> but it is not clear, at least to me, how the order of those writes
>> affects the rest of the code. I managed to figure out how it relates to
>> the reads in flush_tlb_mm_range() and native_flush_tlb_multi(), but I
>> wouldn't say it is trivial, and it is worth a comment (or
>> smp_wmb()/smp_rmb()).
>>
>
> What kind of wording should we add here to make it
> easier to understand?
>
> "The TLB invalidation code reads these variables in
> the opposite order in which they are written" ?

Usually in such cases, you make a reference to wherever there are readers
that rely on the ordering. This is how documenting smp_wmb()/smp_rmb()
ordering is usually done.
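
Something like this is the kind of thing I mean (just a sketch; on x86 the
smp_wmb() is basically free, and even as a comment pair alone it would help):

        /* use_global_asid(), or wherever the transition starts */
        WRITE_ONCE(mm->context.asid_transition, true);
        /*
         * Pairs with the readers (flush_tlb_mm_range() and friends) that
         * check global_asid first and asid_transition second: anyone who
         * sees the new global_asid must also see asid_transition == true,
         * otherwise a flusher could treat the mm as fully transitioned and
         * skip the IPIs for CPUs still running on a dynamic ASID.
         */
        smp_wmb();
        WRITE_ONCE(mm->context.global_asid, get_global_asid());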

>
>
>>>
>>> + /*
>>> + * If at least one CPU is not using the global ASID yet,
>>> + * send a TLB flush IPI. The IPI should cause stragglers
>>> + * to transition soon.
>>> + *
>>> + * This can race with the CPU switching to another task;
>>> + * that results in a (harmless) extra IPI.
>>> + */
>>> + if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
>>> + flush_tlb_multi(mm_cpumask(info->mm), info);
>>> + return;
>>> + }
>>
>> I am trying to figure out why we return here. The transition might not
>> be over? Why is it "soon"? Wouldn't flush_tlb_func() reload it
>> unconditionally?
>
> The transition _should_ be over, but what if another
> CPU got an NMI while in the middle of switch_mm_irqs_off,
> and set its own bit in the mm_cpumask after we send this
> IPI?
>
> On the other hand, if it sets its mm_cpumask bit after
> this point, it will also load the mm->context.global_asid
> after this point, and should definitely get the new ASID.
>
> I think we are probably fine to set asid_transition to
> false here, but I've had to tweak this code so much over
> the past months that I don't feel super confident any more :)

I fully relate, but I am not sure it is that great. The problem
is that nobody would have the guts to change that code later...
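
If the reasoning above is what keeps asid_transition set, maybe capture it in
a comment so the next person does not have to re-derive it. Roughly (my
wording, please correct it if I misread the race):

        /*
         * A CPU can set its mm_cpumask bit (in switch_mm_irqs_off) after we
         * sampled the mask and sent the IPI, so it misses the flush.  But a
         * CPU that sets its bit after this point also reads
         * mm->context.global_asid after this point, so it picks up the new
         * global ASID on its own; the IPI is only needed for stragglers
         * still running on a dynamic ASID.
         */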

>>
>> And I guess this is now a bug, if info->stride_shift > PMD_SHIFT...
>>
> We set maxnr to 1 for larger stride shifts at the top of the function:
>

You’re right, all safe.

>
>>> + goto reload_tlb;
>>
>> Not a fan of gotos when they are not really needed, and I don't think
>> one is really needed here. Especially since the name of the label
>> "reload_tlb" does not really convey that the page tables are reloaded
>> at that point.
>
> In this particular case, the CPU continues running with
> the same page tables, but with a different PCID.

I understand it is “reload_tlb” from your point of view, or from the
point of view of the code that does the “goto”, but if I showed you only
the code that follows “reload_tlb”, I’m not sure you’d recognize it as
such.
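
Even just a comment above the label, along the lines of what you wrote here,
would help (hypothetical wording):

        reload_tlb:
                /*
                 * Same page tables, new (global) PCID: the reload below makes
                 * the TLB fetch translations fresh from the page tables rather
                 * than reuse entries tagged with the old dynamic ASID.
                 */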

[ snip, taking your valid points ]

>>
>>> + loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
>>> + }
>>> +
>>> + /* Broadcast ASIDs are always kept up to date with INVLPGB. */
>>> + if (is_global_asid(loaded_mm_asid))
>>> + return;
>>
>> The comment does not make it clear to me, and I can't clearly explain to
>> myself, why it is guaranteed that all the IPI TLB flushes that were
>> potentially issued before the transition are not needed.
>>
> IPI TLB flushes that were issued before the transition went
> to the CPUs when they were using dynamic ASIDs (numbers 1-5).
>
> Reloading the TLB with a different PCID, even pointed at the
> same page tables, means that the TLB should load the
> translations fresh from the page tables, and not re-use any
> that it had previously loaded under a different PCID.
>

What about this scenario, for instance?

(1) CPU0: use_global_asid(mm):
        mm->context.asid_trans = T;
        mm->context.global_asid = G;

(2) CPU1: switch_mm(..., next=mm):
        Observes global_asid = G
        => loads CR3 with PCID=G
        => fills TLB under G.
        TLB caches PTE[G, V] = P (for some reason)

(3) CPU2: flush_tlb_mm_range(mm):
        Sees global_asid == 0 (stale/old value)
        => flush_tlb_multi()
        => IPI flush for dynamic ASIDs.

(4) CPU1: IPI arrives:
        flush_tlb_func(...):
        is_global_asid(G)? yes, skip invalidate;
        broadcast flush assumed to cover it.

(5) CPU2: IPI completes:
        Dynamic ASIDs are flushed, but CPU1's
        global ASID was never invalidated!

(6) CPU1: uses stale TLB entries under ASID G.
        TLB continues to use PTE[G, V] = P, as it
        was not invalidated.