Re: [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes

From: Rik van Riel
Date: Mon Jan 20 2025 - 11:10:31 EST


On Mon, 2025-01-20 at 16:02 +0200, Nadav Amit wrote:
>
>
> On 20/01/2025 4:40, Rik van Riel wrote:
> >
> > +static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
> > +{
> > +	VM_WARN_ON_ONCE(1);
>
> Not sure why not just use VM_WARN_ONCE() instead, with some more
> informative message (anyhow, a string is allocated for it).
>
VM_WARN_ON_ONCE only has a condition, not a message.

> >
> > +static u16 get_global_asid(void)
> > +{
> > + lockdep_assert_held(&global_asid_lock);
> > +
> > +	do {
> > +		u16 start = last_global_asid;
> > +		u16 asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
> > +
> > +		if (asid >= MAX_ASID_AVAILABLE) {
> > +			reset_global_asid_space();
> > +			continue;
> > +		}
>
> I think that unless something is awfully wrong, you are supposed to
> call reset_global_asid_space() at most once. So if that's the case,
> why not do it this way?
>
> Instead, you can get rid of the loop and just do:
>
> 	asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
>
> If you want, you can warn if asid >= MAX_ASID_AVAILABLE and have some
> fallback. But the loop is just confusing, in my opinion, for no reason.

I can get rid of the loop. You're right that the code
can just call find_next_zero_bit after calling
reset_global_asid_space.

>
> > +	/* Slower check to make sure. */
> > +	for_each_cpu(cpu, mm_cpumask(mm)) {
> > +		/* Skip the CPUs that aren't really running this process. */
> > +		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
> > +			continue;
>
> Then perhaps at least add a comment next to loaded_mm, noting that it's
> not private per se, but is rarely accessed by other cores?
>
I don't see any comment in struct tlb_state that
suggests it was ever private to begin with.

Which comment are you referring to that should
be edited?

> >
> > +
> > +	/*
> > +	 * The transition from IPI TLB flushing, with a dynamic ASID,
> > +	 * and broadcast TLB flushing, using a global ASID, uses memory
> > +	 * ordering for synchronization.
> > +	 *
> > +	 * While the process has threads still using a dynamic ASID,
> > +	 * TLB invalidation IPIs continue to get sent.
> > +	 *
> > +	 * This code sets asid_transition first, before assigning the
> > +	 * global ASID.
> > +	 *
> > +	 * The TLB flush code will only verify the ASID transition
> > +	 * after it has seen the new global ASID for the process.
> > +	 */
> > +	WRITE_ONCE(mm->context.asid_transition, true);
> > +	WRITE_ONCE(mm->context.global_asid, get_global_asid());
>
> I know it is likely correct in practice (due to the TSO memory model),
> but it is not clear, at least to me, how the order of those writes
> affects the rest of the code. I managed to figure out how it relates
> to the reads in flush_tlb_mm_range() and native_flush_tlb_multi(), but
> I wouldn't say it is trivial, and it is worth a comment (or an
> smp_wmb()/smp_rmb() pair).
>

What kind of wording should we add here to make it
easier to understand?

"The TLB invalidation code reads these variables in
the opposite order in which they are written" ?


> > +		/*
> > +		 * If at least one CPU is not using the global ASID yet,
> > +		 * send a TLB flush IPI. The IPI should cause stragglers
> > +		 * to transition soon.
> > +		 *
> > +		 * This can race with the CPU switching to another task;
> > +		 * that results in a (harmless) extra IPI.
> > +		 */
> > +		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
> > +			flush_tlb_multi(mm_cpumask(info->mm), info);
> > +			return;
>
> I am trying to figure out why we return here. The transition might
> not
> be over? Why is it "soon"? Wouldn't flush_tlb_func() reload it
> unconditionally?

The transition _should_ be over, but what if another
CPU got an NMI while in the middle of switch_mm_irqs_off,
and set its own bit in the mm_cpumask after we send this
IPI?

On the other hand, if it sets its mm_cpumask bit after
this point, it will also load the mm->context.global_asid
after this point, and should definitely get the new ASID.

I think we are probably fine to set asid_transition to
false here, but I've had to tweak this code so much over
the past months that I don't feel super confident any more :)

>
> > +	/*
> > +	 * TLB flushes with INVLPGB are kicked off asynchronously.
> > +	 * The inc_mm_tlb_gen() guarantees page table updates are done
> > +	 * before these TLB flushes happen.
> > +	 */
> > +	if (info->end == TLB_FLUSH_ALL) {
> > +		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
> > +		/* Do any CPUs supporting INVLPGB need PTI? */
> > +		if (static_cpu_has(X86_FEATURE_PTI))
> > +			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
> > +	} else for (; addr < info->end; addr += nr << info->stride_shift) {
>
> I guess I was wrong, and do-while was cleaner here.
>
> And I guess this is now a bug, if info->stride_shift > PMD_SHIFT...
>
We set maxnr to 1 for larger stride shifts at the top of the function:

	/* Flushing multiple pages at once is not supported with 1GB pages. */
	if (info->stride_shift > PMD_SHIFT)
		maxnr = 1;

> [ I guess the cleanest way was to change get_flush_tlb_info to mask the
> low bits of start and end based on ((1ull << stride_shift) - 1). But
> whatever... ]

I'll change it back :)

I'm just happy this code is getting lots of attention,
and we're improving it with time.


> > @@ -573,6 +874,23 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> >  			  !cpumask_test_cpu(cpu, mm_cpumask(next))))
> >  			cpumask_set_cpu(cpu, mm_cpumask(next));
> >  
> > +		/*
> > +		 * Check if the current mm is transitioning to a new ASID.
> > +		 */
> > +		if (needs_global_asid_reload(next, prev_asid)) {
> > +			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> > +
> > +			choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
> > +			goto reload_tlb;
>
> Not a fan of the goto's when they are not really needed, and I don't
> think it is really needed here. Especially since the name of the label
> "reload_tlb" does not really convey that the page tables are reloaded
> at that point.

In this particular case, the CPU continues running with
the same page tables, but with a different PCID.

>
> > +		}
> > +
> > +		/*
> > +		 * Broadcast TLB invalidation keeps this PCID up to date
> > +		 * all the time.
> > +		 */
> > +		if (is_global_asid(prev_asid))
> > +			return;
>
> Hard for me to convince myself

When a process uses a global ASID, we always send
out TLB invalidations using INVLPGB.

The global ASID should always be up to date.

>
> > @@ -769,6 +1092,16 @@ static void flush_tlb_func(void *info)
> >    if (unlikely(loaded_mm == &init_mm))
> >    return;
> >  
> > +	/* Reload the ASID if transitioning into or out of a global ASID */
> > +	if (needs_global_asid_reload(loaded_mm, loaded_mm_asid)) {
> > +		switch_mm_irqs_off(NULL, loaded_mm, NULL);
>
> I understand you want to reuse that logic, but it doesn't seem
> reasonable to me. It both doesn't convey what you want to do, and can
> lead to undesired operations - cpu_tlbstate_update_lam() for
> instance.
> Probably the impact on performance is minor, but it is an opening for
> future mistakes.

My worry with having a separate code path here is
that the separate code path could bit rot, and we
could introduce bugs that way.

I would rather have a tiny performance impact in
what is a rare code path, than a rare (and hard
to track down) memory corruption due to bit rot.


>
> > +		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> > +	}
> > +
> > +	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
> > +	if (is_global_asid(loaded_mm_asid))
> > +		return;
>
> The comment does not clarify this for me, and I can't clearly explain
> to myself why it is guaranteed that all the IPI TLB flushes which were
> potentially issued before the transition are not needed.
>
IPI TLB flushes that were issued before the transition went
to the CPUs when they were using dynamic ASIDs (numbers 1-5).

Reloading the TLB with a different PCID, even pointed at the
same page tables, means that the TLB should load the
translations fresh from the page tables, and not re-use any
that it had previously loaded under a different PCID.


--
All Rights Reversed.