Re: [PATCH 3/3] arm64: tlb: skip tlbi broadcast

From: Catalin Marinas
Date: Mon Mar 09 2020 - 07:22:51 EST


Hi Andrea,

On Sun, Feb 23, 2020 at 02:25:20PM -0500, Andrea Arcangeli wrote:
> switch_mm(struct mm_struct *prev, struct mm_struct *next,
> struct task_struct *tsk)
> {
> - if (prev != next)
> - __switch_mm(next);
> + unsigned int cpu = smp_processor_id();
> +
> + if (!per_cpu(cpu_not_lazy_tlb, cpu)) {
> + per_cpu(cpu_not_lazy_tlb, cpu) = true;
> + atomic_inc(&next->context.nr_active_mm);
> + __switch_mm(next, cpu);
> + } else if (prev != next) {
> + atomic_inc(&next->context.nr_active_mm);
> + __switch_mm(next, cpu);
> + atomic_dec(&prev->context.nr_active_mm);
> + }

IIUC, nr_active_mm keeps track of how many instances of the current pgd
(TTBR0_EL1) are active.
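To make sure we are reading it the same way, here is the invariant
expressed as a (hypothetical) helper; this is not part of the patch,
just a restatement of the accounting in the hunk above:

	/*
	 * My reading: nr_active_mm is the number of CPUs that currently
	 * have this mm's pgd loaded in TTBR0_EL1 and are not in lazy-TLB
	 * mode for it.
	 */
	static inline int mm_nr_active_cpus(struct mm_struct *mm)
	{
		return atomic_read(&mm->context.nr_active_mm);
	}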

> +enum tlb_flush_types tlb_flush_check(struct mm_struct *mm, unsigned int cpu)
> +{
> + if (atomic_read(&mm->context.nr_active_mm) <= 1) {
> + bool is_local = current->active_mm == mm &&
> + per_cpu(cpu_not_lazy_tlb, cpu);
> + cpumask_t *stale_cpumask = mm_cpumask(mm);
> + unsigned int next_zero = cpumask_next_zero(-1, stale_cpumask);
> + bool local_is_clear = false;
> + if (next_zero < nr_cpu_ids &&
> + (is_local && next_zero == cpu)) {
> + next_zero = cpumask_next_zero(next_zero, stale_cpumask);
> + local_is_clear = true;
> + }
> + if (next_zero < nr_cpu_ids) {
> + cpumask_setall(stale_cpumask);
> + local_is_clear = false;
> + }
> +
> + /*
> + * Enforce CPU ordering between the above
> + * cpumask_setall(mm_cpumask) and the below
> + * atomic_read(nr_active_mm).
> + */
> + smp_mb();
> +
> + if (likely(atomic_read(&mm->context.nr_active_mm)) <= 1) {
> + if (is_local) {
> + if (!local_is_clear)
> + cpumask_clear_cpu(cpu, stale_cpumask);
> + return TLB_FLUSH_LOCAL;
> + }
> + if (atomic_read(&mm->context.nr_active_mm) == 0)
> + return TLB_FLUSH_NO;
> + }
> + }
> + return TLB_FLUSH_BROADCAST;

And this code can then assume that, if nr_active_mm <= 1, no TLBI
broadcast is necessary.
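
For completeness, this is roughly how I'd expect the result to be
consumed by a caller such as flush_tlb_mm(); it's my sketch based on
the hunk above using the existing TLBI helpers, not the patch's actual
code, and it ignores __tlbi_user()/KPTI:

	static inline void flush_tlb_mm(struct mm_struct *mm)
	{
		unsigned long asid = __TLBI_VADDR(0, ASID(mm));
		unsigned int cpu = get_cpu();

		switch (tlb_flush_check(mm, cpu)) {
		case TLB_FLUSH_LOCAL:
			/* only this CPU can hold entries for the ASID */
			dsb(nshst);
			__tlbi(aside1, asid);
			dsb(nsh);
			break;
		case TLB_FLUSH_NO:
			/* nobody has the pgd active; mm_cpumask marks it stale */
			break;
		case TLB_FLUSH_BROADCAST:
			dsb(ishst);
			__tlbi(aside1is, asid);
			dsb(ish);
			break;
		}
		put_cpu();
	}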

One concern I have is the ordering between the TTBR0_EL1 update in
cpu_do_switch_mm() and the nr_active_mm update, as observed from a
different CPU. We only have an ISB for context synchronisation on the
switching CPU, but I don't think the architecture guarantees any
ordering between a sysreg write and a normal memory update. We do have
a DSB, but that's further down in switch_to().
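
Spelling out the kind of interleaving I'm worried about (my own
illustration; it assumes the nr_active_mm increment can still be
sitting in CPU1's store buffer when the TTBR0_EL1 write takes effect):

	CPU0 (pte teardown)               CPU1 (switch_mm to mm)
	-------------------               ----------------------
	                                  atomic_inc(&mm->context.nr_active_mm)
	                                    (inc still in CPU1's store buffer)
	                                  cpu_do_switch_mm(): MSR TTBR0_EL1; ISB
	                                  walker caches a translation for mm
	clear pte
	tlb_flush_check():
	  smp_mb()
	  atomic_read(nr_active_mm) == 1
	  -> TLB_FLUSH_LOCAL, no broadcast
	                                  CPU1 keeps using the now stale entry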

However, what worries me more is that you can now potentially do a TLB
shootdown without clearing the intermediate (e.g. VA to pte level) walk
caches from other CPUs' TLBs. Even if the corresponding pgd and ASID
are no longer active on those CPUs, I'm not sure it is entirely safe to
free (and re-allocate) pages belonging to a pgtable without flushing
the TLB first. All the architecture spec states is that software must
first clear the entry and then issue the TLBI (the break-before-make
rules).
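
To spell out the sequence I have in mind when a pte table is freed (a
simplified sketch; the real path goes through the mmu_gather /
tlb_remove_table machinery, and the single-VA invalidate is only for
illustration):

	static void free_pte_table_example(struct mm_struct *mm, pmd_t *pmdp,
					   unsigned long addr)
	{
		pgtable_t table = pmd_pgtable(*pmdp);

		pmd_clear(pmdp);	/* break: unhook the table */
		dsb(ishst);		/* make the clear visible to walks */
		/*
		 * The TLBI is what drops cached intermediate (walk cache)
		 * entries; without a broadcast, another CPU may keep one
		 * across the free/re-allocation of the page below.
		 */
		__tlbi(vae1is, __TLBI_VADDR(addr, ASID(mm)));
		dsb(ish);
		pte_free(mm, table);	/* only now is reuse safe */
	}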

That said, the benchmark numbers are not very encouraging. Around a 1%
improvement in a single run could just as well be noise. Something like
hackbench may also show a slight impact on the context switch path.
Perhaps a true NUMA machine with hundreds of CPUs would show a
difference, but that depends on how well the TLBI broadcast is
implemented.

Thanks.

--
Catalin