Re: [PATCH 2/2] arm64: tlb: skip tlbi broadcast for single threaded TLB flushes

From: Andrea Arcangeli
Date: Mon Feb 10 2020 - 15:14:45 EST


Hello Catalin,

On Mon, Feb 10, 2020 at 05:51:06PM +0000, Catalin Marinas wrote:
> Relying on mm_users is not sufficient AFAICT. Let's say on CPU0 you have
> a kernel thread running with the previous user pgd and ASID set in
> ttbr0_el1. The mm_users would still be 1 since only mm_count is
> incremented in context_switch(). If the user thread now runs on CPU1, a
> local tlbi would only invalidate the TLBs on CPU1. However, CPU0 may
> still walk (speculatively) the user page tables.
>
> An example where this matters is a group of small pages converted to a
> huge page. If CPU0 already has some TLB entries for small pages in the
> group but, not being aware of a TLBI for the ptes in the range, may read
> a block pmd entry (huge page) and we end up with a TLB conflict on CPU0
> (CPU1 is fine since you do the local tlbi).
>
> There are other examples where this could go wrong as the hardware may
> keep intermediate pgtable entries in a walk cache. In the arm64 kernel
> we rely on something the architecture calls break-before-make for any
> page table updates and these need to be broadcast to other CPUs that may
> potentially have an entry in their TLB.
>
> It may be better if you used mm_cpumask to mark wherever an mm ever ran
> than relying on mm_users.

Agreed.

If we can use mm_cpumask to track where the mm ever ran, then if I'm
not mistaken we could also optimize multithreaded processes in the
same way: if only one thread is running frequently and the others are
frequently sleeping, we could issue a single tlbi broadcast (modulo
invalidates of small virtual ranges).
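
As a rough illustration of the mm_cpumask idea, here is a minimal
user-space sketch, not kernel code. It assumes at most 64 CPUs and
models the mask as a plain bitmask; struct mm_model and all three
helpers are hypothetical names invented for this example. The reset
helper additionally assumes that no other CPU still has the mm loaded
when the broadcast completes, which real code would have to guarantee.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, at most 64 CPUs. */
struct mm_model {
	uint64_t ever_ran_mask;	/* CPUs that ever loaded this mm's ASID */
};

/* Record that the mm was switched in on `cpu` (context-switch time). */
static inline void mm_model_mark_cpu(struct mm_model *mm, unsigned int cpu)
{
	assert(cpu < 64);
	mm->ever_ran_mask |= UINT64_C(1) << cpu;
}

/*
 * True if no CPU other than `cpu` can hold TLB entries for this mm,
 * so a local invalidate would be enough. An empty mask also returns true.
 */
static inline bool mm_model_flush_can_be_local(const struct mm_model *mm,
					       unsigned int cpu)
{
	assert(cpu < 64);
	return (mm->ever_ran_mask & ~(UINT64_C(1) << cpu)) == 0;
}

/*
 * After one broadcast invalidate, stale entries on the other CPUs are
 * gone, so the mask can shrink back to the flushing CPU.
 * ASSUMPTION: no other CPU currently has this mm loaded.
 */
static inline void mm_model_reset_after_broadcast(struct mm_model *mm,
						  unsigned int cpu)
{
	assert(cpu < 64);
	mm->ever_ran_mask = UINT64_C(1) << cpu;
}
```

In this model a multithreaded process whose other threads are asleep
pays for one broadcast, after which later flushes stay local until
another CPU loads the mm again.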

In the meantime the below should be enough to address the concern you
raised about the proof-of-concept RFC patch.

I already experimented with mm_users == 1 earlier and it doesn't
change the benchmark results for the "best case" below.

(untested)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 772bbc45b867..a2d53b301f22 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -169,7 +169,8 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
unsigned long asid = __TLBI_VADDR(0, ASID(mm));

/* avoid TLB-i broadcast to remote NUMA nodes if it's a local flush */
- if (current->mm == mm && atomic_read(&mm->mm_users) <= 1) {
+ if (current->mm == mm && atomic_read(&mm->mm_users) <= 1 &&
+ (system_uses_ttbr0_pan() || atomic_read(&mm->mm_count) == 1)) {
int cpu = get_cpu();

cpumask_setall(mm_cpumask(mm));
@@ -177,7 +178,9 @@ static inline void flush_tlb_mm(struct mm_struct *mm)

smp_mb();

- if (atomic_read(&mm->mm_users) <= 1) {
+ if (atomic_read(&mm->mm_users) <= 1 &&
+ (system_uses_ttbr0_pan() ||
+ atomic_read(&mm->mm_count) == 1)) {
dsb(nshst);
__tlbi(aside1, asid);
__tlbi_user(aside1, asid);
@@ -212,7 +215,8 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
unsigned long addr = __TLBI_VADDR(uaddr, ASID(mm));

/* avoid TLB-i broadcast to remote NUMA nodes if it's a local flush */
- if (current->mm == mm && atomic_read(&mm->mm_users) <= 1) {
+ if (current->mm == mm && atomic_read(&mm->mm_users) <= 1 &&
+ (system_uses_ttbr0_pan() || atomic_read(&mm->mm_count) == 1)) {
int cpu = get_cpu();

cpumask_setall(mm_cpumask(mm));
@@ -220,7 +224,9 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,

smp_mb();

- if (atomic_read(&mm->mm_users) <= 1) {
+ if (atomic_read(&mm->mm_users) <= 1 &&
+ (system_uses_ttbr0_pan() ||
+ atomic_read(&mm->mm_count) == 1)) {
dsb(nshst);
__tlbi(vale1, addr);
__tlbi_user(vale1, addr);
@@ -264,7 +270,8 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
end = __TLBI_VADDR(end, asid);

/* avoid TLB-i broadcast to remote NUMA nodes if it's a local flush */
- if (current->mm == mm && atomic_read(&mm->mm_users) <= 1) {
+ if (current->mm == mm && atomic_read(&mm->mm_users) <= 1 &&
+ (system_uses_ttbr0_pan() || atomic_read(&mm->mm_count) == 1)) {
int cpu = get_cpu();

cpumask_setall(mm_cpumask(mm));
@@ -272,7 +279,9 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,

smp_mb();

- if (atomic_read(&mm->mm_users) <= 1) {
+ if (atomic_read(&mm->mm_users) <= 1 &&
+ (system_uses_ttbr0_pan() ||
+ atomic_read(&mm->mm_count) == 1)) {
dsb(nshst);
for (addr = start; addr < end; addr += stride) {
if (last_level) {
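
The condition repeated in each hunk above can be read as a single
predicate. The following is a user-space sketch only: struct mm_counts
and can_skip_broadcast() are hypothetical names, the atomics are
replaced by plain ints, and the result of system_uses_ttbr0_pan() is
passed in as a boolean. The reading of mm_count == 1 as "no kernel
thread is lazily holding this mm" is an interpretation of the concern
quoted earlier, not something the patch states.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two reference counts in mm_struct. */
struct mm_counts {
	int mm_users;	/* users of the address space (threads etc.) */
	int mm_count;	/* references to the mm_struct itself */
};

/*
 * Mirrors the patched check: the flush may stay local only if
 *  - the mm being flushed is the current one,
 *  - it has at most one user, and
 *  - either the system uses software TTBR0 PAN, or nobody else holds
 *    a reference to the mm (mm_count == 1).
 */
static inline bool can_skip_broadcast(bool is_current_mm,
				      const struct mm_counts *c,
				      bool uses_ttbr0_pan)
{
	return is_current_mm &&
	       c->mm_users <= 1 &&
	       (uses_ttbr0_pan || c->mm_count == 1);
}
```

For example, with mm_users == 1 and mm_count == 2 the predicate is
false unless uses_ttbr0_pan is set, which is the case the extra check
is meant to exclude.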


> That's a pretty artificial test and it is indeed improved by this patch.
> However, it would be nice to have some real-world scenarios where this
> matters.

I don't know exactly how much we should rely on the hardware to snoop
the asid on NUMA. To fully optimize this, the hardware would need to
implement a replicated mm_cpumask bitflag for each asid, and every CPU
would need to tell every other CPU which asid it is loading every time
it loads one. That is exactly what x86 does with mm_cpumask in software.

That is ideal, but is it an arch requirement to add the above in all
implementations?

The case I measured has a single socket so it's even simpler because
it could be optimized all in-core. Even with a single socket I'm not
sure what's going wrong in the chip: it felt like it's the engine that
does the broadcast that runs serially system wide and then all CPUs
have to wait on it.

Still, your question of whether it'll make a difference in practice is
a good one and I don't have a sure answer yet. I suppose before doing more
benchmarking it's better to make a new version of this that uses
mm_cpumask to track where the asid was ever loaded as you suggested,
so that it will also optimize away tlbi broadcasts from multithreaded
processes where only one thread is running frequently?

Thanks!
Andrea