Re: [RFC][PATCH 1/5] mm: Rework {set,clear,mm}_tlb_flush_pending()

From: Peter Zijlstra
Date: Tue Aug 01 2017 - 12:39:27 EST

On Tue, Aug 01, 2017 at 02:14:19PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 01, 2017 at 10:02:45PM +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2017-08-01 at 11:31 +0100, Will Deacon wrote:
> > > Looks like that's what's currently relied upon:
> > >
> > > /* Clearing is done after a TLB flush, which also provides a barrier. */
> > >
> > > It also provides barrier semantics on arm/arm64. In reality, I suspect
> > > all archs have to provide some order between set_pte_at and flush_tlb_range
> > > which is sufficient to hold up clearing the flag. :/
> >
> > Hrm... not explicitly.
> >
> > Most archs (powerpc among them) have set_pte_at be just a dumb store,
> > so the only barrier it has is the surrounding PTL.
> >
> > Now flush_tlb_range() I assume has some internal strong barriers but
> > none of that is well defined or documented at all, so I suspect all
> > bets are off.
> Right.. but seeing how we're in fact relying on things here it might be
> time to go figure this out and document bits.
> *sigh*, I suppose it's going to be me doing this.. :-)

So on the related question: does on_each_cpu() provide a full smp_mb()?
I think we can answer: yes.

on_each_cpu() does IPIs to all _other_ CPUs, and those IPIs are queued
with llist_add(), which uses cmpxchg(), which implies smp_mb().

After that it runs the local function.

So we can see on_each_cpu() as doing an smp_mb() before running @func.
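
To make that argument concrete, here is a rough user-space sketch in C11
atomics. It is a model, not kernel code: every model_* name is made up
for illustration, and a seq_cst compare-and-swap only stands in for the
kernel's fully ordered cmpxchg(). The point is just the shape: every
remote enqueue is a full barrier, and only then does the local function
run.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct llist_node / struct llist_head. */
struct model_node {
	struct model_node *next;
};

struct model_queue {
	_Atomic(struct model_node *) first;
};

/*
 * Push @node onto @q with a compare-and-swap loop, the way llist_add()
 * does. The seq_cst success ordering is this model's substitute for the
 * smp_mb() implied by a successful kernel cmpxchg().
 */
static void model_llist_add(struct model_node *node, struct model_queue *q)
{
	struct model_node *first =
		atomic_load_explicit(&q->first, memory_order_relaxed);

	do {
		node->next = first;
	} while (!atomic_compare_exchange_weak_explicit(&q->first, &first, node,
							memory_order_seq_cst,
							memory_order_relaxed));
}

/*
 * Model of on_each_cpu(): queue one work item per remote CPU first, then
 * run @func locally. With at least one remote CPU, @func therefore runs
 * after a fully ordered atomic operation.
 */
static void model_on_each_cpu(struct model_queue *remote,
			      struct model_node *work, size_t nr_remote,
			      void (*func)(void *), void *info)
{
	assert(func != NULL);

	for (size_t i = 0; i < nr_remote; i++)
		model_llist_add(&work[i], &remote[i]);

	func(info);
}

/* Example callback: count how often the local function ran. */
static void model_count(void *info)
{
	(*(int *)info)++;
}
```

Calling model_on_each_cpu() with two queues and model_count() as @func
leaves one node on each queue and bumps the counter once.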

xtensa - it uses on_each_cpu() for TLB invalidates.

x86 - we use either on_each_cpu() (flush_tlb_all(),
flush_tlb_kernel_range()) or we use flush_tlb_mm_range() which does an
atomic_inc_return() at the very start. Not to mention that actually
flushing TLBs itself is a barrier. Arguably flush_tlb_mm_range() should
first do _others_ and then self, because others will use
smp_call_function_many() and see above.

(TODO look into paravirt)

Tile - does mb() in flush_remote()

sparc32-smp !?

sparc64 -- nope, no-op functions, TLB flushes are contained inside the PTL.

sh - yes, per smp_call_function

s390 - has atomics when it flushes. ptep_modify_prot_start() can set
mm->flush_mm = 1, at which point flush_tlb_range() will actually do
something, in that case there will be a smp_mb as per the atomics.
Otherwise the TLB invalidate is contained inside the PTL.

powerpc - radix - PTESYNC
          hash - flush inside PTL

parisc - has all PTE and TLB operations serialized using a global lock

mn10300 - *ugh* but yes, smp_call_function() for remote CPUs

mips - smp_call_function for remote CPUs

metag - mmio write

m32r - doesn't seem to have smp_mb()

ia64 - smp_call_function_*()

hexagon - HVM trap, no smp_mb()

blackfin - nommu

arm - dsb ish

arm64 - dsb ish

arc - no barrier

alpha - no barrier

Now, for the architectures that do not have a barrier (alpha, arc,
metag), the PTL spin_unlock has an smp_mb. However, I don't think that
is enough, because the flush_tlb_range() might still be pending at that
point. That said, these architectures probably don't have transparent
huge pages, so it doesn't matter.

Still, this is all rather unsatisfactory. Either we should define
flush_tlb*() to imply a barrier when it's not a no-op (sparc64/ppc-hash)
or simply make clear_tlb_flush_pending() an smp_store_release().

I prefer the latter option.
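
For illustration only, the latter option could look roughly like this
user-space model in C11 atomics. struct model_mm and the model_* helpers
are invented names, not the kernel's, and the acquire load on the reader
side is an assumption of this sketch rather than something the patch
specifies. A release store orders the caller's earlier loads and stores
before the flag is cleared, whatever the architecture's flush_tlb*()
happens to provide.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of struct mm_struct. */
struct model_mm {
	atomic_bool tlb_flush_pending;
};

/*
 * Model of clear_tlb_flush_pending() as a release store: all loads and
 * stores this thread issued earlier are ordered before the flag clear.
 */
static void model_clear_tlb_flush_pending(struct model_mm *mm)
{
	atomic_store_explicit(&mm->tlb_flush_pending, false,
			      memory_order_release);
}

/*
 * Reader side. The acquire load is this sketch's assumption; it pairs
 * with the release store above so that a reader seeing "false" also
 * sees the writer's earlier stores.
 */
static bool model_tlb_flush_pending(struct model_mm *mm)
{
	return atomic_load_explicit(&mm->tlb_flush_pending,
				    memory_order_acquire);
}
```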