Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE

From: Peter Zijlstra
Date: Thu Jun 17 2021 - 10:06:31 EST

On Thu, Jun 17, 2021 at 06:41:41AM -0700, Andy Lutomirski wrote:
> On Thu, Jun 17, 2021, at 4:33 AM, Mark Rutland wrote:

> > Sure, and I agree we should not change cacheflush().
> >
> > The point of membarrier(SYNC_CORE) is that you can move the cost of that
> > ISB out of the fast-path in the executing thread(s) and into the
> > slow-path on the thread which generated the code.
> >
> > So e.g. rather than an executing thread always having to do:
> >
> > LDR <reg>, [<funcptr>]
> > ISB // in case funcptr was just updated
> > BLR <reg>
> >
> > ... you have the thread generating the code use membarrier(SYNC_CORE)
> > prior to plublishing the funcptr, and the fast-path on all the executing
> > threads can be:
> >
> > LDR <reg> [<funcptr>]
> > BLR <reg>
> >
> > ... and thus I think we still want membarrier(SYNC_CORE) so that people
> > can do this, even if there are other means to achieve the same
> > functionality.
> I had the impression that sys_cacheflush() did that. Am I wrong?

Yes, sys_cacheflush() only does what it says on the tin (and only
correctly for hardware broadcast -- everything except 11mpcore).

It only invalidates the caches, but not the per CPU derived state like
prefetch buffers and micro-op buffers, and certainly not instructions
already in flight.

So anything OoO needs at the very least a complete pipeline stall
injected, but probably something stronger to make it flush the buffers.

> In any event, I’m even more convinced that no new SYNC_CORE arches
> should be added. We need a new API that just does the right thing.

I really don't understand why you hate the thing so much; SYNC_CORE is a
means of injecting whatever instruction is required to flush all uarch
state related to instructions on all theads (not all CPUs) of a process
as efficient as possible.

The alternative is sending signals to all threads (including the
non-running ones) which is known to scale very poorly indeed, or, as
Mark suggests above, have very expensive instructions unconditinoally in
the instruction stream, which is also undesired.