Re: [PATCH 7/8] membarrier: Remove arm (32) support for SYNC_CORE
From: Peter Zijlstra
Date: Thu Jun 17 2021 - 10:06:31 EST
On Thu, Jun 17, 2021 at 06:41:41AM -0700, Andy Lutomirski wrote:
> On Thu, Jun 17, 2021, at 4:33 AM, Mark Rutland wrote:
> > Sure, and I agree we should not change cacheflush().
> > The point of membarrier(SYNC_CORE) is that you can move the cost of that
> > ISB out of the fast-path in the executing thread(s) and into the
> > slow-path on the thread which generated the code.
> > So e.g. rather than an executing thread always having to do:
> > LDR <reg>, [<funcptr>]
> > ISB // in case funcptr was just updated
> > BLR <reg>
> > ... you have the thread generating the code use membarrier(SYNC_CORE)
> > prior to publishing the funcptr, and the fast-path on all the executing
> > threads can be:
> > LDR <reg>, [<funcptr>]
> > BLR <reg>
> > ... and thus I think we still want membarrier(SYNC_CORE) so that people
> > can do this, even if there are other means to achieve the same
> > functionality.
> I had the impression that sys_cacheflush() did that. Am I wrong?
Yes, sys_cacheflush() only does what it says on the tin (and only
correctly for hardware broadcast -- everything except 11mpcore).
It only invalidates the caches, but not the per CPU derived state like
prefetch buffers and micro-op buffers, and certainly not instructions
already in flight.
So anything OoO needs at the very least a complete pipeline stall
injected, but probably something stronger to make it flush the buffers.
> In any event, I’m even more convinced that no new SYNC_CORE arches
> should be added. We need a new API that just does the right thing.
I really don't understand why you hate the thing so much; SYNC_CORE is a
means of injecting whatever instruction is required to flush all uarch
state related to instructions on all threads (not all CPUs) of a process
as efficiently as possible.
The alternative is sending signals to all threads (including the
non-running ones) which is known to scale very poorly indeed, or, as
Mark suggests above, having very expensive instructions unconditionally in
the instruction stream, which is also undesired.
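
For concreteness, here is a rough userspace sketch of the pattern Mark describes above: the generating thread pays for membarrier(SYNC_CORE) once before publishing, and the executing threads just load and call. The names funcptr, jit_fn, sync_core_register(), publish_code(), call_published() and stand_in_code() are made up for illustration. The sketch assumes the instructions have already been written and the usual cache maintenance done; it only shows the ordering, not a complete JIT.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/membarrier.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef int (*jit_fn)(void);

/* Hypothetical published entry point, shared by all threads. */
static _Atomic(jit_fn) funcptr;

/* There is no libc wrapper for membarrier(2), so go through syscall(2). */
static int sys_membarrier(int cmd, unsigned int flags)
{
	return (int)syscall(__NR_membarrier, cmd, flags, 0);
}

/* Stand-in for freshly generated code (a real JIT would emit this). */
static int stand_in_code(void)
{
	return 42;
}

/*
 * Call once per process before the first SYNC_CORE barrier.
 * Returns 0 on success, -1 (errno set) if the kernel or the
 * architecture does not provide SYNC_CORE.
 */
static int sync_core_register(void)
{
	return sys_membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}

/*
 * Slow path, run by the thread that generated the code.
 * Assumption: the instructions are already written and cache
 * maintenance has been done by the caller.
 */
static int publish_code(jit_fn fn)
{
	assert(fn != NULL);

	/* Make every thread of this process core-serialize ... */
	if (sys_membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0) != 0)
		return -1;

	/* ... and only then make the new code reachable. */
	atomic_store_explicit(&funcptr, fn, memory_order_release);
	return 0;
}

/*
 * Fast path, run by executing threads: load and call, with no
 * ISB-like instruction. Returns -1 if nothing is published yet.
 */
static int call_published(void)
{
	jit_fn fn = atomic_load_explicit(&funcptr, memory_order_acquire);

	return fn ? fn() : -1;
}
```

If sync_core_register() fails, a caller would have to fall back to one of the alternatives above (signals, or a serializing instruction in the fast path). The acquire load is a conservative choice for the C memory model; the sequence quoted above uses a plain LDR.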