Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
From: Will Deacon
Date: Tue Jan 05 2021 - 17:42:27 EST
On Tue, Jan 05, 2021 at 08:20:51AM -0800, Andy Lutomirski wrote:
> > On Jan 5, 2021, at 5:26 AM, Will Deacon <will@xxxxxxxxxx> wrote:
> > Sorry for the slow reply, I was socially distanced from my keyboard.
> >
> >> On Mon, Dec 28, 2020 at 04:36:11PM -0800, Andy Lutomirski wrote:
> >> On Mon, Dec 28, 2020 at 4:11 PM Nicholas Piggin <npiggin@xxxxxxxxx> wrote:
> >>>> +static inline void membarrier_sync_core_before_usermode(void)
> >>>> +{
> >>>> + /*
> >>>> + * XXX: I know basically nothing about powerpc cache management.
> >>>> + * Is this correct?
> >>>> + */
> >>>> + isync();
> >>>
> >>> This is not about memory ordering or cache management, it's about
> >>> pipeline management. Powerpc's return to user mode serializes the
> >>> CPU (aka the hardware thread, _not_ the core; another wrongness of
> >>> the name, but AFAIKS the HW thread is what is required for
> >>> membarrier). So this is wrong, powerpc needs nothing here.
> >>
> >> Fair enough. I'm happy to defer to you on the powerpc details. In
> >> any case, this just illustrates that we need feedback from a person
> >> who knows more about ARM64 than I do.
> >
> > I think we're in a very similar boat to PowerPC, fwiw. Roughly speaking:
> >
> > 1. SYNC_CORE does _not_ perform any cache management; that is the
> > responsibility of userspace, either by executing the relevant
> > maintenance instructions (arm64) or a system call (arm32). Crucially,
> > the hardware will ensure that this cache maintenance is broadcast
> > to all other CPUs.
>
> Is this guaranteed regardless of any aliases? That is, if I flush from
> one CPU at one VA and then execute the same physical address from another
> CPU at a different VA, does this still work?
The data side will be fine, but the instruction side can have virtual
aliases. We handle this in flush_ptrace_access() by blowing away the whole
I-cache if we're not physically-indexed, but userspace would be in trouble
if it wanted to handle this situation alone.
> > 2. Even with all the cache maintenance in the world, a CPU could have
> > speculatively fetched stale instructions into its "pipeline" ahead of
> > time, and these are _not_ flushed by the broadcast maintenance instructions
> > in (1). SYNC_CORE provides a means for userspace to discard these stale
> > instructions.
> >
> > 3. The context synchronization event on exception entry/exit is
> > sufficient here. The Arm ARM isn't very good at describing what it
> > does, because it's in denial about the existence of a pipeline, but
> > it does have snippets such as:
> >
> > (s/PE/CPU/)
> > | For all types of memory:
> > | The PE might have fetched the instructions from memory at any time
> > | since the last Context synchronization event on that PE.
> >
> > Interestingly, the architecture recently added a control bit to remove
> > this synchronisation from exception return, so if we set that then we'd
> > have a problem with SYNC_CORE and adding an ISB would be necessary (and
> > we could probable then make kernel->kernel returns cheaper, but I
> > suspect we're relying on this implicit synchronisation in other places
> > too).
> >
>
> Is ISB just a context synchronization event or does it do more?
That's a good question. Barrier instructions on ARM do tend to get
overloaded with extra behaviours over time, so it could certainly end up
doing the context synchronization event + extra stuff in future. Right now,
the only thing that springs to mind is the spectre-v1 heavy mitigation
barrier of 'DSB; ISB' which, for example, probably doesn't work for 'DSB;
ERET' because the ERET can be treated like a conditional (!) branch.
> On x86, it’s very hard to tell that MFENCE does any more than LOCK, but
> it’s much slower. And we have LFENCE, which, as documented, doesn’t
> appear to have any semantics at all. (Or at least it didn’t before
> Spectre.)
I tend to think of ISB as a front-end barrier relating to instruction fetch
whereas DMB, acquire/release and DSB are all back-end barriers relating to
memory accesses. You _can_ use ISB in conjunction with control dependencies
to order a pair of loads (like you can with ISYNC on Power), but it's a
really expensive way to do it.
> > Are you seeing a problem in practice, or did this come up while trying to
> > decipher the semantics of SYNC_CORE?
>
> It came up while trying to understand the code and work through various
> bugs in it. The code was written using something approximating x86
> terminology, but it was definitely wrong on x86 (at least if you believe
> the SDM, and I haven’t convinced any architects to say otherwise).
Ok, thanks.
Will