Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
From: Nicholas Piggin
Date: Mon Dec 28 2020 - 22:10:46 EST
Excerpts from Andy Lutomirski's message of December 29, 2020 10:56 am:
> On Mon, Dec 28, 2020 at 4:36 PM Nicholas Piggin <npiggin@xxxxxxxxx> wrote:
>>
>> Excerpts from Andy Lutomirski's message of December 29, 2020 7:06 am:
>> > On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers
>> > <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>> >>
>> >> ----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@xxxxxxxxxx wrote:
>> >>
>> >> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin
>> >> > <linux@xxxxxxxxxxxxxxx> wrote:
>> >> >>
>> >> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:
>> >> >> > After chatting with rmk about this (but without claiming that any of
>> >> >> > this is his opinion), based on the manpage, I think membarrier()
>> >> >> > currently doesn't really claim to be synchronizing caches? It just
>> >> >> > serializes cores. So arguably if userspace wants to use membarrier()
>> >> >> > to synchronize code changes, userspace should first do the code
>> >> >> > change, then flush icache as appropriate for the architecture, and
>> >> >> > then do the membarrier() to ensure that the old code is unused?
>> >>
>> >> ^ exactly, yes.
>> >>
>> >> >> >
>> >> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush()
>> >> >> > syscall. That might cause you to end up with two IPIs instead of one
>> >> >> > in total, but we probably don't care _that_ much about extra IPIs on
>> >> >> > 32-bit arm?
>> >>
>> >> This was the original thinking, yes. The cacheflush IPI will flush specific
>> >> regions of code, and the membarrier IPI issues context synchronizing
>> >> instructions.
>>
>> APIs should be written in terms of the service they provide to
>> userspace, and in highest level terms as possible, rather than directing
>> hardware to do some low level operation. Unfortunately we're stuck with
>> this for now. We could deprecate it and replace it though.
>>
>> If userspace wants to modify code and ensure that after the system call
>> returns then no other thread will be executing the previous code, then
>> there should be an API for that. It could actually combine the two IPIs
>> into one for architectures that require both too.
>
> I agree. The membarrier API for SYNC_CORE is pretty nasty. I would
> much prefer a real API for JITs to use.
>
>>
>> >>
>> >> Architectures with coherent i/d caches don't need the cacheflush step.
>> >
>> > There are different levels of coherency -- VIVT architectures may have
>> > differing requirements compared to PIPT, etc.
>> >
>> > In any case, I feel like the approach taken by the documentation is
>> > fundamentally confusing. Architectures don't all speak the same
>> > language How about something like:
>> >
>> > The SYNC_CORE operation causes all threads in the caller's address
>> > space (including the caller) to execute an architecture-defined
>> > barrier operation. membarrier() will ensure that this barrier is
>> > executed at a time such that all data writes done by the calling
>> > thread before membarrier() are made visible by the barrier.
>> > Additional architecture-dependent cache management operations may be
>> > required to use this for JIT code.
>>
>> As said this isn't what SYNC_CORE does, and it's not what powerpc
>> context synchronizing instructions do either, it will very much
>> re-order visibility of stores around such an instruction.
>
> Perhaps the docs should be entirely arch-specific. It may well be
> impossible to state what it does in an arch-neutral way.
I think what I wrote above -- after the call returns, none of the
threads in the process will be executing instructions overwritten
previously by the caller (provided their i-caches are made coherent
with the caller's modifications).
>> A thread completes store instructions into a store queue, which is
>> as far as a context synchronizing event goes. Visibility comes at
>> some indeterminite time later.
>
> As currently implemented, it has the same visibility semantics as
> regular membarrier, too.
Ah I actually missed that SYNC_CORE is in _addition_ to the memory
barriers, and that's documented API too, not just implementation
sorry.
> So if I do:
>
> a = 1;
> membarrier(SYNC_CORE);
> b = 1;
>
> and another thread does:
>
> while (READ_ONCE(b) != 1)
> ;
> barrier();
> assert(a == 1);
Right that's true, due to the MEMBARRIER_CMD_PRIVATE_EXPEDITED. Neither
that barrier or the added SYNC_CORE behaviour imply visibility though.
>
> then the assertion will pass. Similarly, one can do this, I hope:
>
> memcpy(codeptr, [some new instructions], len);
> arch_dependent_cache_flush(codeptr, len);
> membarrier(SYNC_CORE);
> ready = 1;
>
> and another thread does:
>
> while (READ_ONCE(ready) != 1)
> ;
> barrier();
> (*codeptr)();
>
> arch_dependent_cache_flush is a nop on x86. On arm and arm64, it
> appears to be a syscall, although maybe arm64 can do it from
> userspace. I still don't know what it is on powerpc.
>
> Even using the term "cache" here is misleading. x86 chips have all
> kinds of barely-documented instruction caches, and they have varying
> degrees of coherency. The architecture actually promises that, if you
> do a certain incantation, then you get the desired result.
> membarrier() allows a user to do this incantation. But trying to
> replicate the incantation verbatim on an architecture like ARM is
> insufficient, and trying to flush the things that are documented as
> being caches on x86 is expensive and a complete waste of time on x86.
> When you write some JIT code, you do *not* want to flush it all the
> way to main memory, especially on CPUs don't have the ability to write
> back invalidating. (That's most CPUs.)
>
> Even on x86, I suspect that the various decoded insn caches are rather
> more coherent than documented, and I have questions in to Intel about
> this. No answers yet.
>
> So perhaps the right approach is to say that membarrier() helps you
> perform the architecture-specific sequence of steps needed to safely
> modify code. On x86, you use it like this. On arm64, you do this
> other thing. On powerpc, you do something else.
I think it should certainly be documented in terms of what guarantees
it provides to application, _not_ the kinds of instructions it may or
may not induce the core to execute. And if existing API can't be
re-documented sanely, then deprecatd and new ones added that DTRT.
Possibly under a new system call, if arch's like ARM want a range
flush and we don't want to expand the multiplexing behaviour of
membarrier even more (sigh).
>
>>
>> I would be surprised if x86's serializing instructions were different
>> than powerpc. rdtsc ordering or flushing stores to cache would be
>> surprising.
>>
>
> At the very least, x86 has several levels of what ARM might call
> "context synchronization" AFAICT. STAC, CLAC, and POPF do a form of
> context synchronization in that the changes they cause to the MMU take
> effect immediately, but they are not documented as synchronizing the
> instruction stream. "Serializing" instructions do all kinds of
> things, not all of which may be architecturally visible at all.
> MFENCE and LFENCE do various complicated things, and LFENCE has magic
> retroactive capabilities on old CPUs that were not documented when
> those CPUs were released. SFENCE does a different form of
> synchronization entirely. LOCK does something else. (The
> relationship between LOCK and MFENCE is confusing at best.) RDTSC
> doesn't serialize anything at all, but RDTSCP does provide a form of
> serialization that's kind of ilke LFENCE. It's a mess. Even the
> manuals are inconsistent about what "serialize" means. JMP has its
> own magic on x86, but only on very very old CPUs.
>
> The specific instruction that flushes everything into the coherency
> domain is SFENCE, and SFENCE is not, for normal purposes, needed for
> self- or cross-modifying code.
>
Good reason to avoid such language in the system call interface!
Thanks,
Nick