Re: [RFC PATCH v3] membarrier: provide core serialization

From: Will Deacon
Date: Fri Sep 01 2017 - 13:10:49 EST


Hi Mathieu,

On Fri, Sep 01, 2017 at 05:00:38PM +0000, Mathieu Desnoyers wrote:
> ----- On Sep 1, 2017, at 12:25 PM, Will Deacon will.deacon@xxxxxxx wrote:
>
> > On Fri, Sep 01, 2017 at 12:10:07PM -0400, Mathieu Desnoyers wrote:
> >> Add a new MEMBARRIER_FLAG_SYNC_CORE flag to the membarrier
> >> system call. It allows membarrier to issue core serializing barriers in
> >> addition to memory barriers on target threads whenever a membarrier
> >> command is performed.
> >>
> >> It is relevant for reclaim of JIT code, which requires issuing core
> >> serializing barriers on all threads running on behalf of a process
> >> after ensuring the old code is not visible anymore, before re-using
> >> memory for new code.
> >>
> >> The new MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED used with
> >> MEMBARRIER_FLAG_SYNC_CORE flag registers the current process as
> >> requiring core serialization. It may block. It can be used to ensure
> >> MEMBARRIER_CMD_PRIVATE_EXPEDITED never blocks, even the first time it is
> >> invoked by a process with the MEMBARRIER_FLAG_SYNC_CORE flag.
> >>
> >> * Scheduler Overhead Benchmarks
> >>
> >> Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
> >> Linux v4.13-rc6
> >>
> >> Inter-thread scheduling
> >> taskset 01 ./perf bench sched pipe -T
> >>
> >>                        Avg. usecs/op    Std.Dev. usecs/op
> >> Before this change:         2.55             0.10
> >> With this change:           2.49             0.08
> >> SYNC_CORE processes:        2.70             0.10
> >>
> >> Inter-process scheduling
> >> taskset 01 ./perf bench sched pipe
> >>
> >> Before this change:         2.93             0.13
> >> With this change:           2.93             0.13
> >> SYNC_CORE processes:        3.20             0.06
> >>
> >> Changes since v2:
> >> - Rename MEMBARRIER_CMD_REGISTER_SYNC_CORE to
> >> MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED,
> >
> > I'm still not convinced that this registration step is needed (at least
> > for arm, power and x86), but my previous comments were ignored.
>
> I mistakenly thought that your previous comments were addressed in
> other legs of the previous thread, sorry about that.

No problem, thanks for replying this time!

> Let's take x86 as an example. The private expedited membarrier
> command iterates on all cpu runqueues, checking if rq->curr->mm
> matches current->mm, and only sends an IPI if it matches.
>
> We can very well have a CPU for which the scheduler goes back
> and forth between user-space thread and a kernel thread, in
> which case the mm state is kept as is, and rq->curr->mm is
> temporarily saved into rq->curr->active_mm.
>
> This means that while that CPU is executing a kthread, we
> won't send any IPI to that CPU, but it could then schedule
> back a thread belonging to the original process, and then
> we go back executing user-space code without having issued
> any kind of core serializing barrier (assuming we return to
> userspace with sysexit).

Right, ok. I forgot about Andy's sysexit optimisation on x86.
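
To make the missed case concrete, here is a toy userspace model of the scan
described above. It is not kernel code: the structs only mimic the mm and
active_mm fields named in the quoted text, and every function name is made
up for illustration. It assumes a kernel thread has mm == NULL and borrows
the previous task's mm through active_mm.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these types mimic the fields discussed in the thread. */
struct mm { int id; };

struct task {
    struct mm *mm;        /* NULL for a kernel thread */
    struct mm *active_mm; /* mm borrowed while a kernel thread runs */
};

struct cpu { struct task *curr; };

/* CPUs the scan would IPI: those where curr->mm equals the caller's mm. */
static int count_ipi_targets(const struct cpu *cpus, size_t n,
                             const struct mm *caller)
{
    int hits = 0;

    assert(caller != NULL);
    for (size_t i = 0; i < n; i++)
        if (cpus[i].curr->mm == caller)
            hits++;
    return hits;
}

/* CPUs the scan skips even though they still carry the caller's mm. */
static int count_skipped(const struct cpu *cpus, size_t n,
                         const struct mm *caller)
{
    int skipped = 0;

    assert(caller != NULL);
    for (size_t i = 0; i < n; i++) {
        const struct task *t = cpus[i].curr;

        if (t->mm == NULL && t->active_mm == caller)
            skipped++;
    }
    return skipped;
}

/*
 * Scenario: CPU0 runs a user thread of the process, CPU1 runs a kernel
 * thread that borrowed its mm, CPU2 runs an unrelated process.
 * Returns targets * 10 + skipped.
 */
int demo_scan(void)
{
    struct mm ours = { 1 }, other = { 2 };
    struct task user = { &ours, &ours };
    struct task kthread = { NULL, &ours };
    struct task stranger = { &other, &other };
    struct cpu cpus[] = { { &user }, { &kthread }, { &stranger } };
    size_t n = sizeof(cpus) / sizeof(cpus[0]);

    return count_ipi_targets(cpus, n, &ours) * 10 +
           count_skipped(cpus, n, &ours);
}
```

In this model CPU1 is the one that gets no IPI and can later resume a
thread of the process without a core serializing barrier.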

> Now about arm64, given that as you say it issues a core serializing
> barrier when returning to user-space, and has a strong barrier
> in switch_to, this means that the explicit sync_core() in sched_in
> is not needed.

Good, that's what I thought.

> However, AFAIU, arm64 does not guarantee consistent data and instruction
> caches.

Correct, but:

* On 32-bit arm, we have a syscall to do that (and this is already used by
JITs and things like __builtin___clear_cache)

* On arm64, cache maintenance instructions are directly available to
userspace

In both cases, the maintenance is broadcast by the hardware to all CPUs.
The only part that cannot be broadcast is the pipeline flush, which is
the part we need to do above and is implicit on exception return.
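
As a rough illustration of the userspace side, a JIT can copy its new
instructions into place and then ask the compiler for whatever cache
maintenance the target needs. This is only a sketch: publish_code() and
demo_publish() are hypothetical names, the buffer is ordinary data memory
that is never executed, and the bytes are placeholders. The only real
interface used is the GCC/Clang builtin __builtin___clear_cache.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy freshly generated instructions into place, then perform the cache
 * maintenance for that range. The builtin expands to whatever the target
 * requires (nothing at all on x86).
 */
void *publish_code(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, len);
    __builtin___clear_cache((char *)dst, (char *)dst + len);
    return dst;
}

/* Returns 1 if the bytes landed in the (stand-in) code buffer. */
int demo_publish(void)
{
    static unsigned char region[64];                  /* stand-in for a JIT buffer */
    const unsigned char insns[] = { 0x01, 0x02, 0x03, 0x04 }; /* placeholder bytes */

    publish_code(region, insns, sizeof(insns));
    return memcmp(region, insns, sizeof(insns)) == 0;
}
```

Note that this covers only the cache maintenance. It does nothing about
the pipeline of other CPUs, which is the part under discussion here.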

> I'm actually trying to wrap my head around what would be the sequence
> of operations of a JIT trying to reclaim memory. Can we combine
> core serialization and instruction cache flushing into a single
> system call invocation, or do we need to split this into two separate
> operations?

I think that cache-flushing and pipeline-flushing should be separated,
as they tend to be in the CPU architectures I'm familiar with.

> The JIT reclaim usage scheme I envision is:
>
> - userspace unpublishes all references to old code,
> - userspace ensures no thread uses the old code anymore,
> - sys_membarrier
>   - for each executing thread
>     - issue core serializing barrier
> - userspace uses a separate system call to issue a data cache flush for
>   the modified range
> - sys_membarrier
>   - for each executing thread
>     - issue instruction cache flush
>
> So my current thinking is that we may need to change the membarrier
> system call so one command serializes the core, and a separate command
> issues cache flush.

Yeah, and the sequence is slightly different I think, as we need the
pipeline flush to come *after* the I-cache invalidation (otherwise the
stale instructions can just be refetched).
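
One possible reading of that ordering, as a sketch only: every helper below
is a hypothetical stub that just records the step it stands for, and
serialize_all_threads() is a placeholder for whatever membarrier command
ends up providing core serialization. Nothing here is an existing API
apart from __builtin___clear_cache.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum step { UNPUBLISH, QUIESCE, WRITE_CODE, CACHE_MAINT, SYNC_CORE_ALL };

#define MAX_STEPS 8
static enum step trace[MAX_STEPS];
static size_t trace_len;

static void record(enum step s)
{
    assert(trace_len < MAX_STEPS);
    trace[trace_len++] = s;
}

/* Hypothetical stubs: a real JIT would do actual work in each of these. */
static void unpublish_old_code(void)  { record(UNPUBLISH); }
static void wait_until_unused(void)   { record(QUIESCE); }

static void write_new_code(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    record(WRITE_CODE);
}

static void sync_caches(void *p, size_t len)
{
    /* Cache maintenance for the modified range. */
    __builtin___clear_cache((char *)p, (char *)p + len);
    record(CACHE_MAINT);
}

static void serialize_all_threads(void)
{
    /* Placeholder for a core serializing barrier on every thread. */
    record(SYNC_CORE_ALL);
}

/* Returns 1 if the pipeline flush came last, right after the cache work. */
int reuse_code_region(void *region, const void *insns, size_t len)
{
    trace_len = 0;
    unpublish_old_code();
    wait_until_unused();
    write_new_code(region, insns, len);
    sync_caches(region, len);   /* I-cache invalidation first ...  */
    serialize_all_threads();    /* ... then the pipeline flush     */

    return trace_len == 5 &&
           trace[3] == CACHE_MAINT && trace[4] == SYNC_CORE_ALL;
}

int demo_reuse(void)
{
    static unsigned char region[32];                  /* stand-in code buffer */
    const unsigned char insns[] = { 0x01, 0x02, 0x03, 0x04 }; /* placeholder */

    return reuse_code_region(region, insns, sizeof(insns));
}
```

Swapping the last two calls is exactly the case where stale instructions
could be refetched after the pipeline flush.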

If you're at LPC in a week's time, this might be a good thing to sit down
and bash our heads against (espec. if we can grab PPC and x86 folks too).

Will