Re: [RFC PATCH v3] membarrier: provide core serialization

From: Mathieu Desnoyers
Date: Fri Sep 01 2017 - 13:00:45 EST


----- On Sep 1, 2017, at 12:25 PM, Will Deacon will.deacon@xxxxxxx wrote:

> On Fri, Sep 01, 2017 at 12:10:07PM -0400, Mathieu Desnoyers wrote:
>> Add a new MEMBARRIER_FLAG_SYNC_CORE flag to the membarrier
>> system call. It allows membarrier to issue core serializing barriers in
>> addition to memory barriers on target threads whenever a membarrier
>> command is performed.
>>
>> It is relevant for reclaim of JIT code, which requires to issue core
>> serializing barriers on all threads running on behalf of a process
>> after ensuring the old code is not visible anymore, before re-using
>> memory for new code.
>>
>> The new MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED used with
>> MEMBARRIER_FLAG_SYNC_CORE flag registers the current process as
>> requiring core serialization. It may block. It can be used to ensure
>> MEMBARRIER_CMD_PRIVATE_EXPEDITED never blocks, even the first time it is
>> invoked by a process with the MEMBARRIER_FLAG_SYNC_CORE flag.
>>
>> * Scheduler Overhead Benchmarks
>>
>> Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
>> Linux v4.13-rc6
>>
>> Inter-thread scheduling
>> taskset 01 ./perf bench sched pipe -T
>>
>>                       Avg. usecs/op   Std.Dev. usecs/op
>> Before this change:   2.55            0.10
>> With this change:     2.49            0.08
>> SYNC_CORE processes:  2.70            0.10
>>
>> Inter-process scheduling
>> taskset 01 ./perf bench sched pipe
>>
>> Before this change:   2.93            0.13
>> With this change:     2.93            0.13
>> SYNC_CORE processes:  3.20            0.06
>>
>> Changes since v2:
>> - Rename MEMBARRIER_CMD_REGISTER_SYNC_CORE to
>> MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED,
>
> I'm still not convinced that this registration step is needed (at least
> for arm, power and x86), but my previous comments were ignored.

I mistakenly thought that your previous comments were addressed in
other legs of the previous thread, sorry about that.

Let's take x86 as an example. The private expedited membarrier
command iterates over all CPU runqueues, checks whether rq->curr->mm
matches current->mm, and only sends an IPI if it does.

We can very well have a CPU for which the scheduler goes back
and forth between a user-space thread and a kernel thread, in
which case the mm state is kept as is, and rq->curr->mm is
temporarily saved into rq->curr->active_mm.

This means that while that CPU is executing a kthread, we
won't send any IPI to that CPU, but it could then schedule
back a thread belonging to the original process, and then
we go back executing user-space code without having issued
any kind of core serializing barrier (assuming we return to
userspace with sysexit).
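
To make the scenario concrete, here is a small user-space model of the
target selection. It is a sketch only, not kernel code: the struct and
function names are made up for illustration, and only the mm and
active_mm fields mirror what is discussed above.

```c
#include <stddef.h>

/* Toy stand-ins for the kernel structures; only the fields discussed
 * above are modelled. */
struct mm { int id; };

struct task {
	struct mm *mm;        /* NULL for a kernel thread */
	struct mm *active_mm; /* mm kept in place while a kthread runs */
};

struct rq { struct task *curr; };

/* Count the CPUs that would receive an IPI: those whose current task
 * has the same mm as the caller. */
int count_ipi_targets(const struct rq *rqs, int nr_cpus,
		      const struct mm *caller_mm)
{
	int cpu, n = 0;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		if (rqs[cpu].curr->mm == caller_mm)
			n++;
	return n;
}

/* Two CPUs: CPU 0 runs a thread of the process, CPU 1 runs a kthread
 * that was scheduled on top of the same process's mm. */
int demo_kthread_cpu_is_skipped(void)
{
	struct mm process_mm = { 1 };
	struct task user_thread = { &process_mm, &process_mm };
	struct task kthread = { NULL, &process_mm };
	struct rq rqs[2] = { { &user_thread }, { &kthread } };

	return count_ipi_targets(rqs, 2, &process_mm);
}
```

In this model only CPU 0 is counted as an IPI target. CPU 1 is skipped
even though it may switch back to a thread of the process right after.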

Now about arm64, given that as you say it issues a core serializing
barrier when returning to user-space, and has a strong barrier
in switch_to, this means that the explicit sync_core() in sched_in
is not needed.

However, AFAIU, arm64 does not guarantee consistent data and instruction
caches.

I'm actually trying to wrap my head around what would be the sequence
of operations of a JIT trying to reclaim memory. Can we combine
core serialization and instruction cache flushing into a single
system call invocation, or do we need to split this into two separate
operations?

The JIT reclaim usage scheme I envision is:

- userspace unpublishes all references to old code,
- userspace ensures no thread uses the old code anymore,
- sys_membarrier
  - for each executing thread
    - issue core serializing barrier
- userspace uses a separate system call to issue a data cache flush for
  the modified range
- sys_membarrier
  - for each executing thread
    - issue instruction cache flush
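
As a rough illustration of the ordering above, here is a user-space
sketch. Every name in it is a hypothetical placeholder rather than an
existing API, and the stub only records which step it stands for
instead of doing the real work.

```c
#include <stddef.h>

/* The five steps of the scheme above, in order. */
enum reclaim_step {
	STEP_UNPUBLISH,
	STEP_WAIT_FOR_THREADS,
	STEP_MEMBARRIER_SYNC_CORE,
	STEP_DCACHE_FLUSH,
	STEP_MEMBARRIER_ICACHE_FLUSH,
	NR_STEPS
};

struct reclaim_log {
	enum reclaim_step steps[NR_STEPS];
	size_t nr;
};

/* Hypothetical placeholder: a real JIT or kernel would do the actual
 * work of step s here; this model only records that it happened. */
void do_step(struct reclaim_log *log, enum reclaim_step s)
{
	if (log->nr < NR_STEPS)
		log->steps[log->nr++] = s;
}

/* Reclaim [start, start + len) following the ordering listed above. */
void jit_reclaim(struct reclaim_log *log, void *start, size_t len)
{
	(void)start;	/* the range is unused by the recording stub */
	(void)len;

	do_step(log, STEP_UNPUBLISH);
	do_step(log, STEP_WAIT_FOR_THREADS);
	do_step(log, STEP_MEMBARRIER_SYNC_CORE);
	do_step(log, STEP_DCACHE_FLUSH);
	do_step(log, STEP_MEMBARRIER_ICACHE_FLUSH);
}
```

The point of the sketch is only the ordering: the two sys_membarrier
invocations bracket the data cache flush, which is why a single
combined invocation may not be enough.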

So my current thinking is that we may need to change the membarrier
system call so one command serializes the core, and a separate command
issues cache flush.

By the way, is there a system call on arm64 and arm32 allowing user-space
to flush a range of the user data cache?

>
>> - Introduce the "MEMBARRIER_FLAG_SYNC_CORE" flag.
>> - Introduce CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE, only implemented by
>> x86 32/64 initially.
>> - Introduce arch_membarrier_user_icache_flush, a no-op on x86 32/64,
>> which can be implemented on architectures with incoherent data and
>> instruction caches. It is associated with
>> CONFIG_ARCH_HAS_MEMBARRIER_USER_ICACHE_FLUSH.
>
> Given that MEMBARRIER_FLAG_SYNC_CORE is about flushing the internal CPU
> pipeline (iiuc), could we rename this so that it doesn't mention the
> I-cache, please? I-cache flushing is a very different operation on most
> architectures I'm aware of, and on arm64 it's even available to userspace
> (and broadcast in hardware to other cores).

I'm starting to think we may need to expose separate membarrier commands
for core_sync and icache flush. Am I on the right path, or am I missing
something here?
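
To show what I mean by separate commands, here is a minimal user-space
model. The command names and values are hypothetical and for
illustration only; they are not existing membarrier ABI, and the
per-CPU counters stand in for the real barrier and flush operations.

```c
/* Hypothetical command values, for illustration only; these are not
 * existing membarrier ABI. */
enum sketch_cmd {
	SKETCH_CMD_SYNC_CORE    = 1 << 0,
	SKETCH_CMD_ICACHE_FLUSH = 1 << 1,
};

/* Per-CPU counters standing in for the real operations. */
struct cpu_ops {
	int core_syncs;
	int icache_flushes;
};

/* Run exactly one kind of operation on each target CPU.
 * Returns 0 on success, -1 for an unknown or combined command. */
int sketch_membarrier(int cmd, struct cpu_ops *cpus, int nr_cpus)
{
	int cpu;

	if (cmd != SKETCH_CMD_SYNC_CORE && cmd != SKETCH_CMD_ICACHE_FLUSH)
		return -1;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (cmd == SKETCH_CMD_SYNC_CORE)
			cpus[cpu].core_syncs++;
		else
			cpus[cpu].icache_flushes++;
	}
	return 0;
}
```

With this split, a caller that only needs core serialization never
pays for an icache flush, and the reverse.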

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com