Re: [RFC PATCH v2] membarrier: expedited private command
From: Mathieu Desnoyers
Date: Mon Jul 31 2017 - 21:33:58 EST
----- On Jul 31, 2017, at 8:35 PM, Nicholas Piggin npiggin@xxxxxxxxx wrote:
> On Mon, 31 Jul 2017 23:20:59 +1000
> Michael Ellerman <mpe@xxxxxxxxxxxxxx> wrote:
>
>> Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:
>>
>> > On Fri, Jul 28, 2017 at 10:55:32AM +0200, Peter Zijlstra wrote:
>> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> >> index e9785f7aed75..33f34a201255 100644
>> >> --- a/kernel/sched/core.c
>> >> +++ b/kernel/sched/core.c
>> >> @@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct
>> >> *prev)
>> >> finish_arch_post_lock_switch();
>> >>
>> >> fire_sched_in_preempt_notifiers(current);
>> >> +
>> >> + /*
>> >> + * For CONFIG_MEMBARRIER we need a full memory barrier after the
>> >> + * rq->curr assignment. Not all architectures have one in either
>> >> + * switch_to() or switch_mm() so we use (and complement) the one
>> >> + * implied by mmdrop()'s atomic_dec_and_test().
>> >> + */
>> >> if (mm)
>> >> mmdrop(mm);
>> >> + else if (IS_ENABLED(CONFIG_MEMBARRIER))
>> >> + smp_mb();
>> >> +
>> >> if (unlikely(prev_state == TASK_DEAD)) {
>> >> if (prev->sched_class->task_dead)
>> >> prev->sched_class->task_dead(prev);
>> >>
>> >>
>> >
>> >> a whole bunch of architectures don't in fact need this extra barrier at all.
>> >
>> > In fact, I'm fairly sure its only PPC.
>> >
>> > Because only ARM64 and PPC actually implement ACQUIRE/RELEASE with
>> > anything other than smp_mb() (for now, Risc-V is in this same boat and
>> > MIPS could be if they ever sort out their fancy barriers).
>> >
>> > TSO archs use a regular STORE for RELEASE, but all their atomics imply a
>> > smp_mb() and there are enough around to make one happen (typically
>> > mm_cpumask updates).
>> >
>> > Everybody else, aside from ARM64 and PPC must use smp_mb() for
>> > ACQUIRE/RELEASE.
>> >
>> > ARM64 has a super duper barrier in switch_to().
>> >
>> > Which only leaves PPC stranded.. but the 'good' news is that mpe says
>> > they'll probably need a barrier in switch_mm() in any case.
>>
>> I may have been sleep deprived. We have a patch, probably soon to be
>> merged, which will add a smp_mb() in switch_mm() but *only* when we add
>> a CPU to mm_cpumask, ie. when we run on a CPU we haven't run on before.
>>
>> I'm not across membarrier enough to know if that's sufficient, but it
>> seems unlikely?
>
> Won't be sufficient, they need a barrier after assigning rq->curr.
> It can be avoided when switching between threads with the same mm.
>
> I would like to see how bad membarrier performance is if we made
> that side heavy enough to avoid the barrier in context switch (e.g.,
> by taking the rq locks, or using synchronize_sched_expedited -- on
> a per-arch basis of course).
>
> Is there some (realistic-ish) benchmark using membarrier we can
> experiment with?
Hi Nick,
I have benchmark programs using membarrier in liburcu [1]. They
represent very heavy usage of membarrier (perhaps more extreme than
reality).
The master branch is at:
git://git.liburcu.org/userspace-rcu.git
Below is a test branch I quickly hacked to make urcu use membarrier
MEMBARRIER_CMD_PRIVATE_EXPEDITED:
https://github.com/compudj/userspace-rcu-dev/tree/test-mb-private
An example benchmark is under tests/benchmark
Usage: test_urcu nr_readers nr_writers duration (s) <OPTIONS>
Here is an example on my 16-core Intel machine (with hyperthreading, so don't
expect this benchmark to be very reliable) with 4 reader threads, 1 writer
thread. The writer thread in this benchmark issues a synchronize_rcu() at
each iteration, so typically 2 sys_membarrier calls. I'm compiling with
the configure flag "--disable-sys-membarrier-fallback" to have the full
read-side speed.
Using MEMBARRIER_CMD_SHARED:
$ ./test_urcu 4 1 10
SUMMARY [...] test_urcu testdur 10 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads 17522991852 nr_writes 249 nr_ops 17522992101
438'074'796 reads / reader thread / s
25 updates / s
Using MEMBARRIER_CMD_PRIVATE_EXPEDITED:
$ ./test_urcu 4 1 10
SUMMARY [...] testdur 10 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads 12709653290 nr_writes 799668 nr_ops 12710452958
317'741'332 reads / reader thread / s
79'966 updates / s
As a comparison, refer to the test_urcu_mb benchmark, which uses
a full memory barrier in rcu_read_lock and rcu_read_unlock, and
can therefore do without any sys_membarrier. This is our fallback
when sys_membarrier returns -ENOSYS.
$ ./test_urcu_mb 4 1 10
SUMMARY [...] test_urcu_mb testdur 10 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads 137290634 nr_writes 8624184 nr_ops 145914818
3'432'265 reads / reader thread / s
862'418 updates / s
You could try adapting my membarrier private-expedited patch to grab the
rq lock for each CPU to see the impact on the benchmark for your specific
architecture. I suspect that the overhead that will matter in that case
is not only the speed of the membarrier operation, but also its disruption
of the scheduler rq lock cachelines, which will have effect on other unrelated
workloads. You may need to benchmark another scheduler-intensive workload
executed in parallel with heavy use of sys_membarrier adapted to grab rq locks
to figure this one out.
Let me know if you have questions,
Perhaps Dave Watson will have benchmarks on hazard pointers.
Thanks,
Mathieu
[1] http://liburcu.org
>
> Thanks,
> Nick
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com