Re: [RFC PATCH v2] membarrier: expedited private command
From: Nicholas Piggin
Date: Mon Jul 31 2017 - 20:36:04 EST
On Mon, 31 Jul 2017 23:20:59 +1000
Michael Ellerman <mpe@xxxxxxxxxxxxxx> wrote:
> Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:
>
> > On Fri, Jul 28, 2017 at 10:55:32AM +0200, Peter Zijlstra wrote:
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index e9785f7aed75..33f34a201255 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> >> finish_arch_post_lock_switch();
> >>
> >> fire_sched_in_preempt_notifiers(current);
> >> +
> >> + /*
> >> + * For CONFIG_MEMBARRIER we need a full memory barrier after the
> >> + * rq->curr assignment. Not all architectures have one in either
> >> + * switch_to() or switch_mm() so we use (and complement) the one
> >> + * implied by mmdrop()'s atomic_dec_and_test().
> >> + */
> >> if (mm)
> >> mmdrop(mm);
> >> + else if (IS_ENABLED(CONFIG_MEMBARRIER))
> >> + smp_mb();
> >> +
> >> if (unlikely(prev_state == TASK_DEAD)) {
> >> if (prev->sched_class->task_dead)
> >> prev->sched_class->task_dead(prev);
> >>
> >>
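For context, the membarrier side that this scheduler barrier pairs with
looks roughly like the sketch below. Illustrative only, not the proposed
patch: ipi_mb() and membarrier_private_expedited_sketch() are made-up
names, cpu_rq() is the scheduler-internal accessor, and task lifetime
handling is elided.

static void ipi_mb(void *info)
{
	smp_mb();	/* IPIs are serializing, but make the barrier explicit */
}

static void membarrier_private_expedited_sketch(void)
{
	cpumask_var_t tmpmask;
	int cpu;

	if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
		return;

	smp_mb();	/* order prior accesses against the rq->curr reads below */

	cpus_read_lock();
	for_each_online_cpu(cpu) {
		struct task_struct *p;

		/*
		 * If we observe a CPU whose rq->curr has another mm, the
		 * smp_mb() after the rq->curr assignment (hunk above)
		 * guarantees that CPU already executed a full barrier,
		 * so skipping the IPI is safe.
		 */
		rcu_read_lock();
		p = READ_ONCE(cpu_rq(cpu)->curr);
		if (p && p->mm == current->mm)
			__cpumask_set_cpu(cpu, tmpmask);
		rcu_read_unlock();
	}
	smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
	cpus_read_unlock();

	smp_mb();	/* order subsequent accesses against the IPIs */
	free_cpumask_var(tmpmask);
}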
> >
> >> a whole bunch of architectures don't in fact need this extra barrier at all.
> >
> > In fact, I'm fairly sure it's only PPC.
> >
> > Because only ARM64 and PPC actually implement ACQUIRE/RELEASE with
> > anything other than smp_mb() (for now, RISC-V is in this same boat and
> > MIPS could be if they ever sort out their fancy barriers).
> >
> > TSO archs use a regular STORE for RELEASE, but all their atomics imply
> > an smp_mb() and there are enough of them around to make one happen
> > (typically mm_cpumask updates).
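(For illustration, the implied barrier Peter means comes from the
mm_cpumask update on the context-switch path; a sketch of the typical
shape -- see the x86 switch_mm_irqs_off() for the real thing;
switch_mm_cpumask_sketch() is a made-up name:)

static inline void switch_mm_cpumask_sketch(struct mm_struct *prev,
					    struct mm_struct *next,
					    unsigned int cpu)
{
	if (prev != next) {
		/*
		 * Atomic RMWs: on x86 these are LOCK-prefixed
		 * instructions, i.e. full memory barriers.
		 */
		cpumask_set_cpu(cpu, mm_cpumask(next));
		cpumask_clear_cpu(cpu, mm_cpumask(prev));
	}
}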
> >
> > Everybody else, aside from ARM64 and PPC, must use smp_mb() for
> > ACQUIRE/RELEASE.
> >
> > ARM64 has a super duper barrier in switch_to().
> >
> > Which only leaves PPC stranded... but the 'good' news is that mpe says
> > they'll probably need a barrier in switch_mm() in any case.
>
> I may have been sleep deprived. We have a patch, probably soon to be
> merged, which will add an smp_mb() in switch_mm(), but *only* when we add
> a CPU to mm_cpumask, i.e. when we run on a CPU we haven't run on before.
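(For reference, that pending change has roughly this shape -- my sketch,
not the actual powerpc patch; ppc_switch_mm_sketch() is a made-up name:)

static inline void ppc_switch_mm_sketch(struct mm_struct *next, int cpu)
{
	/* Pay for the barrier only the first time this mm runs here. */
	if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
		cpumask_set_cpu(cpu, mm_cpumask(next));
		smp_mb();
	}
}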
>
> I'm not across membarrier enough to know if that's sufficient, but it
> seems unlikely?
That won't be sufficient; membarrier needs a full barrier after the
rq->curr assignment. The barrier can be avoided when switching between
threads with the same mm.
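Roughly this shape, as a sketch (membarrier_switch_mm_barrier() is a
made-up name, where exactly it sits in the switch path is hand-waved,
and kernel threads with a NULL mm are glossed over):

static inline void membarrier_switch_mm_barrier(struct task_struct *prev,
						struct task_struct *next)
{
	/*
	 * membarrier only IPIs CPUs running the caller's mm, so a
	 * full barrier is needed only when the mm changes.
	 */
	if (IS_ENABLED(CONFIG_MEMBARRIER) && prev->mm != next->mm)
		smp_mb();
}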
I would like to see how bad membarrier performance is if we made that
side heavy enough to avoid the barrier in the context switch (e.g., by
taking the rq locks, or by using synchronize_sched_expedited() -- on a
per-arch basis, of course).
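E.g., the rq-lock variant could look something like this (sketch only;
membarrier_heavy_sketch() is a made-up name, and the raw rq->lock
spelling follows the current scheduler internals):

static void membarrier_heavy_sketch(cpumask_var_t tmpmask)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct rq *rq = cpu_rq(cpu);
		struct task_struct *p;

		/*
		 * rq->curr is updated with rq->lock held, so reading
		 * it under the lock serializes against the context
		 * switch and needs no barrier on the scheduler side.
		 */
		raw_spin_lock_irq(&rq->lock);
		p = rq->curr;
		if (p->mm == current->mm)
			__cpumask_set_cpu(cpu, tmpmask);
		raw_spin_unlock_irq(&rq->lock);
	}
	/* then IPI the collected CPUs as in the light variant */
}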
Is there some (realistic-ish) benchmark using membarrier we can
experiment with?
Thanks,
Nick