Re: [RFC PATCH v2] membarrier: expedited private command

From: Mathieu Desnoyers
Date: Fri Jul 28 2017 - 11:34:22 EST


----- On Jul 28, 2017, at 4:55 AM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:

> On Thu, Jul 27, 2017 at 05:13:14PM -0400, Mathieu Desnoyers wrote:
>> +static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
>> +		struct mm_struct *oldmm)
>
> That is a bit of a mouthful...
>
>> +{
>> +	if (!IS_ENABLED(CONFIG_MEMBARRIER))
>> +		return;
>> +	/*
>> +	 * __schedule()
>> +	 *   finish_task_switch()
>> +	 *     if (mm)
>> +	 *       mmdrop(mm)
>> +	 *         atomic_dec_and_test()
>> +	 *
>> +	 * takes care of issuing a memory barrier when oldmm is
>> +	 * non-NULL. We also don't need the barrier when switching to a
>> +	 * kernel thread, nor when we switch between threads belonging
>> +	 * to the same process.
>> +	 */
>> +	if (likely(oldmm || !mm || mm == oldmm))
>> +		return;
>> +	/*
>> +	 * When switching between processes, membarrier expedited
>> +	 * private requires a memory barrier after we set the current
>> +	 * task.
>> +	 */
>> +	smp_mb();
>> +}
>
> And because of what it complements, I would have expected the callsite:
>
>> @@ -2737,6 +2763,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
>>
>> 	mm = next->mm;
>> 	oldmm = prev->active_mm;
>> +	membarrier_expedited_mb_after_set_current(mm, oldmm);
>> 	/*
>> 	 * For paravirt, this is coupled with an exit in switch_to to
>> 	 * combine the page table reload and the switch backend into
>
> to be in finish_task_switch(), something like:
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e9785f7aed75..33f34a201255 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> 	finish_arch_post_lock_switch();
>
> 	fire_sched_in_preempt_notifiers(current);
> +
> +	/*
> +	 * For CONFIG_MEMBARRIER we need a full memory barrier after the
> +	 * rq->curr assignment. Not all architectures have one in either
> +	 * switch_to() or switch_mm() so we use (and complement) the one
> +	 * implied by mmdrop()'s atomic_dec_and_test().
> +	 */
> 	if (mm)
> 		mmdrop(mm);
> +	else if (IS_ENABLED(CONFIG_MEMBARRIER))
> +		smp_mb();
> +
> 	if (unlikely(prev_state == TASK_DEAD)) {
> 		if (prev->sched_class->task_dead)
> 			prev->sched_class->task_dead(prev);
>
>
> I realize this is sub-optimal if we're switching to a kernel thread, so
> it might want some work, then again, a whole bunch of architectures
> don't in fact need this extra barrier at all.

As discussed on IRC, I plan to go instead for:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 01e3b881ab3a..dd677fb2ee92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2636,6 +2636,11 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	vtime_task_switch(prev);
 	perf_event_task_sched_in(prev, current);
 	finish_lock_switch(rq, prev);
+	/*
+	 * The membarrier system call requires a full memory barrier
+	 * after storing to rq->curr, before going back to user-space.
+	 */
+	smp_mb__after_unlock_lock();
 	finish_arch_post_lock_switch();

 	fire_sched_in_preempt_notifiers(current);

Which is free on most architectures, except those defining
CONFIG_ARCH_WEAK_RELEASE_ACQUIRE. CCing PPC maintainers.
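
For reference, if I recall the definition correctly, smp_mb__after_unlock_lock()
boils down to roughly the following (modulo its exact location and comments in
the tree), which is why it only costs anything on architectures selecting that
option:

#ifdef CONFIG_ARCH_WEAK_RELEASE_ACQUIRE
#define smp_mb__after_unlock_lock()	smp_mb()
#else
#define smp_mb__after_unlock_lock()	do { } while (0)
#endif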

>
>> +static void membarrier_private_expedited(void)
>> +{
>> +	int cpu, this_cpu;
>> +	bool fallback = false;
>> +	cpumask_var_t tmpmask;
>> +
>> +	if (num_online_cpus() == 1)
>> +		return;
>> +
>> +	/*
>> +	 * Matches memory barriers around rq->curr modification in
>> +	 * scheduler.
>> +	 */
>> +	smp_mb(); /* system call entry is not a mb. */
>> +
>
> Weren't you going to put in a comment on that GFP_NOWAIT thing?

I only added it to the uapi header. Adding this to the implementation
too:

+	/*
+	 * Expedited membarrier commands guarantee that they won't
+	 * block, hence the GFP_NOWAIT allocation flag and fallback
+	 * implementation.
+	 */



>
>> +	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
>
> You really want: zalloc_cpumask_var().

ok
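
So with the GFP_NOWAIT comment above and the switch to zalloc_cpumask_var()
folded in, that part would become (untested):

	/*
	 * Expedited membarrier commands guarantee that they won't
	 * block, hence the GFP_NOWAIT allocation flag and fallback
	 * implementation.
	 */
	if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
		/* Fallback for OOM. */
		fallback = true;
	}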

>
>> +		/* Fallback for OOM. */
>> +		fallback = true;
>> +	}
>> +
>> +	/*
>> +	 * Skipping the current CPU is OK even though we can be
>> +	 * migrated at any point. The current CPU, at the point where we
>> +	 * read raw_smp_processor_id(), is ensured to be in program
>> +	 * order with respect to the caller thread. Therefore, we can
>> +	 * skip this CPU from the iteration.
>> +	 */
>> +	this_cpu = raw_smp_processor_id();
>
> So if instead you do the below, that is still true, but you have the
> opportunity to skip more CPUs; then again, if you migrate the wrong way
> you'll end up not skipping yourself... ah well.

The chances of skipping more CPUs in the face of migration are better if we do
the check in the loop as you suggest. Will do.
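
Something like this, then, with this_cpu going away entirely (untested):

	cpus_read_lock();
	for_each_online_cpu(cpu) {
		struct task_struct *p;

		/*
		 * Skipping the current CPU is still OK: wherever this
		 * check happens to run, it is in program order with
		 * respect to the caller thread.
		 */
		if (cpu == raw_smp_processor_id())
			continue;
		rcu_read_lock();
		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
		if (p && p->mm == current->mm) {
			if (!fallback)
				__cpumask_set_cpu(cpu, tmpmask);
			else
				smp_call_function_single(cpu, ipi_mb, NULL, 1);
		}
		rcu_read_unlock();
	}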

>
>> +	cpus_read_lock();
>> +	for_each_online_cpu(cpu) {
>> +		struct task_struct *p;
>> +
> 		if (cpu == raw_smp_processor_id())
> 			continue;
>
>> +		rcu_read_lock();
>> +		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
>> +		if (p && p->mm == current->mm) {
>> +			if (!fallback)
>> +				__cpumask_set_cpu(cpu, tmpmask);
>> +			else
>> +				smp_call_function_single(cpu, ipi_mb, NULL, 1);
>> +		}
>> +		rcu_read_unlock();
>> +	}
>> +	cpus_read_unlock();
>
> This ^, wants to go after that v
>
>> +	if (!fallback) {
>> +		smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
>> +		free_cpumask_var(tmpmask);
>> +	}
>
> Because otherwise the bits in your tmpmask might no longer match the
> online state.

Good point, thanks!
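
Will move the unlock after the IPIs are sent and waited for, i.e. (untested):

	if (!fallback) {
		smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
		free_cpumask_var(tmpmask);
	}
	cpus_read_unlock();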

Mathieu

>
>> +
>> +	/*
>> +	 * Memory barrier on the caller thread _after_ we finished
>> +	 * waiting for the last IPI. Matches memory barriers around
>> +	 * rq->curr modification in scheduler.
>> +	 */
>> +	smp_mb(); /* exit from system call is not a mb */
>> +}

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com