Re: [RFC PATCH v2] membarrier: expedited private command
From: Peter Zijlstra
Date: Fri Jul 28 2017 - 04:55:58 EST
On Thu, Jul 27, 2017 at 05:13:14PM -0400, Mathieu Desnoyers wrote:
> +static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
> + struct mm_struct *oldmm)
That is a bit of a mouthful...
> +{
> + if (!IS_ENABLED(CONFIG_MEMBARRIER))
> + return;
> + /*
> + * __schedule()
> + * finish_task_switch()
> + * if (mm)
> + * mmdrop(mm)
> + * atomic_dec_and_test()
> + *
> + * takes care of issuing a memory barrier when oldmm is
> + * non-NULL. We also don't need the barrier when switching to a
> + * kernel thread, nor when we switch between threads belonging
> + * to the same process.
> + */
> + if (likely(oldmm || !mm || mm == oldmm))
> + return;
> + /*
> + * When switching between processes, membarrier expedited
> + * private requires a memory barrier after we set the current
> + * task.
> + */
> + smp_mb();
> +}
And because of what it complements, I would have expected the callsite:
> @@ -2737,6 +2763,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
>
> mm = next->mm;
> oldmm = prev->active_mm;
> + membarrier_expedited_mb_after_set_current(mm, oldmm);
> /*
> * For paravirt, this is coupled with an exit in switch_to to
> * combine the page table reload and the switch backend into
to be in finish_task_switch(), something like:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e9785f7aed75..33f34a201255 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct *prev)
finish_arch_post_lock_switch();
fire_sched_in_preempt_notifiers(current);
+
+ /*
+ * For CONFIG_MEMBARRIER we need a full memory barrier after the
+ * rq->curr assignment. Not all architectures have one in either
+ * switch_to() or switch_mm() so we use (and complement) the one
+ * implied by mmdrop()'s atomic_dec_and_test().
+ */
if (mm)
mmdrop(mm);
+ else if (IS_ENABLED(CONFIG_MEMBARRIER))
+ smp_mb();
+
if (unlikely(prev_state == TASK_DEAD)) {
if (prev->sched_class->task_dead)
prev->sched_class->task_dead(prev);
I realize this is sub-optimal if we're switching to a kernel thread, so
it might want some work; then again, a whole bunch of architectures
don't in fact need this extra barrier at all.
> +static void membarrier_private_expedited(void)
> +{
> + int cpu, this_cpu;
> + bool fallback = false;
> + cpumask_var_t tmpmask;
> +
> + if (num_online_cpus() == 1)
> + return;
> +
> + /*
> + * Matches memory barriers around rq->curr modification in
> + * scheduler.
> + */
> + smp_mb(); /* system call entry is not a mb. */
> +
Weren't you going to put in a comment on that GFP_NOWAIT thing?
> + if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
You really want: zalloc_cpumask_var().
> + /* Fallback for OOM. */
> + fallback = true;
> + }
> +
> + /*
> + * Skipping the current CPU is OK even though we can be
> + * migrated at any point. The current CPU, at the point where we
> + * read raw_smp_processor_id(), is ensured to be in program
> + * order with respect to the caller thread. Therefore, we can
> + * skip this CPU from the iteration.
> + */
> + this_cpu = raw_smp_processor_id();
So if instead you do the below, that is still true, but you have the
opportunity to skip more CPUs; then again, if you migrate the wrong way
you'll end up not skipping yourself.. ah well.
> + cpus_read_lock();
> + for_each_online_cpu(cpu) {
> + struct task_struct *p;
> +
if (cpu == raw_smp_processor_id())
continue;
> + rcu_read_lock();
> + p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> + if (p && p->mm == current->mm) {
> + if (!fallback)
> + __cpumask_set_cpu(cpu, tmpmask);
> + else
> + smp_call_function_single(cpu, ipi_mb, NULL, 1);
> + }
> + rcu_read_unlock();
> + }
> + cpus_read_unlock();
This ^ wants to go after that v
> + if (!fallback) {
> + smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
> + free_cpumask_var(tmpmask);
> + }
Because otherwise the bits in your tmpmask might no longer match the
online state.
> +
> + /*
> + * Memory barrier on the caller thread _after_ we finished
> + * waiting for the last IPI. Matches memory barriers around
> + * rq->curr modification in scheduler.
> + */
> + smp_mb(); /* exit from system call is not a mb */
> +}