Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)

From: Mathieu Desnoyers
Date: Mon Mar 16 2015 - 11:58:11 EST


----- Original Message -----
> From: "Mathieu Desnoyers" <mathieu.desnoyers@xxxxxxxxxxxx>
> To: "Peter Zijlstra" <peterz@xxxxxxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx, "KOSAKI Motohiro" <kosaki.motohiro@xxxxxxxxxxxxxx>, "Steven Rostedt"
> <rostedt@xxxxxxxxxxx>, "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>, "Nicholas Miell" <nmiell@xxxxxxxxxxx>,
> "Linus Torvalds" <torvalds@xxxxxxxxxxxxxxxxxxxx>, "Ingo Molnar" <mingo@xxxxxxxxxx>, "Alan Cox"
> <gnomes@xxxxxxxxxxxxxxxxxxx>, "Lai Jiangshan" <laijs@xxxxxxxxxxxxxx>, "Stephen Hemminger"
> <stephen@xxxxxxxxxxxxxxxxxx>, "Andrew Morton" <akpm@xxxxxxxxxxxxxxxxxxxx>, "Josh Triplett" <josh@xxxxxxxxxxxxxxxx>,
> "Thomas Gleixner" <tglx@xxxxxxxxxxxxx>, "David Howells" <dhowells@xxxxxxxxxx>, "Nick Piggin" <npiggin@xxxxxxxxx>
> Sent: Monday, March 16, 2015 11:43:56 AM
> Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
>
> ----- Original Message -----
> > From: "Peter Zijlstra" <peterz@xxxxxxxxxxxxx>
> > To: "Mathieu Desnoyers" <mathieu.desnoyers@xxxxxxxxxxxx>
> > Cc: linux-kernel@xxxxxxxxxxxxxxx, "KOSAKI Motohiro"
> > <kosaki.motohiro@xxxxxxxxxxxxxx>, "Steven Rostedt"
> > <rostedt@xxxxxxxxxxx>, "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>,
> > "Nicholas Miell" <nmiell@xxxxxxxxxxx>,
> > "Linus Torvalds" <torvalds@xxxxxxxxxxxxxxxxxxxx>, "Ingo Molnar"
> > <mingo@xxxxxxxxxx>, "Alan Cox"
> > <gnomes@xxxxxxxxxxxxxxxxxxx>, "Lai Jiangshan" <laijs@xxxxxxxxxxxxxx>,
> > "Stephen Hemminger"
> > <stephen@xxxxxxxxxxxxxxxxxx>, "Andrew Morton" <akpm@xxxxxxxxxxxxxxxxxxxx>,
> > "Josh Triplett" <josh@xxxxxxxxxxxxxxxx>,
> > "Thomas Gleixner" <tglx@xxxxxxxxxxxxx>, "David Howells"
> > <dhowells@xxxxxxxxxx>, "Nick Piggin" <npiggin@xxxxxxxxx>
> > Sent: Monday, March 16, 2015 10:19:39 AM
> > Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory
> > barrier (x86) (v12)
> >
> > On Sun, Mar 15, 2015 at 03:24:19PM -0400, Mathieu Desnoyers wrote:
> >
> > TL;DR
> >
> > > --- a/arch/x86/include/asm/mmu_context.h
> > > +++ b/arch/x86/include/asm/mmu_context.h
> > > @@ -45,6 +45,16 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > > #endif
> > > cpumask_set_cpu(cpu, mm_cpumask(next));
> > >
> > > + /*
> > > + * smp_mb() between mm_cpumask set and following memory
> > > + * accesses to user-space addresses is required by
> > > + * sys_membarrier(). A smp_mb() is also needed between
> > > + * prior memory accesses and mm_cpumask clear. This
> > > + * ensures that all user-space address memory accesses
> > > + * performed by the current thread are in program order
> > > + * when the mm_cpumask is set. Implied by load_cr3.
> > > + */
> > > +
> > > /* Re-load page tables */
> > > load_cr3(next->pgd);
> > > trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> > > @@ -82,6 +92,13 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > > * We were in lazy tlb mode and leave_mm disabled
> > > * tlb flush IPI delivery. We must reload CR3
> > > * to make sure to use no freed page tables.
> > > + *
> > > + * smp_mb() between mm_cpumask set and memory accesses
> > > + * to user-space addresses is required by
> > > + * sys_membarrier(). This ensures that all user-space
> > > + * address memory accesses performed by the current
> > > + * thread are in program order when the mm_cpumask is
> > > + * set. Implied by load_cr3.
> > > */
> > > load_cr3(next->pgd);
> > > trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> >
> >
> > In both cases the cpumask_set_cpu() will also imply a MB.
>
> I'm probably missing what exactly in cpumask_set_cpu()
> implies this guarantee. cpumask_set_cpu() uses set_bit().
> On x86, set_bit is indeed implemented with a lock-prefixed
> orb or bts. However, the comment above set_bit() states:
>
> * Note: there are no guarantees that this function will not be reordered
> * on non x86 architectures, so if you are writing portable code,
> * make sure not to rely on its reordering guarantees.
>
> And it says nothing about memory barriers. Typically, the
> atomic ops that imply memory barriers are those that return
> a value (xchg, cmpxchg, atomic_add_return); ops like
> atomic_add do not imply barriers.
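>
> For illustration, here is a minimal sketch of what making that ordering
> explicit would look like (my assumption, not what the patch does; the
> patch relies on load_cr3 implying the barrier on x86), using the generic
> smp_mb__after_atomic() helper:
>
> 	cpumask_set_cpu(cpu, mm_cpumask(next));
> 	/*
> 	 * Order the mm_cpumask update against following user-space
> 	 * memory accesses, as required by sys_membarrier(). On x86
> 	 * the lock-prefixed set_bit() and load_cr3() already provide
> 	 * this ordering; the explicit barrier would document the
> 	 * requirement on architectures where set_bit() implies no
> 	 * ordering at all.
> 	 */
> 	smp_mb__after_atomic();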
>
> >
> > > +enum {
> > > + /*
> > > + * Private flag set: only synchronize across a single process. If this
> > > + * flag is not set, it means "shared": synchronize across multiple
> > > + * processes. The shared mode is useful for shared memory mappings
> > > + * across processes.
> > > + */
> > > + MEMBARRIER_PRIVATE_FLAG = (1 << 0),
> > > +
> > > + /*
> > > + * Expedited flag set: adds some overhead, fast execution (few
> > > + * microseconds). If this flag is not set, it means "delayed": low
> > > + * overhead, but slow execution (few milliseconds).
> > > + */
> > > + MEMBARRIER_EXPEDITED_FLAG = (1 << 1),
> >
> >
> > I suppose this is an unprivileged syscall; so what do we do about:
> >
> > for (;;)
> > sys_membar(EXPEDITED);
> >
> > Which would spray the entire system with IPIs at breakneck speed.
>
> Currently, combining EXPEDITED with non-PRIVATE returns -EINVAL.
> Therefore, if someone cares about issuing barriers on the entire
> system, the only option is to use non-EXPEDITED, which relies on
> synchronize_rcu().

Sorry, I meant "synchronize_sched()".

Thanks,

Mathieu

>
> The only way to invoke expedited barriers in a loop is:
>
> 	for (;;)
> 		sys_membarrier(MEMBARRIER_EXPEDITED_FLAG | MEMBARRIER_PRIVATE_FLAG);
>
> Which will only send IPIs to the CPUs currently running threads of the
> same process.
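>
> Putting the two rules together, the flag validation at syscall entry
> amounts to something like the following sketch (illustrative only; the
> exact check in the patch may be written differently):
>
> 	/* Expedited mode is only allowed for process-private barriers. */
> 	if ((flags & MEMBARRIER_EXPEDITED_FLAG)
> 			&& !(flags & MEMBARRIER_PRIVATE_FLAG))
> 		return -EINVAL;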
>
> >
> > > +static void membarrier_ipi(void *unused)
> > > +{
> > > + /* Order memory accesses with respect to the sys_membarrier() caller. */
> > > + smp_mb();
> > > +}
> > > +
> > > +/*
> > > + * Handle out-of-memory by sending per-cpu IPIs instead.
> > > + */
> > > +static void membarrier_fallback(void)
> > > +{
> > > + struct mm_struct *mm;
> > > + int cpu;
> > > +
> > > + for_each_cpu(cpu, mm_cpumask(current->mm)) {
> > > + raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> > > + mm = cpu_curr(cpu)->mm;
> > > + raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> > > + if (current->mm == mm)
> > > + smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> > > + }
> > > +}
> >
> > > +static void membarrier_expedited(void)
> > > +{
> > > + struct mm_struct *mm;
> > > + cpumask_var_t tmpmask;
> > > + int cpu;
> > > +
> > > + /*
> > > + * Memory barrier on the caller thread between previous memory accesses
> > > + * to user-space addresses and sending memory-barrier IPIs. Orders all
> > > + * user-space address memory accesses prior to sys_membarrier() before
> > > + * mm_cpumask read and membarrier_ipi executions. This barrier is paired
> > > + * with memory barriers in:
> > > + * - membarrier_ipi() (for each running thread of the current process)
> > > + * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> > > + * accesses to user-space addresses)
> > > + * - Each CPU ->mm update performed with rq lock held by the scheduler.
> > > + * A memory barrier is issued each time ->mm is changed while the rq
> > > + * lock is held.
> > > + */
> > > + smp_mb();
> > > + if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> > > + membarrier_fallback();
> > > + goto out;
> > > + }
> > > + cpumask_copy(tmpmask, mm_cpumask(current->mm));
> > > + preempt_disable();
> > > + cpumask_clear_cpu(smp_processor_id(), tmpmask);
> > > + for_each_cpu(cpu, tmpmask) {
> > > + raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> > > + mm = cpu_curr(cpu)->mm;
> > > + raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> > > + if (current->mm != mm)
> > > + cpumask_clear_cpu(cpu, tmpmask);
> > > + }
> > > + smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
> > > + preempt_enable();
> > > + free_cpumask_var(tmpmask);
> > > +out:
> > > + /*
> > > + * Memory barrier on the caller thread between sending & waiting for
> > > + * memory-barrier IPIs and following memory accesses to user-space
> > > + * addresses. Orders mm_cpumask read and membarrier_ipi executions
> > > + * before all user-space address memory accesses following
> > > + * sys_membarrier(). This barrier is paired with memory barriers in:
> > > + * - membarrier_ipi() (for each running thread of the current process)
> > > + * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> > > + * accesses to user-space addresses)
> > > + * - Each CPU ->mm update performed with rq lock held by the scheduler.
> > > + * A memory barrier is issued each time ->mm is changed while the rq
> > > + * lock is held.
> > > + */
> > > + smp_mb();
> > > +}
> >
> > Did you just write:
> >
> > bool membar_cpu_is_mm(int cpu, void *info)
> > {
> > struct mm_struct *mm = info;
> > struct rq *rq = cpu_rq(cpu);
> > bool ret;
> >
> > raw_spin_lock_irq(&rq->lock);
> > ret = rq->curr->mm == mm;
> > raw_spin_unlock_irq(&rq->lock);
> >
> > return ret;
> > }
> >
> > on_each_cpu_cond(membar_cpu_is_mm, membar_ipi, mm, 1, GFP_NOWAIT);
> >
>
> It is very similar indeed! The main difference is that my implementation
> starts from a copy of mm_cpumask(current->mm) and clears the CPUs for
> which only a TLB shootdown is pending (confirmed by taking the rq lock
> and checking cpu_curr(cpu)->mm against current->mm).
>
> Now that you mention this, I think we don't really need to use
> mm_cpumask(current->mm) at all. Just iterating over each CPU, taking
> the rq lock, and comparing the mm should be enough. This would
> remove the need to rely on extra memory barriers around the
> set/clear of the mm cpumask.
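>
> A sketch of that simplification, reusing your membar_cpu_is_mm()
> predicate above (untested, and assuming on_each_cpu_cond() keeps its
> current five-argument signature):
>
> 	static void membarrier_expedited(void)
> 	{
> 		/* Order prior user-space accesses before the IPIs. */
> 		smp_mb();
> 		on_each_cpu_cond(membar_cpu_is_mm, membarrier_ipi,
> 				 current->mm, 1, GFP_NOWAIT);
> 		/* Order the IPIs before following user-space accesses. */
> 		smp_mb();
> 	}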
>
> The main reason why I did not use on_each_cpu_cond() was that it did
> not exist back in 2010. ;-)
>
> > On which; I absolutely hate that rq->lock thing in there. What is
> > 'wrong' with doing a lockless compare there? Other than not actually
> > being able to deref rq->curr of course, but we need to fix that anyhow.
>
> If we can make sure the rq->curr dereference can be done without holding
> the rq lock, then I think all we would need is to ensure that updates to
> rq->curr are surrounded by memory barriers. Therefore, we would have the
> following:
>
> * When a thread is scheduled out, a memory barrier would be issued before
> rq->curr is updated to the next thread task_struct.
>
> * Before a thread is scheduled in, a memory barrier needs to be issued
> after rq->curr is updated to the incoming thread.
>
> In order to be able to dereference rq->curr->mm without holding the
> rq->lock, do you envision that we should protect task reclaim with
> RCU-sched?
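>
> As a rough sketch of where that would lead (purely illustrative; it
> assumes rq->curr updates become fenced as described above and that
> task_struct reclaim is deferred through RCU-sched, neither of which
> is the case today):
>
> 	static bool membar_cpu_is_mm(int cpu, void *info)
> 	{
> 		struct mm_struct *mm = info;
> 		struct task_struct *p;
> 		bool ret;
>
> 		/*
> 		 * The RCU-sched read-side section keeps the remote
> 		 * task_struct from being freed while we peek at ->mm.
> 		 */
> 		rcu_read_lock_sched();
> 		p = ACCESS_ONCE(cpu_rq(cpu)->curr);
> 		ret = p && p->mm == mm;
> 		rcu_read_unlock_sched();
> 		return ret;
> 	}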
>
> Thanks!
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
>

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com