Re: [PATCH] introduce sys_membarrier(): process-wide memory barrier (v10)

From: Mathieu Desnoyers
Date: Mon Apr 05 2010 - 15:11:14 EST


* Randy Dunlap (randy.dunlap@xxxxxxxxxx) wrote:
> On Mon, 5 Apr 2010 13:57:37 -0400 Mathieu Desnoyers wrote:
>
> > [ The last RFC submission got no reply, so I am submitting for real now, hoping
> > to hear something else than the sound of "crickets" (yeah, it's unusually warm in
> > Montreal for this time of the year). The only objections that remain are style
> > issues pointed out by Ingo. These are answered below. ]
> >
>
> more chirping.

:)

>
> ...
>
> > This patch applies on top of -tip based on 2.6.34-rc3.
> >
> > Here is an implementation of a new system call, sys_membarrier(), which
> > executes a memory barrier on all threads of the current process. It can be used
> > to distribute the cost of user-space memory barriers asymmetrically by
> > transforming pairs of memory barriers into pairs consisting of sys_membarrier()
> > and a compiler barrier. For synchronization primitives that distinguish between
> > read-side and write-side (e.g. userspace RCU, rwlocks), the read-side can be
> > accelerated significantly by moving the bulk of the memory barrier overhead to
> > the write-side.
> >
> > The first user of this system call is the "liburcu" Userspace RCU implementation
> > found at http://lttng.org/urcu. It aims at greatly simplifying and enhancing the
> > current implementation, which uses a scheme similar to the sys_membarrier(), but
> > based on signals sent to each reader thread.
>
> and what uses liburcu?

(adding to the next version changelog)

Liburcu is currently packaged in Debian, Ubuntu, and Gentoo, and it is also being
packaged for Fedora. It is already used by a few programs and libraries, and
given its wide availability, we can expect more in the near future.

The first user of this library is the UST (Userspace Tracing) library, a port of
LTTng to user space.

http://lttng.org/ust

Modulo a few changes needed to port it to user space, the kernel and user-space
LTTng tracers can be expected to have similar performance, because they use
essentially the same lockless buffering scheme, described in chapter 5 of my
thesis:

http://www.lttng.org/pub/thesis/desnoyers-dissertation-2009-12.pdf

Here is the impact of two additional memory barriers on the LTTng tracer fast
path:

Intel Core Xeon 2.0 GHz
LTTng probe writing 16 bytes of data to the trace (+ 4-byte event header)
(execution of 200000 loops, therefore trace buffers are cache-hot)

119 ns per event

adding 2 memory barriers, one before and one after the tracepoint:

155 ns per event

So the two added barriers cause a roughly 30% slowdown (119 ns -> 155 ns) on the
tracer fast path, which is quite significant when tracing heavy workloads. The
slowdown ratio may change slightly for non-cache-hot scenarios, but I expect it
to stay in the same range. Section 8.4 of my thesis discusses the overhead of
cache-cold buffers (around 333 ns per event rather than 119 ns); I expect the
cost of the memory barriers to increase as well in a cache-cold scenario.
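
To make the trade-off concrete, here is a minimal user-space sketch (not
liburcu's actual code) of the transformation described in the changelog above:
the reader fast path keeps only a compiler barrier, while the writer calls
sys_membarrier() to get the equivalent ordering on all running threads of the
process. The syscall number is the x86_64 one assigned by this patch; the flag
value is a placeholder for illustration.

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_membarrier
#define __NR_membarrier		300	/* x86_64 number assigned by this patch */
#endif
#ifndef MEMBARRIER_EXPEDITED
#define MEMBARRIER_EXPEDITED	(1 << 0)	/* placeholder value, illustration only */
#endif

static volatile int reader_active;	/* simplified stand-in for urcu reader state */

/* Reader fast path: the pairing smp_mb() becomes a compiler barrier. */
static inline void reader_enter(void)
{
        reader_active = 1;
        __asm__ __volatile__("" : : : "memory");	/* compiler barrier only */
}

/* Writer slow path: pay for the memory barrier here, once, for all readers. */
static void writer_sync(void)
{
        (void) syscall(__NR_membarrier, MEMBARRIER_EXPEDITED);
        /* ... now read reader_active and wait for readers to drain ... */
}

The point of the sketch is only the pairing: the reader-side barrier costs
nothing at run time, and the expedited sys_membarrier() on the writer side is
what makes that legitimate.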

>
> > This patch mostly sits in kernel/sched.c (it needs to access struct rq). It is
> > based on kernel v2.6.34-rc3, and also applies fine to tip/master . I am
> > proposing it for merge for 2.6.35. I think the -tip tree would be the right one
> > to pick up this patch, as it touches sched.c.
> >
> ...
>
> >
> > This patch only adds the system calls to x86 32/64. See the sys_membarrier()
> > comments for memory barriers requirement in switch_mm() to port to other
> > architectures.
> >
> > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> > Acked-by: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
> > Acked-by: Steven Rostedt <rostedt@xxxxxxxxxxx>
> > Acked-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> > CC: Nicholas Miell <nmiell@xxxxxxxxxxx>
> > CC: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> > CC: mingo@xxxxxxx
> > CC: laijs@xxxxxxxxxxxxxx
> > CC: dipankar@xxxxxxxxxx
> > CC: akpm@xxxxxxxxxxxxxxxxxxxx
> > CC: josh@xxxxxxxxxxxxxxxx
> > CC: dvhltc@xxxxxxxxxx
> > CC: niv@xxxxxxxxxx
> > CC: tglx@xxxxxxxxxxxxx
> > CC: peterz@xxxxxxxxxxxxx
> > CC: Valdis.Kletnieks@xxxxxx
> > CC: dhowells@xxxxxxxxxx
> > CC: Nick Piggin <npiggin@xxxxxxx>
> > CC: Chris Friesen <cfriesen@xxxxxxxxxx>
> > ---
> > arch/x86/ia32/ia32entry.S | 1
> > arch/x86/include/asm/mmu_context.h | 28 ++++-
> > arch/x86/include/asm/unistd_32.h | 3
> > arch/x86/include/asm/unistd_64.h | 2
> > arch/x86/kernel/syscall_table_32.S | 1
> > include/linux/Kbuild | 1
> > include/linux/membarrier.h | 47 ++++++++
> > kernel/sched.c | 194 +++++++++++++++++++++++++++++++++++++
> > 8 files changed, 274 insertions(+), 3 deletions(-)
> >
> > Index: linux.trees.git/arch/x86/include/asm/unistd_64.h
> > ===================================================================
> > --- linux.trees.git.orig/arch/x86/include/asm/unistd_64.h 2010-03-30 09:48:03.000000000 -0400
> > +++ linux.trees.git/arch/x86/include/asm/unistd_64.h 2010-04-05 13:46:16.000000000 -0400
> > @@ -663,6 +663,8 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt
> > __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
> > #define __NR_recvmmsg 299
> > __SYSCALL(__NR_recvmmsg, sys_recvmmsg)
> > +#define __NR_membarrier 300
> > +__SYSCALL(__NR_membarrier, sys_membarrier)
> >
> > #ifndef __NO_STUBS
> > #define __ARCH_WANT_OLD_READDIR
> > Index: linux.trees.git/kernel/sched.c
> > ===================================================================
> > --- linux.trees.git.orig/kernel/sched.c 2010-04-05 13:45:52.000000000 -0400
> > +++ linux.trees.git/kernel/sched.c 2010-04-05 13:53:49.000000000 -0400
> > @@ -71,6 +71,7 @@
> > #include <linux/debugfs.h>
> > #include <linux/ctype.h>
> > #include <linux/ftrace.h>
> > +#include <linux/membarrier.h>
> >
> > #include <asm/tlb.h>
> > #include <asm/irq_regs.h>
> > @@ -8951,6 +8952,199 @@ struct cgroup_subsys cpuacct_subsys = {
> > };
> > #endif /* CONFIG_CGROUP_CPUACCT */
> >
> > +#ifdef CONFIG_SMP
> > +
> > +/*
> > + * Execute a memory barrier on all active threads from the current process
> > + * on SMP systems. Do not rely on implicit barriers in IPI handler execution,
> > + * because batched IPI lists are synchronized with spinlocks rather than full
> > + * memory barriers. This is not the bulk of the overhead anyway, so let's stay
> > + * on the safe side.
> > + */
> > +static void membarrier_ipi(void *unused)
> > +{
> > + smp_mb();
> > +}
> > +
> > +/*
> > + * Handle out-of-mem by sending per-cpu IPIs instead.
> > + */
> > +static void membarrier_retry(void)
> > +{
> > + struct mm_struct *mm;
> > + int cpu;
> > +
> > + for_each_cpu(cpu, mm_cpumask(current->mm)) {
> > + raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> > + mm = cpu_curr(cpu)->mm;
> > + raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> > + if (current->mm == mm)
> > + smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> > + }
> > +}
> > +
> > +/*
> > + * sys_membarrier - issue memory barrier on current process running threads
> > + * @flags: One of these must be set:
> > + * MEMBARRIER_EXPEDITED
> > + * Adds some overhead, fast execution (few microseconds)
> > + * MEMBARRIER_DELAYED
> > + * Low overhead, but slow execution (few milliseconds)
> > + *
> > + * MEMBARRIER_QUERY
> > + * This optional flag can be set to query if the kernel supports
> > + * a set of flags.
> > + *
> > + * return values: Returns -EINVAL if the flags are incorrect. Testing for kernel
> > + * sys_membarrier support can be done by checking for -ENOSYS return value.
> > + * Return values >= 0 indicate success. For a given set of flags on a given
> > + * kernel, this system call will always return the same value. It is therefore
> > + * correct to check the return value only once at library load, passing the
>
> library load assumes caller is a library? does the kernel care about that?

Nope, it doesn't. Will rephrase:

... It is therefore
* correct to check the return value only once during a process lifetime,
* passing the MEMBARRIER_QUERY flag in addition, to check only whether the
* flags are supported, without performing any synchronization.

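As a usage illustration of that comment, a library would typically perform the
query once at initialization and cache the result. This is only a sketch: the
flag bit values below are placeholders, and only the flag names and the x86_64
syscall number come from this patch.

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_membarrier
#define __NR_membarrier		300		/* x86_64 number from this patch */
#endif
#ifndef MEMBARRIER_EXPEDITED
#define MEMBARRIER_EXPEDITED	(1 << 0)	/* mandatory flag, placeholder value */
#define MEMBARRIER_QUERY	(1 << 16)	/* optional flag, placeholder value */
#endif

static int have_expedited_membarrier;

static void membarrier_init(void)
{
        /*
         * Query once per process lifetime: a return value >= 0 means the
         * flag combination is supported; -1 with errno set to ENOSYS means
         * the kernel has no sys_membarrier, EINVAL means it does not know
         * these mandatory flags.  No synchronization is performed here.
         */
        have_expedited_membarrier =
                (syscall(__NR_membarrier,
                         MEMBARRIER_EXPEDITED | MEMBARRIER_QUERY) >= 0);
}

Later calls can then pass MEMBARRIER_EXPEDITED alone, or fall back to issuing a
memory barrier on each reader thread when the cached result says the flag is
unsupported.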

>
>
> > + * MEMBARRIER_QUERY flag in addition to only check if the flags are supported,
> > + * without performing any synchronization.
> > + *
> > + * This system call executes a memory barrier on all running threads of the
> > + * current process. Upon completion, the caller thread is ensured that all
> > + * process threads have passed through a state where all memory accesses to
> > + * user-space addresses match program order. (non-running threads are de facto
> > + * in such a state)
>
> state.)
>

ok

> > + *
> > + * Using the non-expedited mode is recommended for applications which can
> > + * afford leaving the caller thread waiting for a few milliseconds. A good
> > + * example would be a thread dedicated to execute RCU callbacks, which waits
> > + * for callbacks to enqueue most of the time anyway.
> > + *
> > + * The expedited mode is recommended whenever the application needs to have
> > + * control returning to the caller thread as quickly as possible. An example
> > + * of such application would be one which uses the same thread to perform
> > + * data structure updates and issue the RCU synchronization.
> > + *
> > + * It is perfectly safe to call both expedited and non-expedited
> > + * sys_membarrier() in a process.
> > + *
> > + * mm_cpumask is used as an approximation of the processors which run threads
> > + * belonging to the current process. It is a superset of the cpumask to which we
> > + * must send IPIs, mainly due to lazy TLB shootdown. Therefore, for each CPU in
> > + * the mm_cpumask, we check each runqueue with the rq lock held to make sure our
> > + * ->mm is indeed running on them. The rq lock ensures that a memory barrier is
> > + * issued each time the rq current task is changed. This reduces the risk of
> > + * disturbing a RT task by sending unnecessary IPIs. There is still a slight
> > + * chance to disturb an unrelated task, because we do not lock the runqueues
> > + * while sending IPIs, but the real-time effect of this heavy locking would be
> > + * worse than the comparatively small disruption of an IPI.
> > + *
> > + * RED PEN: before assinging a system call number for sys_membarrier() to an
>
> assigning

ok

>
> > + * architecture, we must ensure that switch_mm issues full memory barriers
> > + * (or a synchronizing instruction having the same effect) between:
> > + * - memory accesses to user-space addresses and clear mm_cpumask.
> > + * - set mm_cpumask and memory accesses to user-space addresses.
> > + *
> > + * The reason why these memory barriers are required is that mm_cpumask updates,
> > + * as well as iteration on the mm_cpumask, offer no ordering guarantees.
> > + * These added memory barriers ensure that any thread modifying the mm_cpumask
> > + * is in a state where all memory accesses to user-space addresses are
> > + * guaranteed to be in program order.
> > + *
> > + * In some case adding a comment to this effect will suffice, in others we
> > + * will need to add smp_mb__before_clear_bit()/smp_mb__after_clear_bit() or
> > + * simply smp_mb(). These barriers are required to ensure we do not _miss_ a
> > + * CPU that need to receive an IPI, which would be a bug.
> > + *
> > + * On uniprocessor systems, this system call simply returns 0 without doing
> > + * anything, so user-space knows it is implemented.
> > + *
> > + * The flags argument has room for extensibility, with 16 lower bits holding
> > + * mandatory flags for which older kernels will fail if they encounter an
> > + * unknown flag. The high 16 bits are used for optional flags, which older
> > + * kernels don't have to care about.
> > + *
> > + * This synchronization only takes care of threads using the current process
> > + * memory map. It should not be used to synchronize accesses performed on memory
> > + * maps shared between different processes.
> > + */
> > +SYSCALL_DEFINE1(membarrier, unsigned int, flags)
> > +{
> > + struct mm_struct *mm;
> > + cpumask_var_t tmpmask;
> > + int cpu;
> > +
> > + /*
> > + * Expect _only_ one of expedited or delayed flags.
> > + * Don't care about optional mask for now.
> > + */
> > + switch (flags & MEMBARRIER_MANDATORY_MASK) {
> > + case MEMBARRIER_EXPEDITED:
> > + case MEMBARRIER_DELAYED:
> > + break;
> > + default:
> > + return -EINVAL;
> > + }
> > + if (unlikely(flags & MEMBARRIER_QUERY
> > + || thread_group_empty(current))
> > + || num_online_cpus() == 1)
> > + return 0;
> > + if (flags & MEMBARRIER_DELAYED) {
> > + synchronize_sched();
> > + return 0;
> > + }
> > + /*
> > + * Memory barrier on the caller thread between previous memory accesses
> > + * to user-space addresses and sending memory-barrier IPIs. Orders all
> > + * user-space address memory accesses prior to sys_membarrier() before
> > + * mm_cpumask read and membarrier_ipi executions. This barrier is paired
> > + * with memory barriers in:
> > + * - membarrier_ipi() (for each running threads of the current process)
> > + * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> > + * accesses to user-space addresses)
> > + * - Each CPU ->mm update performed with rq lock held by the scheduler.
> > + * A memory barrier is issued each time ->mm is changed while the rq
> > + * lock is held.
> > + */
> > + smp_mb();
> > + if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> > + membarrier_retry();
> > + goto out;
> > + }
> > + cpumask_copy(tmpmask, mm_cpumask(current->mm));
> > + preempt_disable();
> > + cpumask_clear_cpu(smp_processor_id(), tmpmask);
> > + for_each_cpu(cpu, tmpmask) {
> > + raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> > + mm = cpu_curr(cpu)->mm;
> > + raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> > + if (current->mm != mm)
> > + cpumask_clear_cpu(cpu, tmpmask);
> > + }
> > + smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
> > + preempt_enable();
> > + free_cpumask_var(tmpmask);
> > +out:
> > + /*
> > + * Memory barrier on the caller thread between sending&waiting for
>
> sending & waiting
>

ok

> > + * memory-barrier IPIs and following memory accesses to user-space
> > + * addresses. Orders mm_cpumask read and membarrier_ipi executions
> > + * before all user-space address memory accesses following
> > + * sys_membarrier(). This barrier is paired with memory barriers in:
> > + * - membarrier_ipi() (for each running threads of the current process)
> > + * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> > + * accesses to user-space addresses)
> > + * - Each CPU ->mm update performed with rq lock held by the scheduler.
> > + * A memory barrier is issued each time ->mm is changed while the rq
> > + * lock is held.
> > + */
> > + smp_mb();
> > + return 0;
> > +}
> > +
> > +#else /* #ifdef CONFIG_SMP */
>
> I don't know that we have a known convention for that, but I would use:
>
> #else /* not CONFIG_SMP */
>
> or
>
> #else /* !CONFIG_SMP */
>
> > +
> > +SYSCALL_DEFINE1(membarrier, unsigned int, flags)
> > +{
> > + return 0;
> > +}
> > +
> > +#endif /* #else #ifdef CONFIG_SMP */
>
> and:
>
> #endif /* CONFIG_SMP */
>
> The "#else #ifdef" is both ugly and too wordy IMO.

OK. I merely followed the coding style of the code added to sched.c by Paul
McKenney. But you are right, the rest of sched.c does as you say:

#else /* !CONFIG_SMP */

#endif /* CONFIG_SMP */

Will fix.

Thanks !

Mathieu

>
> > +
> > #ifndef CONFIG_SMP
> >
> > int rcu_expedited_torture_stats(char *page)
>
>
> ---
> ~Randy
>

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com