Re: [PATCH 1/3 v19] sys_membarrier(): system-wide memory barrier (generic, x86)

From: Michael Kerrisk (man-pages)
Date: Fri Dec 11 2015 - 13:05:59 EST


Hi Mathieu,

On 12/05/2015 09:48 AM, Mathieu Desnoyers wrote:
> Hi Michael,
>
> Please find the membarrier man groff file attached. I re-integrated
> some changes that went in initially only in the changelog text version
> back onto this groff source.
>
> Please let me know if you find any issue with it.

Thanks for the page, but there are a few issues. Could you please
submit a new version as an inline patch, and see what can be
done w.r.t. the following points (see man-pages(7) for some
background on some of these points):

* Start DESCRIPTION off with a paragraph explaining what this system
call is about and why one would use it.

* Page needs VERSIONS, CONFORMING TO, and SEE ALSO sections.

* Is it possible to add a small EXAMPLE?

* In a NOTES section, it might be helpful to briefly explain the following
concepts: memory barrier and program order.
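
To illustrate what such a NOTES paragraph might convey, here is a minimal
sketch in C11 (not taken from the patch). The macros barrier() and smp_mb()
are user-space stand-ins that I am assuming for the kernel-style names used
in the page, and publish() and consume() are hypothetical helpers. "Program
order" is the order in which the accesses appear in the source; a compiler
barrier only stops the compiler from reordering them, while a memory barrier
also constrains the order in which other CPUs observe them.

```c
#include <stdatomic.h>

/* Compiler barrier (assumed stand-in for barrier()): the compiler may not
 * move memory accesses across it, but no CPU fence instruction is emitted. */
#define barrier()  atomic_signal_fence(memory_order_seq_cst)

/* Full memory barrier (assumed stand-in for smp_mb()): also orders the
 * accesses as observed by other CPUs. */
#define smp_mb()   atomic_thread_fence(memory_order_seq_cst)

static atomic_int payload;
static atomic_int ready;

/* Writer: in program order, payload is stored before ready. */
void publish(int value)
{
    atomic_store_explicit(&payload, value, memory_order_relaxed);
    smp_mb();   /* with only barrier() here, a weakly ordered CPU could
                   still make the two stores visible out of order */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Reader: returns 1 and fills *out once the payload has been published,
 * 0 otherwise. */
int consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return 0;
    smp_mb();   /* pairs with the barrier in publish() */
    *out = atomic_load_explicit(&payload, memory_order_relaxed);
    return 1;
}
```

A reader thread would poll consume() until it returns 1; the value it then
reads is the one passed to publish().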

Some comments on individual pieces below:

> .TH MEMBARRIER 2 2015-04-15 "Linux" "Linux Programmer's Manual"
> .SH NAME
> membarrier \- issue memory barriers on a set of threads
> .SH SYNOPSIS
> .B #include <linux/membarrier.h>
> .sp
> .BI "int membarrier(int " cmd ", int " flags ");
> .sp
> .SH DESCRIPTION
> The
> .I cmd
> argument is one of the following:
>
> .TP
> .B MEMBARRIER_CMD_QUERY
> Query the set of supported commands. It returns a bitmask of supported
> commands.

Not clear here. Does this mean that the 'cmd' argument is a bit mask,
rather than an enumeration? I think that needs to be spelled out.
Also, the text should mention that the returned bitmask excludes
MEMBARRIER_CMD_QUERY. (Why, actually?)
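
For what it's worth, this is how I would expect a caller to use the query,
going by the uapi header in the patch (each command is a single bit, and
MEMBARRIER_CMD_QUERY is 0). It is only a sketch: the helper names
do_membarrier(), mask_has_cmd() and membarrier_cmd_supported() are mine, and
I am assuming there is no C library wrapper, so the call goes through
syscall(2).

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

/* Hypothetical thin wrapper around the raw system call. */
int do_membarrier(int cmd, int flags)
{
    return (int)syscall(__NR_membarrier, cmd, flags);
}

/* Pure helper: does a mask returned by MEMBARRIER_CMD_QUERY include cmd?
 * A negative mask means the query itself failed. QUERY (value 0) never
 * appears in the mask, so it is reported as absent. */
int mask_has_cmd(int mask, int cmd)
{
    return mask >= 0 && cmd != 0 && (mask & cmd) == cmd;
}

/* Returns 1 if the running kernel reports support for cmd, else 0. */
int membarrier_cmd_supported(int cmd)
{
    int mask = do_membarrier(MEMBARRIER_CMD_QUERY, 0);

    return mask_has_cmd(mask, cmd);
}
```

An application would typically call
membarrier_cmd_supported(MEMBARRIER_CMD_SHARED) once at startup and pick its
synchronization scheme accordingly.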

> .TP
> .B MEMBARRIER_CMD_SHARED
> Execute a memory barrier on all threads running on the system.

All threads on the system?

> Upon
> return from system call, the caller thread is ensured that all running
> threads have passed through a state where all memory accesses to
> user-space addresses match program order between entry to and return
> from the system call (non-running threads are de facto in such a
> state). This covers threads from all processes running on the system.
> This command returns 0.
>
> .PP
> The
> .I flags
> argument is currently unused.
>
> .PP
> All memory accesses performed in program order from each targeted thread

What is a "targeted thread"? Some rewording is needed here.

> is guaranteed to be ordered with respect to sys_membarrier(). If we use
> the semantic "barrier()" to represent a compiler barrier forcing memory
> accesses to be performed in program order across the barrier, and
> smp_mb() to represent explicit memory barriers forcing full memory
> ordering across the barrier, we have the following ordering table for
> each pair of barrier(), sys_membarrier() and smp_mb():
>
> The pair ordering is detailed as (O: ordered, X: not ordered):
>
>                    barrier()   smp_mb()   sys_membarrier()
> barrier()              X          X              O
> smp_mb()               X          O              O
> sys_membarrier()       O          O              O
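
As I read the table, a pair is ordered when either side is sys_membarrier(),
or when both sides are smp_mb(). A small sketch that encodes the table as a
lookup may make that reading easier to check; the enum and function names
here are hypothetical and not part of the proposed API.

```c
#include <stdbool.h>

enum barrier_kind { BARRIER, SMP_MB, SYS_MEMBARRIER, NR_KINDS };

/* ordered[a][b] is true where the table shows "O" for the pair (a, b). */
static const bool ordered[NR_KINDS][NR_KINDS] = {
    /*                   barrier() smp_mb() sys_membarrier() */
    [BARRIER]        = { false,    false,   true },
    [SMP_MB]         = { false,    true,    true },
    [SYS_MEMBARRIER] = { true,     true,    true },
};

/* Returns true if the table marks the pair (a, b) as ordered. */
bool pair_is_ordered(enum barrier_kind a, enum barrier_kind b)
{
    return ordered[a][b];
}
```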
>
> .SH RETURN VALUE
> On success, these system calls return zero.

This sentence seems out of place. We have one system call.
And the different operations described above return
nonzero values on success.

> On error, \-1 is returned,
> and
> .I errno
> is set appropriately.
> For a given command, with flags argument set to 0, this system call is
> guaranteed to always return the same value until reboot.

I don't understand the intent of the last sentence. What idea are you
trying to convey?

> .SH ERRORS
> .TP
> .B ENOSYS
> System call is not implemented.
> .TP
> .B EINVAL
> Invalid arguments.

Would be clearer to say here: "cmd is invalid or flags is nonzero"
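
To show how I understand the return value and the two errors fitting
together, here is a rough sketch of a caller. It assumes the behavior in the
posted patch (a nonnegative return on success, which is 0 for
MEMBARRIER_CMD_SHARED and a bit mask for MEMBARRIER_CMD_QUERY, and EINVAL for
an unknown cmd or nonzero flags). The mb_status enum and the mb_classify()
and mb_call() helpers are hypothetical names.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

enum mb_status { MB_OK, MB_NO_SYSCALL, MB_BAD_ARGS, MB_OTHER_ERROR };

/* Map a raw return value and the errno that accompanied it to a status. */
enum mb_status mb_classify(long ret, int err)
{
    if (ret >= 0)
        return MB_OK;           /* 0 for SHARED, a bit mask for QUERY */
    if (err == ENOSYS)
        return MB_NO_SYSCALL;   /* kernel lacks the system call */
    if (err == EINVAL)
        return MB_BAD_ARGS;     /* unknown cmd, or flags is nonzero */
    return MB_OTHER_ERROR;
}

/* Issue the call; the raw return value is stored in *ret_out if non-NULL. */
enum mb_status mb_call(int cmd, int flags, long *ret_out)
{
    long ret;

    errno = 0;
    ret = syscall(__NR_membarrier, cmd, flags);
    if (ret_out)
        *ret_out = ret;
    return mb_classify(ret, errno);
}
```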

Thanks,

Michael


> ----- On Dec 4, 2015, at 4:44 PM, Michael Kerrisk mtk.manpages@xxxxxxxxx wrote:
>
>> Hi Mathieu,
>>
>> In the patch below you have a man page type of text. Is that
>> just plain text, or do you have some groff source somewhere?
>>
>> Thanks,
>>
>> Michael
>>
>>
>> On 07/10/2015 10:58 PM, Mathieu Desnoyers wrote:
>>> Here is an implementation of a new system call, sys_membarrier(), which
>>> executes a memory barrier on all threads running on the system. It is
>>> implemented by calling synchronize_sched(). It can be used to distribute
>>> the cost of user-space memory barriers asymmetrically by transforming
>>> pairs of memory barriers into pairs consisting of sys_membarrier() and a
>>> compiler barrier. For synchronization primitives that distinguish
>>> between read-side and write-side (e.g. userspace RCU [1], rwlocks), the
>>> read-side can be accelerated significantly by moving the bulk of the
>>> memory barrier overhead to the write-side.
>>>
>>> The existing applications of which I am aware that would be improved by this
>>> system call are as follows:
>>>
>>> * Through Userspace RCU library (http://urcu.so)
>>> - DNS server (Knot DNS) https://www.knot-dns.cz/
>>> - Network sniffer (http://netsniff-ng.org/)
>>> - Distributed object storage (https://sheepdog.github.io/sheepdog/)
>>> - User-space tracing (http://lttng.org)
>>> - Network storage system (https://www.gluster.org/)
>>> - Virtual routers
>>> (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
>>> - Financial software (https://lkml.org/lkml/2015/3/23/189)
>>>
>>> Those projects use RCU in userspace to increase read-side speed and
>>> scalability compared to locking. Especially in the case of RCU used
>>> by libraries, sys_membarrier can speed up the read-side by moving the
>>> bulk of the memory barrier cost to synchronize_rcu().
>>>
>>> * Direct users of sys_membarrier
>>> - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
>>>
>>> Microsoft core dotnet GC developers are planning to use the mprotect()
>>> side-effect of issuing memory barriers through IPIs as a way to implement
>>> Windows FlushProcessWriteBuffers() on Linux. They are referring to
>>> sys_membarrier in their github thread, specifically stating that
>>> sys_membarrier() is what they are looking for.
>>>
>>> This implementation is based on kernel v4.1-rc8.
>>>
>>> To explain the benefit of this scheme, let's introduce two example threads:
>>>
>>> Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
>>> Thread B (frequent, e.g. executing liburcu
>>> rcu_read_lock()/rcu_read_unlock())
>>>
>>> In a scheme where all smp_mb() in thread A are ordering memory accesses
>>> with respect to smp_mb() present in Thread B, we can change each
>>> smp_mb() within Thread A into calls to sys_membarrier() and each
>>> smp_mb() within Thread B into compiler barriers "barrier()".
>>>
>>> Before the change, we had, for each smp_mb() pairs:
>>>
>>> Thread A                    Thread B
>>> previous mem accesses       previous mem accesses
>>> smp_mb()                    smp_mb()
>>> following mem accesses      following mem accesses
>>>
>>> After the change, these pairs become:
>>>
>>> Thread A                    Thread B
>>> prev mem accesses           prev mem accesses
>>> sys_membarrier()            barrier()
>>> follow mem accesses         follow mem accesses
>>>
>>> As we can see, there are two possible scenarios: either Thread B memory
>>> accesses do not happen concurrently with Thread A accesses (1), or they
>>> do (2).
>>>
>>> 1) Non-concurrent Thread A vs Thread B accesses:
>>>
>>> Thread A                    Thread B
>>> prev mem accesses
>>> sys_membarrier()
>>> follow mem accesses
>>>                             prev mem accesses
>>>                             barrier()
>>>                             follow mem accesses
>>>
>>> In this case, thread B accesses will be weakly ordered. This is OK,
>>> because at that point, thread A is not particularly interested in
>>> ordering them with respect to its own accesses.
>>>
>>> 2) Concurrent Thread A vs Thread B accesses
>>>
>>> Thread A                    Thread B
>>> prev mem accesses           prev mem accesses
>>> sys_membarrier()            barrier()
>>> follow mem accesses         follow mem accesses
>>>
>>> In this case, thread B accesses, which are ensured to be in program
>>> order thanks to the compiler barrier, will be "upgraded" to full
>>> smp_mb() by synchronize_sched().
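
The scheme described above might look roughly like the following in user
space. This is a sketch under my own assumptions, not code from liburcu or
from the patch: the two flag variables and the reader_*/writer_* helpers are
hypothetical, atomic_signal_fence() stands in for barrier(), and the raw
syscall(2) is used because I assume no library wrapper exists.

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

static atomic_int reader_in_section;   /* written by Thread B */
static atomic_int writer_pending;      /* written by Thread A */

/* Thread B, frequent path: store, compiler barrier only, then load.
 * Returns 1 if the section was entered, 0 if a writer is pending. */
int reader_try_enter(void)
{
    atomic_store_explicit(&reader_in_section, 1, memory_order_relaxed);
    atomic_signal_fence(memory_order_seq_cst);      /* barrier() */
    if (atomic_load_explicit(&writer_pending, memory_order_relaxed)) {
        atomic_store_explicit(&reader_in_section, 0, memory_order_relaxed);
        return 0;
    }
    return 1;
}

void reader_exit(void)
{
    atomic_signal_fence(memory_order_seq_cst);      /* barrier() */
    atomic_store_explicit(&reader_in_section, 0, memory_order_relaxed);
}

/* Thread A, infrequent path: store, sys_membarrier(), then load.
 * Returns 1 if a reader is in its section, 0 if not, and -1 if the
 * system call failed, in which case the barrier()-only reader above is
 * not safe and both sides would need real memory barriers instead. */
int writer_begin(void)
{
    atomic_store_explicit(&writer_pending, 1, memory_order_relaxed);
    atomic_signal_fence(memory_order_seq_cst);
    if (syscall(__NR_membarrier, MEMBARRIER_CMD_SHARED, 0) != 0)
        return -1;
    atomic_signal_fence(memory_order_seq_cst);
    return atomic_load_explicit(&reader_in_section, memory_order_relaxed);
}

void writer_end(void)
{
    atomic_store_explicit(&writer_pending, 0, memory_order_relaxed);
}
```

The property being relied on is that the reader and the writer cannot both
miss each other's store: either reader_try_enter() sees writer_pending, or
writer_begin() sees reader_in_section.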
>>>
>>> * Benchmarks
>>>
>>> On Intel Xeon E5405 (8 cores)
>>> (one thread is calling sys_membarrier, the other 7 threads are busy
>>> looping)
>>>
>>> 1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
>>>
>>> * User-space user of this system call: Userspace RCU library
>>>
>>> Both the signal-based and the sys_membarrier userspace RCU schemes
>>> permit us to remove the memory barrier from the userspace RCU
>>> rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
>>> accelerating them. These memory barriers are replaced by compiler
>>> barriers on the read-side, and all matching memory barriers on the
>>> write-side are turned into an invocation of a memory barrier on all
>>> active threads in the process. By letting the kernel perform this
>>> synchronization rather than dumbly sending a signal to every process
>>> threads (as we currently do), we diminish the number of unnecessary wake
>>> ups and only issue the memory barriers on active threads. Non-running
>>> threads do not need to execute such barrier anyway, because these are
>>> implied by the scheduler context switches.
>>>
>>> Results in liburcu:
>>>
>>> Operations in 10s, 6 readers, 2 writers:
>>>
>>> memory barriers in reader:    1701557485 reads, 2202847 writes
>>> signal-based scheme:          9830061167 reads,    6700 writes
>>> sys_membarrier:               9952759104 reads,     425 writes
>>> sys_membarrier (dyn. check):  7970328887 reads,     425 writes
>>>
>>> The dynamic sys_membarrier availability check adds some overhead to
>>> the read-side compared to the signal-based scheme, but besides that,
>>> sys_membarrier slightly outperforms the signal-based scheme. However,
>>> this non-expedited sys_membarrier implementation has a much slower grace
>>> period than signal and memory barrier schemes.
>>>
>>> Besides diminishing the number of wake-ups, one major advantage of the
>>> membarrier system call over the signal-based scheme is that it does not
>>> need to reserve a signal. This plays much more nicely with libraries,
>>> and with processes injected into for tracing purposes, for which we
>>> cannot expect that signals will be unused by the application.
>>>
>>> An expedited version of this system call can be added later on to speed
>>> up the grace period. Its implementation will likely depend on reading
>>> the cpu_curr()->mm without holding each CPU's rq lock.
>>>
>>> This patch adds the system call to x86 and to asm-generic.
>>>
>>> [1] http://urcu.so
>>>
>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
>>> Reviewed-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
>>> Reviewed-by: Josh Triplett <josh@xxxxxxxxxxxxxxxx>
>>> CC: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
>>> CC: Steven Rostedt <rostedt@xxxxxxxxxxx>
>>> CC: Nicholas Miell <nmiell@xxxxxxxxxxx>
>>> CC: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>>> CC: Ingo Molnar <mingo@xxxxxxxxxx>
>>> CC: Alan Cox <gnomes@xxxxxxxxxxxxxxxxxxx>
>>> CC: Lai Jiangshan <laijs@xxxxxxxxxxxxxx>
>>> CC: Stephen Hemminger <stephen@xxxxxxxxxxxxxxxxxx>
>>> CC: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>>> CC: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>>> CC: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>>> CC: David Howells <dhowells@xxxxxxxxxx>
>>> CC: Pranith Kumar <bobby.prani@xxxxxxxxx>
>>> CC: Michael Kerrisk <mtk.manpages@xxxxxxxxx>
>>> CC: linux-api@xxxxxxxxxxxxxxx
>>>
>>> ---
>>>
>>> membarrier(2) man page:
>>> --------------- snip -------------------
>>> MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2)
>>>
>>> NAME
>>> membarrier - issue memory barriers on a set of threads
>>>
>>> SYNOPSIS
>>> #include <linux/membarrier.h>
>>>
>>> int membarrier(int cmd, int flags);
>>>
>>> DESCRIPTION
>>> The cmd argument is one of the following:
>>>
>>> MEMBARRIER_CMD_QUERY
>>> Query the set of supported commands. It returns a bitmask of
>>> supported commands.
>>>
>>> MEMBARRIER_CMD_SHARED
>>> Execute a memory barrier on all threads running on the system.
>>> Upon return from system call, the caller thread is ensured that
>>> all running threads have passed through a state where all memory
>>> accesses to user-space addresses match program order between
>>> entry to and return from the system call (non-running threads
>>> are de facto in such a state). This covers threads from all processes
>>> running on the system. This command returns 0.
>>>
>>> The flags argument needs to be 0. For future extensions.
>>>
>>> All memory accesses performed in program order from each targeted
>>> thread is guaranteed to be ordered with respect to sys_membarrier(). If
>>> we use the semantic "barrier()" to represent a compiler barrier forcing
>>> memory accesses to be performed in program order across the barrier,
>>> and smp_mb() to represent explicit memory barriers forcing full memory
>>> ordering across the barrier, we have the following ordering table for
>>> each pair of barrier(), sys_membarrier() and smp_mb():
>>>
>>> The pair ordering is detailed as (O: ordered, X: not ordered):
>>>
>>>                    barrier()   smp_mb()   sys_membarrier()
>>> barrier()              X          X              O
>>> smp_mb()               X          O              O
>>> sys_membarrier()       O          O              O
>>>
>>> RETURN VALUE
>>> On success, these system calls return zero. On error, -1 is returned,
>>> and errno is set appropriately. For a given command, with flags
>>> argument set to 0, this system call is guaranteed to always return the
>>> same value until reboot.
>>>
>>> ERRORS
>>> ENOSYS System call is not implemented.
>>>
>>> EINVAL Invalid arguments.
>>>
>>> Linux 2015-04-15 MEMBARRIER(2)
>>> --------------- snip -------------------
>>>
>>> Changes since v18:
>>> - Add unlikely() check to flags,
>>> - Describe current users in changelog.
>>>
>>> Changes since v17:
>>> - Update commit message.
>>>
>>> Changes since v16:
>>> - Update documentation.
>>> - Add man page to changelog.
>>> - Build sys_membarrier on !CONFIG_SMP. It allows userspace applications
>>> to not care about the number of processors on the system. Based on
>>> recommendations from Stephen Hemminger and Steven Rostedt.
>>> - Check that flags argument is 0, update documentation to require it.
>>>
>>> Changes since v15:
>>> - Add flags argument in addition to cmd.
>>> - Update documentation.
>>>
>>> Changes since v14:
>>> - Take care of Thomas Gleixner's comments.
>>>
>>> Changes since v13:
>>> - Move to kernel/membarrier.c.
>>> - Remove MEMBARRIER_PRIVATE flag.
>>> - Add MAINTAINERS file entry.
>>>
>>> Changes since v12:
>>> - Remove _FLAG suffix from uapi flags.
>>> - Add Expert menuconfig option CONFIG_MEMBARRIER (default=y).
>>> - Remove EXPEDITED mode. Only implement non-expedited for now, until
>>> reading the cpu_curr()->mm can be done without holding the CPU's rq
>>> lock.
>>>
>>> Changes since v11:
>>> - 5 years have passed.
>>> - Rebase on v3.19 kernel.
>>> - Add futex-alike PRIVATE vs SHARED semantic: private for per-process
>>> barriers, non-private for memory mappings shared between processes.
>>> - Simplify user API.
>>> - Code refactoring.
>>>
>>> Changes since v10:
>>> - Apply Randy's comments.
>>> - Rebase on 2.6.34-rc4 -tip.
>>>
>>> Changes since v9:
>>> - Clean up #ifdef CONFIG_SMP.
>>>
>>> Changes since v8:
>>> - Go back to rq spin locks taken by sys_membarrier() rather than adding
>>> memory barriers to the scheduler. It implies a potential RoS
>>> (reduction of service) if sys_membarrier() is executed in a busy-loop
>>> by a user, but nothing more than what is already possible with other
>>> existing system calls, but saves memory barriers in the scheduler fast
>>> path.
>>> - re-add the memory barrier comments to x86 switch_mm() as an example to
>>> other architectures.
>>> - Update documentation of the memory barriers in sys_membarrier and
>>> switch_mm().
>>> - Append execution scenarios to the changelog showing the purpose of
>>> each memory barrier.
>>>
>>> Changes since v7:
>>> - Move spinlock-mb and scheduler related changes to separate patches.
>>> - Add support for sys_membarrier on x86_32.
>>> - Only x86 32/64 system calls are reserved in this patch. It is planned
>>> to incrementally reserve syscall IDs on other architectures as these
>>> are tested.
>>>
>>> Changes since v6:
>>> - Remove some unlikely() not so unlikely.
>>> - Add the proper scheduler memory barriers needed to only use the RCU
>>> read lock in sys_membarrier rather than take each runqueue spinlock:
>>> - Move memory barriers from per-architecture switch_mm() to schedule()
>>> and finish_lock_switch(), where they clearly document that all data
>>> protected by the rq lock is guaranteed to have memory barriers issued
>>> between the scheduler update and the task execution. Replacing the
>>> spin lock acquire/release barriers with these memory barriers imply
>>> either no overhead (x86 spinlock atomic instruction already implies a
>>> full mb) or some hopefully small overhead caused by the upgrade of the
>>> spinlock acquire/release barriers to more heavyweight smp_mb().
>>> - The "generic" version of spinlock-mb.h declares both a mapping to
>>> standard spinlocks and full memory barriers. Each architecture can
>>> specialize this header following their own need and declare
>>> CONFIG_HAVE_SPINLOCK_MB to use their own spinlock-mb.h.
>>> - Note: benchmarks of scheduler overhead with specialized spinlock-mb.h
>>> implementations on a wide range of architecture would be welcome.
>>>
>>> Changes since v5:
>>> - Plan ahead for extensibility by introducing mandatory/optional masks
>>> to the "flags" system call parameter. Past experience with accept4(),
>>> signalfd4(), eventfd2(), epoll_create1(), dup3(), pipe2(), and
>>> inotify_init1() indicates that this is the kind of thing we want to
>>> plan for. Return -EINVAL if the mandatory flags received are unknown.
>>> - Create include/linux/membarrier.h to define these flags.
>>> - Add MEMBARRIER_QUERY optional flag.
>>>
>>> Changes since v4:
>>> - Add "int expedited" parameter, use synchronize_sched() in the
>>> non-expedited case. Thanks to Lai Jiangshan for making us consider
>>> seriously using synchronize_sched() to provide the low-overhead
>>> membarrier scheme.
>>> - Check num_online_cpus() == 1, quickly return without doing nothing.
>>>
>>> Changes since v3a:
>>> - Confirm that each CPU indeed runs the current task's ->mm before
>>> sending an IPI. Ensures that we do not disturb RT tasks in the
>>> presence of lazy TLB shootdown.
>>> - Document memory barriers needed in switch_mm().
>>> - Surround helper functions with #ifdef CONFIG_SMP.
>>>
>>> Changes since v2:
>>> - simply send-to-many to the mm_cpumask. It contains the list of
>>> processors we have to IPI to (which use the mm), and this mask is
>>> updated atomically.
>>>
>>> Changes since v1:
>>> - Only perform the IPI in CONFIG_SMP.
>>> - Only perform the IPI if the process has more than one thread.
>>> - Only send IPIs to CPUs involved with threads belonging to our process.
>>> - Adaptative IPI scheme (single vs many IPI with threshold).
>>> - Issue smp_mb() at the beginning and end of the system call.
>>> ---
>>> MAINTAINERS | 8 +++++
>>> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
>>> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
>>> include/linux/syscalls.h | 2 ++
>>> include/uapi/asm-generic/unistd.h | 4 ++-
>>> include/uapi/linux/Kbuild | 1 +
>>> include/uapi/linux/membarrier.h | 53 +++++++++++++++++++++++++++
>>> init/Kconfig | 12 +++++++
>>> kernel/Makefile | 1 +
>>> kernel/membarrier.c | 66 ++++++++++++++++++++++++++++++++++
>>> kernel/sys_ni.c | 3 ++
>>> 11 files changed, 151 insertions(+), 1 deletion(-)
>>> create mode 100644 include/uapi/linux/membarrier.h
>>> create mode 100644 kernel/membarrier.c
>>>
>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>> index 0d70760..b560da6 100644
>>> --- a/MAINTAINERS
>>> +++ b/MAINTAINERS
>>> @@ -6642,6 +6642,14 @@ W: http://www.mellanox.com
>>> Q: http://patchwork.ozlabs.org/project/netdev/list/
>>> F: drivers/net/ethernet/mellanox/mlx4/en_*
>>>
>>> +MEMBARRIER SUPPORT
>>> +M: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
>>> +M: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
>>> +L: linux-kernel@xxxxxxxxxxxxxxx
>>> +S: Supported
>>> +F: kernel/membarrier.c
>>> +F: include/uapi/linux/membarrier.h
>>> +
>>> MEMORY MANAGEMENT
>>> L: linux-mm@xxxxxxxxx
>>> W: http://www.linux-mm.org
>>> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
>>> index ef8187f..e63ad61 100644
>>> --- a/arch/x86/entry/syscalls/syscall_32.tbl
>>> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
>>> @@ -365,3 +365,4 @@
>>> 356 i386 memfd_create sys_memfd_create
>>> 357 i386 bpf sys_bpf
>>> 358 i386 execveat sys_execveat stub32_execveat
>>> +359 i386 membarrier sys_membarrier
>>> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
>>> index 9ef32d5..87f3cd6 100644
>>> --- a/arch/x86/entry/syscalls/syscall_64.tbl
>>> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
>>> @@ -329,6 +329,7 @@
>>> 320 common kexec_file_load sys_kexec_file_load
>>> 321 common bpf sys_bpf
>>> 322 64 execveat stub_execveat
>>> +323 common membarrier sys_membarrier
>>>
>>> #
>>> # x32-specific system call numbers start at 512 to avoid cache impact
>>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>>> index b45c45b..d4ab99b 100644
>>> --- a/include/linux/syscalls.h
>>> +++ b/include/linux/syscalls.h
>>> @@ -884,4 +884,6 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
>>> const char __user *const __user *argv,
>>> const char __user *const __user *envp, int flags);
>>>
>>> +asmlinkage long sys_membarrier(int cmd, int flags);
>>> +
>>> #endif
>>> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
>>> index e016bd9..8da542a 100644
>>> --- a/include/uapi/asm-generic/unistd.h
>>> +++ b/include/uapi/asm-generic/unistd.h
>>> @@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
>>> __SYSCALL(__NR_bpf, sys_bpf)
>>> #define __NR_execveat 281
>>> __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
>>> +#define __NR_membarrier 282
>>> +__SYSCALL(__NR_membarrier, sys_membarrier)
>>>
>>> #undef __NR_syscalls
>>> -#define __NR_syscalls 282
>>> +#define __NR_syscalls 283
>>>
>>> /*
>>> * All syscalls below here should go away really,
>>> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
>>> index 1ff9942..e6f229a 100644
>>> --- a/include/uapi/linux/Kbuild
>>> +++ b/include/uapi/linux/Kbuild
>>> @@ -251,6 +251,7 @@ header-y += mdio.h
>>> header-y += media.h
>>> header-y += media-bus-format.h
>>> header-y += mei.h
>>> +header-y += membarrier.h
>>> header-y += memfd.h
>>> header-y += mempolicy.h
>>> header-y += meye.h
>>> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
>>> new file mode 100644
>>> index 0000000..e0b108b
>>> --- /dev/null
>>> +++ b/include/uapi/linux/membarrier.h
>>> @@ -0,0 +1,53 @@
>>> +#ifndef _UAPI_LINUX_MEMBARRIER_H
>>> +#define _UAPI_LINUX_MEMBARRIER_H
>>> +
>>> +/*
>>> + * linux/membarrier.h
>>> + *
>>> + * membarrier system call API
>>> + *
>>> + * Copyright (c) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
>>> + *
>>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>>> + * of this software and associated documentation files (the "Software"), to deal
>>> + * in the Software without restriction, including without limitation the rights
>>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>>> + * copies of the Software, and to permit persons to whom the Software is
>>> + * furnished to do so, subject to the following conditions:
>>> + *
>>> + * The above copyright notice and this permission notice shall be included in
>>> + * all copies or substantial portions of the Software.
>>> + *
>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
>>> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
>>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
>>> + * SOFTWARE.
>>> + */
>>> +
>>> +/**
>>> + * enum membarrier_cmd - membarrier system call command
>>> + * @MEMBARRIER_CMD_QUERY: Query the set of supported commands. It returns
>>> + * a bitmask of valid commands.
>>> + * @MEMBARRIER_CMD_SHARED: Execute a memory barrier on all running threads.
>>> + * Upon return from system call, the caller thread
>>> + * is ensured that all running threads have passed
>>> + * through a state where all memory accesses to
>>> + * user-space addresses match program order between
>>> + * entry to and return from the system call
>>> + * (non-running threads are de facto in such a
>>> + * state). This covers threads from all processes
>>> + * running on the system. This command returns 0.
>>> + *
>>> + * Command to be passed to the membarrier system call. The commands need to
>>> + * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
>>> + * the value 0.
>>> + */
>>> +enum membarrier_cmd {
>>> + MEMBARRIER_CMD_QUERY = 0,
>>> + MEMBARRIER_CMD_SHARED = (1 << 0),
>>> +};
>>> +
>>> +#endif /* _UAPI_LINUX_MEMBARRIER_H */
>>> diff --git a/init/Kconfig b/init/Kconfig
>>> index af09b4f..4bba60f 100644
>>> --- a/init/Kconfig
>>> +++ b/init/Kconfig
>>> @@ -1577,6 +1577,18 @@ config PCI_QUIRKS
>>> bugs/quirks. Disable this only if your target machine is
>>> unaffected by PCI quirks.
>>>
>>> +config MEMBARRIER
>>> + bool "Enable membarrier() system call" if EXPERT
>>> + default y
>>> + help
>>> + Enable the membarrier() system call that allows issuing memory
>>> + barriers across all running threads, which can be used to distribute
>>> + the cost of user-space memory barriers asymmetrically by transforming
>>> + pairs of memory barriers into pairs consisting of membarrier() and a
>>> + compiler barrier.
>>> +
>>> + If unsure, say Y.
>>> +
>>> config EMBEDDED
>>> bool "Embedded system"
>>> option allnoconfig_y
>>> diff --git a/kernel/Makefile b/kernel/Makefile
>>> index 43c4c92..92a481b 100644
>>> --- a/kernel/Makefile
>>> +++ b/kernel/Makefile
>>> @@ -98,6 +98,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
>>> obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>>> obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
>>> obj-$(CONFIG_TORTURE_TEST) += torture.o
>>> +obj-$(CONFIG_MEMBARRIER) += membarrier.o
>>>
>>> $(obj)/configs.o: $(obj)/config_data.h
>>>
>>> diff --git a/kernel/membarrier.c b/kernel/membarrier.c
>>> new file mode 100644
>>> index 0000000..536c727
>>> --- /dev/null
>>> +++ b/kernel/membarrier.c
>>> @@ -0,0 +1,66 @@
>>> +/*
>>> + * Copyright (C) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
>>> + *
>>> + * membarrier system call
>>> + *
>>> + * This program is free software; you can redistribute it and/or modify
>>> + * it under the terms of the GNU General Public License as published by
>>> + * the Free Software Foundation; either version 2 of the License, or
>>> + * (at your option) any later version.
>>> + *
>>> + * This program is distributed in the hope that it will be useful,
>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>>> + * GNU General Public License for more details.
>>> + */
>>> +
>>> +#include <linux/syscalls.h>
>>> +#include <linux/membarrier.h>
>>> +
>>> +/*
>>> + * Bitmask made from a "or" of all commands within enum membarrier_cmd,
>>> + * except MEMBARRIER_CMD_QUERY.
>>> + */
>>> +#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED)
>>> +
>>> +/**
>>> + * sys_membarrier - issue memory barriers on a set of threads
>>> + * @cmd: Takes command values defined in enum membarrier_cmd.
>>> + * @flags: Currently needs to be 0. For future extensions.
>>> + *
>>> + * If this system call is not implemented, -ENOSYS is returned. If the
>>> + * command specified does not exist, or if the command argument is invalid,
>>> + * this system call returns -EINVAL. For a given command, with flags argument
>>> + * set to 0, this system call is guaranteed to always return the same value
>>> + * until reboot.
>>> + *
>>> + * All memory accesses performed in program order from each targeted thread
>>> + * is guaranteed to be ordered with respect to sys_membarrier(). If we use
>>> + * the semantic "barrier()" to represent a compiler barrier forcing memory
>>> + * accesses to be performed in program order across the barrier, and
>>> + * smp_mb() to represent explicit memory barriers forcing full memory
>>> + * ordering across the barrier, we have the following ordering table for
>>> + * each pair of barrier(), sys_membarrier() and smp_mb():
>>> + *
>>> + * The pair ordering is detailed as (O: ordered, X: not ordered):
>>> + *
>>> + *                    barrier()   smp_mb()   sys_membarrier()
>>> + * barrier()              X          X              O
>>> + * smp_mb()               X          O              O
>>> + * sys_membarrier()       O          O              O
>>> + */
>>> +SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
>>> +{
>>> + if (unlikely(flags))
>>> + return -EINVAL;
>>> + switch (cmd) {
>>> + case MEMBARRIER_CMD_QUERY:
>>> + return MEMBARRIER_CMD_BITMASK;
>>> + case MEMBARRIER_CMD_SHARED:
>>> + if (num_online_cpus() > 1)
>>> + synchronize_sched();
>>> + return 0;
>>> + default:
>>> + return -EINVAL;
>>> + }
>>> +}
>>> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>>> index 7995ef5..eb4fde0 100644
>>> --- a/kernel/sys_ni.c
>>> +++ b/kernel/sys_ni.c
>>> @@ -243,3 +243,6 @@ cond_syscall(sys_bpf);
>>>
>>> /* execveat */
>>> cond_syscall(sys_execveat);
>>> +
>>> +/* membarrier */
>>> +cond_syscall(sys_membarrier);
>>>
>>
>>
>> --
>> Michael Kerrisk
>> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
>> Linux/UNIX System Programming Training: http://man7.org/training/
>


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/