Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections

From: Paul Turner
Date: Thu May 21 2015 - 16:07:45 EST

Next message: Maxime Coquelin: "Re: [PATCH v8 14/16] ARM: dts: Introduce STM32F429 MCU"
Previous message: Pali RohÃr: "Re: randconfig build error with next-20150521, in drivers/platform/x86/dell-rbtn.c"
In reply to: Paul Turner: "Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections"
Next in thread: Mathieu Desnoyers: "Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Very nice!

Some inline notes:

On Thu, May 21, 2015 at 7:44 AM, Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> Expose a new system call allowing userspace threads to register
> a TLS area used as an ABI between the kernel and userspace to
> share information required to create efficient per-cpu critical
> sections in user-space.
>
> This ABI consists of a thread-local structure containing:
>
> - a nesting count surrounding the critical section,
> - a signal number to be sent to the thread when preempting a thread
> with non-zero nesting count,
> - a flag indicating whether the signal has been sent within the
> critical section,
> - an integer where to store the current CPU number, updated whenever
> the thread is preempted. This CPU number cache is not strictly
> needed, but performs better than getcpu vdso.

While it is fractionally faster, this is actually something we've
strongly considered dropping. The complexity of correct TLS
addressing is non-trivial.

>
> This approach is inspired by Paul Turner and Andrew Hunter's work
> on percpu atomics, which lets the kernel handle restart of critical
> sections, ref. http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
>
> What is done differently here compared to percpu atomics: we track
> a single nesting counter per thread rather than many ranges of
> instruction pointer values. We deliver a signal to user-space and
> let the logic of restart be handled in user-space, thus moving
> the complexity out of the kernel. The nesting counter approach
> allows us to skip the complexity of interacting with signals that
> would be otherwise needed with the percpu atomics approach, which
> needs to know which instruction pointers are preempted, including
> when preemption occurs on a signal handler nested over an instruction
> pointer of interest.
>
> Advantages of this approach over percpu atomics:
> - kernel code is relatively simple: complexity of restart sections
> is in user-space,

I think delivering a signal is actually much more complex than the
address check here.
> - easy to port to other architectures: just need to reserve a new
> system call,

Not having an ABI is a potential nice advantage here.

> - for threads which have registered a TLS structure, the fast-path
> at preemption is only a nesting counter check, along with the
> optional store of the current CPU number, rather than comparing
> instruction pointer with possibly many registered ranges,

I'd argue this is not actually an advantage: The preemption is the
_slow path_. For a reasonably behaving application this is something
that should be happening on O(ms) boundaries. Whereas entry/exit of
the critical section may be occurring much more frequently. This is
also coming at the cost of a slower fast path (nesting updates).

It's also worth noting that the range comparison can be done in logN
(binary search) or even constant time (bound max sequence size and
mask out bits on address, or use a lookup table).

That said there are two very specific advantages:
1) CONFIG_PREEMPT greatly complicates correct inspection of the user address.
2) One region per address space is a difficult model for binaries with
shared linkage.

[ Note: Compatibility of (1) is considered the primary problem with
our current patch-set. I'll publish it as is, then try to follow up
with some alternatives we've been considering to get around this. ]

>
> Caveats of this approach compared to the percpu atomics:
> - We need a signal number for this, so it cannot be done without
> designing the application accordingly,
> - Handling restart in user-space is currently performed with page
> protection, for which we install a SIGSEGV signal handler. Again,
> this requires designing the application accordingly, especially
> if the application installs its own segmentation fault handler,
> - It cannot be used for tracing of processes by injection of code
> into their address space, due to interactions with application
> signal handlers.
>
> The user-space proof of concept code implementing the restart section
> can be found here: https://github.com/compudj/percpu-dev
>
> Benchmarking sched_getcpu() vs tls cache approach. Getting the
> current CPU number:
>
> - With Linux vdso: 12.7 ns
> - With TLS-cached cpu number: 0.3 ns
>

Have you tried also a naked read against the segment limit? This is
more comparable to TLS. However, it's a definite ABI challenge.
> We will use the TLS-cached cpu number for the following
> benchmarks.
>
> On an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, comparison
> with a baseline running very few load/stores (no locking,
> no getcpu, assuming one thread per CPU with affinity),
> against locking scheme based on "lock; cmpxchg", "cmpxchg"
> (using restart signal), load-store (using restart signal).
> This is performed with 32 threads on a 16-core, hyperthread
> system:
>
> ns/loop overhead (ns)
> Baseline: 3.7 0.0
> lock; cmpxchg: 22.0 18.3
> cmpxchg: 11.1 7.4
> load-store: 9.4 5.7
>
> Therefore, the load-store scheme has a speedup of 3.2x over the
> "lock; cmpxchg" scheme if both are using the tls-cache for the
> CPU number. If we use Linux sched_getcpu() for "lock; cmpxchg"
> we reach of speedup of 5.4x for load-store+tls-cache vs
> "lock; cmpxchg"+vdso-getcpu.
>
> I'm sending this out to trigger discussion, and hopefully to see
> Paul and Andrew's patches being posted publicly at some point, so
> we can compare our approaches.
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> CC: Paul Turner <pjt@xxxxxxxxxx>
> CC: Andrew Hunter <ahh@xxxxxxxxxx>
> CC: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> CC: Ingo Molnar <mingo@xxxxxxxxxx>
> CC: Ben Maurer <bmaurer@xxxxxx>
> CC: Steven Rostedt <rostedt@xxxxxxxxxxx>
> CC: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
> CC: Josh Triplett <josh@xxxxxxxxxxxxxxxx>
> CC: Lai Jiangshan <laijs@xxxxxxxxxxxxxx>
> CC: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> CC: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> ---
> arch/x86/syscalls/syscall_64.tbl | 1 +
> fs/exec.c | 1 +
> include/linux/sched.h | 18 ++++++
> include/uapi/asm-generic/unistd.h | 4 +-
> init/Kconfig | 10 +++
> kernel/Makefile | 1 +
> kernel/fork.c | 2 +
> kernel/percpu-user.c | 126 ++++++++++++++++++++++++++++++++++++++
> kernel/sys_ni.c | 3 +
> 9 files changed, 165 insertions(+), 1 deletion(-)
> create mode 100644 kernel/percpu-user.c
>
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index 8d656fb..0499703 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -329,6 +329,7 @@
> 320 common kexec_file_load sys_kexec_file_load
> 321 common bpf sys_bpf
> 322 64 execveat stub_execveat
> +323 common percpu sys_percpu
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/exec.c b/fs/exec.c
> index c7f9b73..0a2f0b2 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1555,6 +1555,7 @@ static int do_execveat_common(int fd, struct filename *filename,
> /* execve succeeded */
> current->fs->in_exec = 0;
> current->in_execve = 0;
> + percpu_user_execve(current);
> acct_update_integrals(current);
> task_numa_free(current);
> free_bprm(bprm);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a419b65..9c88bff 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1275,6 +1275,8 @@ enum perf_event_task_context {
> perf_nr_task_contexts,
> };
>
> +struct thread_percpu_user;
> +
> struct task_struct {
> volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
> void *stack;
> @@ -1710,6 +1712,10 @@ struct task_struct {
> #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> unsigned long task_state_change;
> #endif
> +#ifdef CONFIG_PERCPU_USER
> + struct preempt_notifier percpu_user_notifier;
> + struct thread_percpu_user __user *percpu_user;
> +#endif
> };
>
> /* Future-safe accessor for struct task_struct's cpus_allowed. */
> @@ -3090,4 +3096,16 @@ static inline unsigned long rlimit_max(unsigned int limit)
> return task_rlimit_max(current, limit);
> }
>
> +#ifdef CONFIG_PERCPU_USER
> +void percpu_user_fork(struct task_struct *t);
> +void percpu_user_execve(struct task_struct *t);
> +#else
> +static inline void percpu_user_fork(struct task_struct *t)
> +{
> +}
> +static inline void percpu_user_execve(struct task_struct *t)
> +{
> +}
> +#endif
> +
> #endif
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index e016bd9..f4350d9 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
> __SYSCALL(__NR_bpf, sys_bpf)
> #define __NR_execveat 281
> __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
> +#define __NR_percpu 282
> +__SYSCALL(__NR_percpu, sys_percpu)
>
> #undef __NR_syscalls
> -#define __NR_syscalls 282
> +#define __NR_syscalls 283
>
> /*
> * All syscalls below here should go away really,
> diff --git a/init/Kconfig b/init/Kconfig
> index f5dbc6d..73c4070 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1559,6 +1559,16 @@ config PCI_QUIRKS
> bugs/quirks. Disable this only if your target machine is
> unaffected by PCI quirks.
>
> +config PERCPU_USER
> + bool "Enable percpu() system call" if EXPERT
> + default y
> + select PREEMPT_NOTIFIERS
> + help
> + Enable the percpu() system call which provides a building block
> + for fast per-cpu critical sections in user-space.
> +
> + If unsure, say Y.
> +
> config EMBEDDED
> bool "Embedded system"
> option allnoconfig_y
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 1408b33..76919a6 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -96,6 +96,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
> obj-$(CONFIG_JUMP_LABEL) += jump_label.o
> obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
> obj-$(CONFIG_TORTURE_TEST) += torture.o
> +obj-$(CONFIG_PERCPU_USER) += percpu-user.o
>
> $(obj)/configs.o: $(obj)/config_data.h
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index cf65139..63aaf5a 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1549,6 +1549,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> cgroup_post_fork(p);
> if (clone_flags & CLONE_THREAD)
> threadgroup_change_end(current);
> + if (!(clone_flags & CLONE_THREAD))
> + percpu_user_fork(p);
> perf_event_fork(p);
>
> trace_task_newtask(p, clone_flags);
> diff --git a/kernel/percpu-user.c b/kernel/percpu-user.c
> new file mode 100644
> index 0000000..be3d439
> --- /dev/null
> +++ b/kernel/percpu-user.c
> @@ -0,0 +1,126 @@
> +/*
> + * Copyright (C) 2015 Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> + *
> + * percpu system call
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/preempt.h>
> +#include <linux/init.h>
> +#include <linux/sched.h>
> +#include <linux/uaccess.h>
> +#include <linux/syscalls.h>
> +
> +struct thread_percpu_user {
> + int32_t nesting;
> + int32_t signal_sent;
> + int32_t signo;
> + int32_t current_cpu;
> +};
> +
> +static void percpu_user_sched_in(struct preempt_notifier *notifier, int cpu)
> +{
> + struct thread_percpu_user __user *tpu_user;
> + struct thread_percpu_user tpu;
> + struct task_struct *t = current;
> +
> + tpu_user = t->percpu_user;
> + if (tpu_user == NULL)
> + return;
> + if (unlikely(t->flags & PF_EXITING))
> + return;
> + /*
> + * access_ok() of tpu_user has already been checked by sys_percpu().
> + */
> + if (__put_user(smp_processor_id(), &tpu_user->current_cpu)) {
> + WARN_ON_ONCE(1);
> + return;
> + }
> + if (__copy_from_user(&tpu, tpu_user, sizeof(tpu))) {
> + WARN_ON_ONCE(1);
> + return;
> + }
> + if (!tpu.nesting || tpu.signal_sent)
> + return;
> + if (do_send_sig_info(tpu.signo, SEND_SIG_PRIV, t, 0)) {
> + WARN_ON_ONCE(1);
> + return;
> + }
> + tpu.signal_sent = 1;
> + if (__copy_to_user(tpu_user, &tpu, sizeof(tpu))) {
> + WARN_ON_ONCE(1);
> + return;
> + }
> +}
> +
> +static void percpu_user_sched_out(struct preempt_notifier *notifier,
> + struct task_struct *next)
> +{
> +}
> +
> +static struct preempt_ops percpu_user_ops = {
> + .sched_in = percpu_user_sched_in,
> + .sched_out = percpu_user_sched_out,
> +};
> +
> +/*
> + * If parent had a percpu-user preempt notifier, we need to setup our own.
> + */
> +void percpu_user_fork(struct task_struct *t)
> +{
> + struct task_struct *parent = current;
> +
> + if (!parent->percpu_user)
> + return;
> + preempt_notifier_init(&t->percpu_user_notifier, &percpu_user_ops);
> + preempt_notifier_register(&t->percpu_user_notifier);
> + t->percpu_user = parent->percpu_user;
> +}
> +
> +void percpu_user_execve(struct task_struct *t)
> +{
> + if (!t->percpu_user)
> + return;
> + preempt_notifier_unregister(&t->percpu_user_notifier);
> + t->percpu_user = NULL;
> +}
> +
> +/*
> + * sys_percpu - setup user-space per-cpu critical section for caller thread
> + */
> +SYSCALL_DEFINE1(percpu, struct thread_percpu_user __user *, tpu)
> +{
> + struct task_struct *t = current;
> +
> + if (tpu == NULL) {
> + if (t->percpu_user)
> + preempt_notifier_unregister(&t->percpu_user_notifier);
> + goto set_tpu;
> + }
> + if (!access_ok(VERIFY_WRITE, tpu, sizeof(struct thread_percpu_user)))
> + return -EFAULT;
> + preempt_disable();
> + if (__put_user(smp_processor_id(), &tpu->current_cpu)) {
> + WARN_ON_ONCE(1);
> + preempt_enable();
> + return -EFAULT;
> + }
> + preempt_enable();
> + if (!current->percpu_user) {
> + preempt_notifier_init(&t->percpu_user_notifier,
> + &percpu_user_ops);
> + preempt_notifier_register(&t->percpu_user_notifier);
> + }
> +set_tpu:
> + current->percpu_user = tpu;
> + return 0;
> +}
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 5adcb0a..16e2bc8 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -229,3 +229,6 @@ cond_syscall(sys_bpf);
>
> /* execveat */
> cond_syscall(sys_execveat);
> +
> +/* percpu userspace critical sections */
> +cond_syscall(sys_percpu);
> --
> 2.1.4
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Maxime Coquelin: "Re: [PATCH v8 14/16] ARM: dts: Introduce STM32F429 MCU"
Previous message: Pali RohÃr: "Re: randconfig build error with next-20150521, in drivers/platform/x86/dell-rbtn.c"
In reply to: Paul Turner: "Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections"
Next in thread: Mathieu Desnoyers: "Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]