Re: [RFC PATCH v7 1/7] Restartable sequences system call

From: Boqun Feng
Date: Thu Aug 11 2016 - 00:54:38 EST


On Wed, Aug 10, 2016 at 05:33:44PM +0000, Mathieu Desnoyers wrote:
> ----- On Aug 9, 2016, at 12:13 PM, Boqun Feng boqun.feng@xxxxxxxxx wrote:
>
> <snip>
>
> >
> > However, I'm thinking maybe we can use some tricks to avoid unnecessary
> > aborts-on-preemption.
> >
> > First of all, I notice we haven't make any constraint on what kind of
> > memory objects could be "protected" by rseq critical sections yet. And I
> > think this is something we should decide before adding this feature into
> > kernel.
> >
> > We can do some optimization if we have some constraints. For example, if
> > the memory objects inside the rseq critical sections could only be
> > modified by userspace programs, we therefore don't need to abort
> > immediately when userspace task -> kernel task context switch.
>
> The rseq_owner per-cpu variable and rseq_cpu field in task_struct you
> propose below would indeed take care of this scenario.
>
> >
> > Further more, if the memory objects inside the rseq critical sections
> > could only be modified by userspace programs that have registered their
> > rseq structures, we don't need to abort immediately between the context
> > switches between two rseq-unregistered tasks or one rseq-registered
> > task and one rseq-unregistered task.
> >
> > Instead, we do tricks as follow:
> >
> > defining a percpu pointer in kernel:
> >
> > DEFINE_PER_CPU(struct task_struct *, rseq_owner);
> >
> > and a cpu field in struct task_struct:
> >
> > struct task_struct {
> > ...
> > #ifdef CONFIG_RSEQ
> > struct rseq __user *rseq;
> > uint32_t rseq_event_counter;
> > int rseq_cpu;
> > #endif
> > ...
> > };
> >
> > (task_struct::rseq_cpu should be initialized as -1.)
> >
> > each time at sched out(in rseq_sched_out()), we do something like:
> >
> > if (prev->rseq) {
> > raw_cpu_write(rseq_owner, prev);
> > prev->rseq_cpu = smp_processor_id();
> > }
> >
> > each time sched in(in rseq_handle_notify_resume()), we do something
> > like:
> >
> > if (current->rseq &&
> > (this_cpu_read(rseq_owner) != current ||
> > current->rseq_cpu != smp_processor_id()))
> > __rseq_handle_notify_resume(regs);
> >
> > (Also need to modify rseq_signal_deliver() to call
> > __rseq_handle_notify_resume() directly).
> >
> >
> > I think this could save some unnecessary aborts-on-preemption, however,
> > TBH, I'm too sleepy to verify every corner case. Will recheck this
> > tomorrow.
>
> This adds extra fields to the task struct, per-cpu rseq_owner pointers,
> and hooks into sched_in which are not needed otherwise, all this to
> eliminate unneeded abort-on-preemption.
>
> If we look at the single-stepping use-case, this means that gdb would
> only be able to single-step applications as long as neither itself, nor
> any of its libraries, use rseq. This seems to be quite fragile. I prefer
> requiring rseq users to implement a fallback to locking which progresses
> in every situation rather than adding complexity and overhead trying
> lessen the odds of triggering the restart.
>
> Simply lessening the odds of triggering the restart without a design that
> ensures progress even in restart cases seems to make the lack-of-progress
> problem just harder to debug when it will surface in real life.
>

Fair enough.

I did my own research of the mechanism I proposed. The patch is attached
at the end of the email. Unfortunately, there is no noticeable
performance gain for the current benchmark. One possible reason may be:
The rseq critical sections in current benchmark are quite small, which
makes retrying is not that expensive.

From another angle, this may imply that in current senarios,
abort-on-preemption doesn't hurt the performance much. But these are
only my two cents.

> Thanks,
>
> Mathieu
>

---
include/linux/sched.h | 18 +++++++++++++++---
kernel/rseq.c | 2 ++
2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5875fdd6edc8..c23e5dee9c60 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1922,6 +1922,7 @@ struct task_struct {
#ifdef CONFIG_RSEQ
struct rseq __user *rseq;
u32 rseq_event_counter;
+ int rseq_cpu;
#endif
/* CPU-specific state of this task */
struct thread_struct thread;
@@ -3393,6 +3394,8 @@ void cpufreq_remove_update_util_hook(int cpu);
#endif /* CONFIG_CPU_FREQ */

#ifdef CONFIG_RSEQ
+DECLARE_PER_CPU(struct task_struct *, rseq_owner);
+
static inline void rseq_set_notify_resume(struct task_struct *t)
{
if (t->rseq)
@@ -3401,7 +3404,9 @@ static inline void rseq_set_notify_resume(struct task_struct *t)
void __rseq_handle_notify_resume(struct pt_regs *regs);
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
{
- if (current->rseq)
+ if (current->rseq &&
+ (current != raw_cpu_read(rseq_owner) ||
+ current->rseq_cpu != smp_processor_id()))
__rseq_handle_notify_resume(regs);
}
/*
@@ -3415,9 +3420,11 @@ static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
if (clone_flags & CLONE_THREAD) {
t->rseq = NULL;
t->rseq_event_counter = 0;
+ t->rseq_cpu = -1;
} else {
t->rseq = current->rseq;
t->rseq_event_counter = current->rseq_event_counter;
+ t->rseq_cpu = -1;
rseq_set_notify_resume(t);
}
}
@@ -3428,11 +3435,16 @@ static inline void rseq_execve(struct task_struct *t)
}
static inline void rseq_sched_out(struct task_struct *t)
{
- rseq_set_notify_resume(t);
+ if (t->rseq) {
+ raw_cpu_write(rseq_owner, t);
+ t->rseq_cpu = smp_processor_id();
+ rseq_set_notify_resume(t);
+ }
}
static inline void rseq_signal_deliver(struct pt_regs *regs)
{
- rseq_handle_notify_resume(regs);
+ if (current->rseq)
+ __rseq_handle_notify_resume(regs);
}
#else
static inline void rseq_set_notify_resume(struct task_struct *t)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 7e4d1d0e9520..0390a57ef0e5 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -100,6 +100,8 @@
* F2. Return false.
*/

+DEFINE_PER_CPU(struct task_struct *, rseq_owner);
+
/*
* The rseq_event_counter allow user-space to detect preemption and
* signal delivery. It increments at least once before returning to
--
2.9.0

Attachment: signature.asc
Description: PGP signature