Re: [PATCH 2/2] futex: lockless wakeups

From: Davidlohr Bueso
Date: Mon Apr 20 2015 - 09:56:02 EST


On Mon, 2015-04-20 at 08:18 +0200, Ingo Molnar wrote:
> Please write a small description we can cite to driver authors once
> the (inevitable) breakages appear, outlining this new behavior and its
> implications, so that we can fix any remaining bugs ASAP.

I wouldn't call this new behavior, simply because changing a critical
region should not be labeled as such imho. Other than asking driver
authors to put their schedule() in a loop to confirm that the expected
condition has in fact occurred, I'm not sure what else we can ask them
to do -- as you know, this is not just about futexes.

> I'll also let this pending a bit longer than other changes, to make
> sure we shake out any bugs/regressions triggered by this change.
>
> Third, it might make sense to add a new 'spurious wakeup injection
> debug mechanism' that, if enabled in the .config, automatically and
> continuously inserts spurious wakeups at a given, slightly randomized
> rate - which would ensure that all kernel facilities can robustly
> handle spurious wakeups.

I have been using this from Peter to test against:

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432..fdf1f68 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -214,9 +214,10 @@ print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq);
#define TASK_WAKEKILL 128
#define TASK_WAKING 256
#define TASK_PARKED 512
-#define TASK_STATE_MAX 1024
+#define TASK_YIELD 1024
+#define TASK_STATE_MAX 2048

-#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWP"
+#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPY"

extern char ___assert_task_state[1 - 2*!!(
sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f0f831e..2c938ae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1005,7 +1005,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
* ttwu() will sort out the placement.
*/
WARN_ON_ONCE(p->state != TASK_RUNNING && p->state != TASK_WAKING &&
- !p->on_rq);
+ !p->on_rq && !(p->state & TASK_YIELD));

#ifdef CONFIG_LOCKDEP
/*
@@ -2743,6 +2743,23 @@ static void __sched __schedule(void)
if (unlikely(signal_pending_state(prev->state, prev))) {
prev->state = TASK_RUNNING;
} else {
+
+ /*
+ * Provide an auto-yield feature on schedule().
+ *
+ * The thought is to avoid a sleep+wakeup cycle
+ * if simply yielding the cpu will suffice to
+ * satisfy the required condition.
+ *
+ * Assumes the calling schedule() site can deal
+ * with spurious wakeups.
+ */
+ if (prev->state & TASK_YIELD) {
+ prev->state &= ~TASK_YIELD;
+ if (rq->nr_running > 1)
+ goto no_deactivate;
+ }
+
deactivate_task(rq, prev, DEQUEUE_SLEEP);
prev->on_rq = 0;

@@ -2759,6 +2776,7 @@ static void __sched __schedule(void)
try_to_wake_up_local(to_wakeup);
}
}
+ no_deactivate:
switch_count = &prev->nvcsw;
}


> My guess would be that most remaining fragilities against spurious
> wakeups ought to be in the boot/init phase, so I'd keep an eye out for
> suspend/resume regressions.

Correct, which is why I'm not that concerned anymore about spurious
wakups, in fact that code above now boots and handles correctly on
rather large systems.

>
> > [...] However there is core code that cannot handle them afaict, and
> > furthermore tglx does have the point that other events can already
> > trigger them anyway.
>
> s/there is core code/there is no core code

heh yes.

Thanks,
Davidlohr

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/