Re: [PATCH] block: fix a crash in do_task_dead()
From: Peter Zijlstra
Date: Fri Jun 07 2019 - 10:27:55 EST
On Fri, Jun 07, 2019 at 03:35:41PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 05, 2019 at 09:04:02AM -0600, Jens Axboe wrote:
> > How about the following plan - if folks are happy with this sched patch,
> > we can queue it up for 5.3. Once that is in, I'll kill the block change
> > that special cases the polled task wakeup. For 5.2, we go with Oleg's
> > patch for the swap case.
>
> OK, works for me. I'll go write a proper patch.
I now have the below; I'll queue that after the long weekend and let
0-day chew on it for a while and then push it out to tip or something.
---
Subject: sched: Optimize try_to_wake_up() for local wakeups
From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Date: Fri Jun 7 15:39:49 CEST 2019
Jens reported that significant performance can be had on some block
workloads (XXX numbers?) by special casing local wakeups. That is,
wakeups on the current task before it schedules out. Given something
like the normal wait pattern:
for (;;) {
set_current_state(TASK_UNINTERRUPTIBLE);
if (cond)
break;
schedule();
}
__set_current_state(TASK_RUNNING);
Any wakeup (on this CPU) after set_current_state() and before
schedule() would benefit from this.
Normal wakeups take p->pi_lock, which serializes wakeups to the same
task. By eliding that we gain concurrency on:
- ttwu_stat(); we already had concurrency on rq stats, this now also
brings it to task stats. -ENOCARE
- tracepoints; it is now possible to get multiple instances of
trace_sched_waking() (and possibly trace_sched_wakeup()) for the
same task. Tracers will have to learn to cope.
Furthermore, p->pi_lock is used by set_special_state(), to order
against TASK_RUNNING stores from other CPUs. But since this is
strictly CPU local, we don't need the lock, and set_special_state()'s
disabling of IRQs is sufficient.
After the normal wakeup takes p->pi_lock it issues
smp_mb__after_spinlock(), in order to ensure the woken task must
observe prior stores before we observe the p->state. If this is CPU
local, this will be satisfied with a compiler barrier, and we rely on
try_to_wake_up() being a funcation call, which implies such.
Since, when 'p == current', 'p->on_rq' must be true, the normal wakeup
would continue into the ttwu_remote() branch, which normally is
concerned with exactly this wakeup scenario, except from a remote CPU.
IOW we're waking a task that is still running. In this case, we can
trivially avoid taking rq->lock, all that's left from this is to set
p->state.
This then yields an extremely simple and fast path for 'p == current'.
Cc: Qian Cai <cai@xxxxxx>
Cc: mingo@xxxxxxxxxx
Cc: akpm@xxxxxxxxxxxxxxxxxxxx
Cc: hch@xxxxxx
Cc: gkohli@xxxxxxxxxxxxxx
Cc: oleg@xxxxxxxxxx
Reported-by: Jens Axboe <axboe@xxxxxxxxx>
Tested-by: Jens Axboe <axboe@xxxxxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
---
kernel/sched/core.c | 33 ++++++++++++++++++++++++++++-----
1 file changed, 28 insertions(+), 5 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1991,6 +1991,28 @@ try_to_wake_up(struct task_struct *p, un
unsigned long flags;
int cpu, success = 0;
+ if (p == current) {
+ /*
+ * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
+ * == smp_processor_id()'. Together this means we can special
+ * case the whole 'p->on_rq && ttwu_remote()' case below
+ * without taking any locks.
+ *
+ * In particular:
+ * - we rely on Program-Order guarantees for all the ordering,
+ * - we're serialized against set_special_state() by virtue of
+ * it disabling IRQs (this allows not taking ->pi_lock).
+ */
+ if (!(p->state & state))
+ return false;
+
+ success = 1;
+ trace_sched_waking(p);
+ p->state = TASK_RUNNING;
+ trace_sched_wakeup(p);
+ goto out;
+ }
+
/*
* If we are going to wake up a thread waiting for CONDITION we
* need to ensure that CONDITION=1 done by the caller can not be
@@ -2000,7 +2022,7 @@ try_to_wake_up(struct task_struct *p, un
raw_spin_lock_irqsave(&p->pi_lock, flags);
smp_mb__after_spinlock();
if (!(p->state & state))
- goto out;
+ goto unlock;
trace_sched_waking(p);
@@ -2030,7 +2052,7 @@ try_to_wake_up(struct task_struct *p, un
*/
smp_rmb();
if (p->on_rq && ttwu_remote(p, wake_flags))
- goto stat;
+ goto unlock;
#ifdef CONFIG_SMP
/*
@@ -2090,10 +2112,11 @@ try_to_wake_up(struct task_struct *p, un
#endif /* CONFIG_SMP */
ttwu_queue(p, cpu, wake_flags);
-stat:
- ttwu_stat(p, cpu, wake_flags);
-out:
+unlock:
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+out:
+ if (success)
+ ttwu_stat(p, cpu, wake_flags);
return success;
}