Re: Too many rescheduling interrupts (still!)

From: Peter Zijlstra
Date: Wed Feb 12 2014 - 11:39:34 EST


On Wed, Feb 12, 2014 at 07:49:07AM -0800, Andy Lutomirski wrote:
> On Wed, Feb 12, 2014 at 2:13 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote:
> >> On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> >> >> A small number of reschedule interrupts appear to be due to a race:
> >> >> both resched_task and wake_up_idle_cpu do, essentially:
> >> >>
> >> >> set_tsk_need_resched(t);
> >> >> smb_mb();
> >> >> if (!tsk_is_polling(t))
> >> >> smp_send_reschedule(cpu);
> >> >>
> >> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
> >> >> is too quick (which isn't surprising if it was in C0 or C1), then it
> >> >> could *clear* TS_POLLING before tsk_is_polling is read.
> >
> > Yeah we have the wrong default for the idle loops.. it should default to
> > polling and only switch to !polling at the very last moment if it really
> > needs an interrupt to wake.
>
> I might be missing something, but won't that break the scheduler?

for the idle task.. all other tasks will have it !polling.

But note how the current generic idle loop does:

if (!current_clr_polling_and_test()) {
...
if (cpuidle_idle_call())
arch_cpu_idle();
...
}

This means that it still runs a metric ton of code, right up to the
mwait with !polling, and then at the mwait we switch it back to polling.

Completely daft.

> Since rq->lock is held, the resched calls could check the rq state
> (curr == idle, maybe) to distinguish these cases.

Not enough; but I'm afraid I confused you with the above.

My suggestion was really more that we should call into the cpuidle/arch
idle code with polling set, and only right before we hit hlt/wfi/etc..
should we clear the polling bit.

> > It can't we're holding its rq->lock.
>
> Exactly. AFAICT the only reason that any of this code holds rq->lock
> (especially ttwu_queue_remote, which I seem to call a few thousand
> times per second) is because the only way to make a cpu reschedule
> involves playing with per-task flags. If the flags were per-rq or
> per-cpu instead, then rq->lock wouldn't be needed. If this were all
> done locklessly, then I think either a full cmpxchg or some fairly
> careful use of full barriers would be needed, but I bet that cmpxchg
> is still considerably faster than a spinlock plus a set_bit.

Ahh, that's what you're saying. Yes we should be able to do something
clever there.

Something like the below is I think as close as we can come without
major surgery and moving TIF_NEED_RESCHED and POLLING into a per-cpu
variable.

I might have messed it up though; brain seems to have given out for the
day :/

---
kernel/sched/core.c | 17 +++++++++++++----
kernel/sched/idle.c | 21 +++++++++++++--------
kernel/sched/sched.h | 5 ++++-
3 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fb9764fbc537..a5b64040c21d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -529,7 +529,7 @@ void resched_task(struct task_struct *p)
}

/* NEED_RESCHED must be visible before we test polling */
- smp_mb();
+ smp_mb__after_clear_bit();
if (!tsk_is_polling(p))
smp_send_reschedule(cpu);
}
@@ -1476,12 +1476,15 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
}

#ifdef CONFIG_SMP
-static void sched_ttwu_pending(void)
+void sched_ttwu_pending(void)
{
struct rq *rq = this_rq();
struct llist_node *llist = llist_del_all(&rq->wake_list);
struct task_struct *p;

+ if (!llist)
+ return;
+
raw_spin_lock(&rq->lock);

while (llist) {
@@ -1536,8 +1539,14 @@ void scheduler_ipi(void)

static void ttwu_queue_remote(struct task_struct *p, int cpu)
{
- if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
- smp_send_reschedule(cpu);
+ struct rq *rq = cpu_rq(cpu);
+
+ if (llist_add(&p->wake_entry, &rq->wake_list)) {
+ set_tsk_need_resched(rq->idle);
+ smp_mb__after_clear_bit();
+ if (!tsk_is_polling(rq->idle) || rq->curr != rq->idle)
+ smp_send_reschedule(cpu);
+ }
}

bool cpus_share_cache(int this_cpu, int that_cpu)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 14ca43430aee..bd8ed2d2f2f7 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -105,19 +105,24 @@ static void cpu_idle_loop(void)
} else {
local_irq_enable();
}
- __current_set_polling();
}
arch_cpu_idle_exit();
- /*
- * We need to test and propagate the TIF_NEED_RESCHED
- * bit here because we might not have send the
- * reschedule IPI to idle tasks.
- */
- if (tif_need_resched())
- set_preempt_need_resched();
}
+
+ /*
+ * We must clear polling before running sched_ttwu_pending().
+ * Otherwise it becomes possible to have entries added in
+ * ttwu_queue_remote() and still not get an IPI to process
+ * them.
+ */
+ __current_clr_polling();
+
+ set_preempt_need_resched();
+ sched_ttwu_pending();
+
tick_nohz_idle_exit();
schedule_preempt_disabled();
+ __current_set_polling();
}
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1bf34c257d3b..b59dbdb135d8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1157,9 +1157,10 @@ extern const struct sched_class rt_sched_class;
extern const struct sched_class fair_sched_class;
extern const struct sched_class idle_sched_class;

-
#ifdef CONFIG_SMP

+extern void sched_ttwu_pending(void)
+
extern void update_group_power(struct sched_domain *sd, int cpu);

extern void trigger_load_balance(struct rq *rq);
@@ -1170,6 +1171,8 @@ extern void idle_exit_fair(struct rq *this_rq);

#else /* CONFIG_SMP */

+static inline void sched_ttwu_pending(void) { }
+
static inline void idle_balance(int cpu, struct rq *rq)
{
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/