Re: Too many rescheduling interrupts (still!)
From: Andy Lutomirski
Date: Tue Feb 11 2014 - 17:34:41 EST
On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> On Tue, 11 Feb 2014, Andy Lutomirski wrote:
>
> Just adding Peter for now, as I'm too tired to grok the issue right
> now.
>
>> Rumor has it that Linux 3.13 was supposed to get rid of all the silly
>> rescheduling interrupts. It doesn't, although it does seem to have
>> improved the situation.
>>
>> A small number of reschedule interrupts appear to be due to a race:
>> both resched_task and wake_up_idle_cpu do, essentially:
>>
>> set_tsk_need_resched(t);
>> smp_mb();
>> if (!tsk_is_polling(t))
>>         smp_send_reschedule(cpu);
>>
>> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
>> is too quick (which isn't surprising if it was in C0 or C1), then it
>> could *clear* TS_POLLING before tsk_is_polling is read.
>>
>> Is there a good reason that TIF_NEED_RESCHED is in thread->flags and
>> TS_POLLING is in thread->status? Couldn't both of these be in the
>> same field in something like struct rq? That would allow a real
>> atomic op here.
>>
>> The more serious issue is that AFAICS default_wake_function is
>> completely missing the polling check. It goes through
>> ttwu_queue_remote, which unconditionally sends an interrupt.
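For reference, the ttwu_queue_remote being referred to above is currently just this (it also shows up as the removed lines in the patch below):

static void ttwu_queue_remote(struct task_struct *p, int cpu)
{
	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
		smp_send_reschedule(cpu);
}

i.e. nothing on that path ever looks at TS_POLLING; the only thing that suppresses the IPI is another wakeup already being queued on the same CPU (llist_add returning false).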
There would be an extra benefit of moving the resched-related bits to
some per-cpu structure: it would allow lockless wakeups.
ttwu_queue_remote, and probably all of the other reschedule-a-cpu
functions, could do something like:
if (...) {
	old = atomic_read(resched_flags(cpu));
	while (true) {
		if (old & RESCHED_NEED_RESCHED)
			return;
		if (!(old & RESCHED_POLLING)) {
			smp_send_reschedule(cpu);
			return;
		}
		new = old | RESCHED_NEED_RESCHED;
		old = atomic_cmpxchg(resched_flags(cpu), old, new);
	}
}
The point being that, with the current location of the flags, either
an interrupt needs to be sent or something needs to be done to prevent
rq->curr from disappearing. (It probably doesn't matter if the
current task changes, because TS_POLLING will be clear, but what if
the task goes away entirely?)
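To make that concrete, here is a userspace toy model of the single-word scheme (C11 atomics, made-up flag and helper names -- not proposed kernel code): because setting the need-resched bit and testing the polling bit are one atomic RMW on the same word, the set-flag-then-miss-TS_POLLING race described at the top can't happen, and nothing needs to touch rq->curr at all.

/*
 * Userspace sketch only.  The RESCHED_* names and the per-cpu array are
 * made up for illustration; the real thing would live in struct rq or a
 * per-cpu variable.
 */
#include <stdatomic.h>
#include <stdbool.h>

#define RESCHED_NEED_RESCHED	0x01
#define RESCHED_POLLING		0x02

#define NR_TOY_CPUS		64

static atomic_uint toy_resched_flags[NR_TOY_CPUS];

/* Returns true if the caller would still have to send an IPI to @cpu. */
static bool toy_resched_cpu(unsigned int cpu)
{
	atomic_uint *flags = &toy_resched_flags[cpu];
	unsigned int old = atomic_load(flags);

	for (;;) {
		if (old & RESCHED_NEED_RESCHED)
			return false;	/* already pending; nothing to do */
		if (!(old & RESCHED_POLLING))
			return true;	/* target isn't polling: IPI it */
		/*
		 * Target claims to be polling: set the flag and re-check
		 * the polling bit in a single atomic step.  On failure,
		 * old is reloaded and we retry.
		 */
		if (atomic_compare_exchange_weak(flags, &old,
						 old | RESCHED_NEED_RESCHED))
			return false;	/* polling CPU will see the flag */
	}
}

The success path only relies on the idle loop clearing the polling bit and re-checking the need-resched bit on that same word before it actually goes to sleep.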
All that being said, it looks like ttwu_queue_remote doesn't actually
work if the IPI isn't sent. The attached patch appears to work (and
reduces total rescheduling IPIs by a large amount for my workload),
but I don't really think it's worthy of being applied...
--Andy
From 9dfa6a99e5eb5ab0bc3a4d6beb599ba0f2f633af Mon Sep 17 00:00:00 2001
Message-Id: <9dfa6a99e5eb5ab0bc3a4d6beb599ba0f2f633af.1392157722.git.luto@xxxxxxxxxxxxxx>
From: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
Date: Tue, 11 Feb 2014 14:26:46 -0800
Subject: [PATCH] sched: Try to avoid sending an IPI in ttwu_queue_remote
This is an experimental patch. It should probably not be applied.
Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
---
 kernel/sched/core.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a88f4a4..fc7b048 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1475,20 +1475,23 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
 }
 
 #ifdef CONFIG_SMP
-static void sched_ttwu_pending(void)
+static void __sched_ttwu_pending(struct rq *rq)
 {
-	struct rq *rq = this_rq();
 	struct llist_node *llist = llist_del_all(&rq->wake_list);
 	struct task_struct *p;
 
-	raw_spin_lock(&rq->lock);
-
 	while (llist) {
 		p = llist_entry(llist, struct task_struct, wake_entry);
 		llist = llist_next(llist);
 		ttwu_do_activate(rq, p, 0);
 	}
+}
 
+static void sched_ttwu_pending(void)
+{
+	struct rq *rq = this_rq();
+	raw_spin_lock(&rq->lock);
+	__sched_ttwu_pending(rq);
 	raw_spin_unlock(&rq->lock);
 }
 
@@ -1536,8 +1539,15 @@ void scheduler_ipi(void)
 
 static void ttwu_queue_remote(struct task_struct *p, int cpu)
 {
-	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
-		smp_send_reschedule(cpu);
+	struct rq *rq = cpu_rq(cpu);
+
+	if (llist_add(&p->wake_entry, &rq->wake_list)) {
+		unsigned long flags;
+
+		raw_spin_lock_irqsave(&rq->lock, flags);
+		resched_task(cpu_curr(cpu));
+		raw_spin_unlock_irqrestore(&rq->lock, flags);
+	}
 }
 
 bool cpus_share_cache(int this_cpu, int that_cpu)
@@ -2525,6 +2535,8 @@ need_resched:
 	smp_mb__before_spinlock();
 	raw_spin_lock_irq(&rq->lock);
 
+	__sched_ttwu_pending(rq);
+
 	switch_count = &prev->nivcsw;
 	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
 		if (unlikely(signal_pending_state(prev->state, prev))) {
--
1.8.5.3