Re: [PATCH] Fix tasks being forgotten for a long time on SMP

From: Peter Zijlstra
Date: Fri Sep 23 2016 - 04:15:17 EST


On Thu, Sep 22, 2016 at 12:06:03PM -0700, Yuriy Romanenko wrote:
> Hello,
>
> I just spent three hours studying TOT scheduler code, and I am leaning
> towards that this patch being no longer relevant (or being never
> relevant, really).
> If it ever fixed a real issue, I think it fixed it incorrectly. However, I would
> love to understand what exactly we were seeing and why this helped at all.
>
> So a little background on this. I found this change I made that I failed to
> upstream from a few years ago, so my memory is a little rusty. This
> was definitely an issue on the setup we had, which was an arch/arm
> mach-msm (now known as mach-qcom, I believe) fork of 3.4.0

Urgh, 3.4 is a _long_ time ago ;-) Let me check it out to remind myself
what it looked like.

> > WTH his SWFI and which arch has that?
>
> SWFI is "Stop and Wait For Interrupt". This appears to be a qcom
> specific term that they stopped using, so I apologize for that. It
> specifically refers to a CPU that is sitting in some sort of a "wait"
> or "wfi" instruction and not executing anything until an interrupt
> occurs.

Ah, ok. So I'm somewhat familiar with the ARM 'WFI' instruction and had
a vague idea this might be an ARM related part.

> > If the remote cpu is running the idle task, check_preempt_curr() should
> > very much wake it up, if its not the idle class, it should never get
> > there because there is now an actually runnable task on.
>
> > Please explain in detail what happens and/or provide traces of this
> > happening.
>
> We observed two relevant scenarios that are similar but distinct.
> Looking through top-of-tree code more carefully I don't see either as
> possible anymore, so the patch probably does not apply, but I would
> still defer to your expertise
>
> We have an SCHED_FIFO thread that falls asleep in a poll() waiting for
> an interrupt from a device driver.
>
> 1) It is the last thread on that CPU, the CPU enters idle and goes
> into a 'wfi' instruction because of cpuidle_idle_call(), and will not
> wake up until it receives an interrupt of some sort (like IPI). The
> interrupt occurs on a different CPU, the task is woken, but doesn't
> run until much later.
>
> 2) It is not the last thread on that CPU. The CPU switches to a lower
> priority task in a different class. The interrupt occurs on a
> different CPU, the task is woken, but doesn't preempt until much
> later.
>
> Unfortunately I don't have the ftrace logs from back then, and it
> would be hard to recreate the scenario.
>
> If I recall correctly, the problem seemed to occur because the task
> cpus_share_cache() was true, the task was never put on the wake_list,
> and correspondingly the scheduler_ipi() that was triggered by
> check_preempt_curr() did nothing, when it returned it would somehow
> wind up in schedule() where it did nothing because task was not
> TASK_RUNNING. Does that make any amount of sense at all?

So if cpus_share_cache() is true, we'll do the remote enqueue, that is,
the waking CPU will acquire the rq->lock of the CPU we want to wake the
task on and enqueue.

The code is like:

raw_spin_lock(rq->lock);
ttwu_do_activate(rq, p, 0);
ttwu_activate(...); // does the actual enqueue
ttwu_do_wakeup(...);
check_preempt_curr() // <-- this is supposed to issue the IPI

raw_spin_unlock(rq->lock);

In either scenario, 'current' is of a lower scheduling class (idle or
fair) than the newly woken task (fifo), so check_preempt_curr() should
find (class == p->sched_class) true and issue resched_task(rq->curr).

resched_task() in its turn tests TIF_NEED_RESCHED, if not set, it will
set it and issue an IPI (lets ignore the polling thing for now).

The IPI will execute scheduler_ipi(), and as you say, fail to find work
to do -- this is expected and OK.

_However_ on the return to user path (from interrupt), you're supposed
to check TIF_NEED_RESCHED and call schedule(). _THIS_ is what should
affect the actual task switch and get our freshly woken FIFO task
running.

Not to be confused with CONFIG_PREEMPT which should check
TIF_NEED_RESCHED on any return from interrupt path and call
preempt_schedule().