On Thu, Jun 06, 2013 at 02:29:49PM -0700, greearb@xxxxxxxxxxxxxxx wrote:

The counter also helps to keep the interrupted task interrupted a shorter period of time. 10 iterations may be a lot shorter than the 2 ms, or 10 ms with HZ=100, so it helps interactivity also. This is a good change to bring back in any case.

From: Ben Greear <greearb@xxxxxxxxxxxxxxx>...
The stop machine logic can lock up if all but one of
the migration threads make it through the disable-irq
step and the one remaining thread gets stuck in
__do_softirq. The reason __do_softirq can hang is
that it has a bail-out based on jiffies timeout, but
in the lockup case, jiffies itself is not incremented.
To work around this, re-add the max_restart counter in __do_softirq
and stop processing softirqs after 10 restarts.
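
The bail-out described above can be modelled in user space. This is only a
sketch under stated assumptions, not the kernel code: fake_jiffies,
pending_after_pass(), run_softirq_loop(), the 2-tick budget and the
safety_cap are hypothetical names and values chosen for illustration. Only
the idea of a restart counter capped at 10 comes from the description. The
tick counter is frozen on purpose to mimic the lockup case.

```c
#include <stdbool.h>

#define MAX_SOFTIRQ_RESTART 10  /* restart cap from the patch description */
#define MAX_SOFTIRQ_TIME    2   /* hypothetical time budget, in ticks */

/* Simulated tick counter. It is never incremented, which models the
 * lockup case where jiffies stops advancing. */
static unsigned long fake_jiffies = 0;

/* Hypothetical stand-in for "softirqs are still pending": a runaway
 * source re-raises work after every pass. */
static bool pending_after_pass(void)
{
    return true;
}

/* Runs the processing loop and returns how many passes were made before
 * bailing out. safety_cap exists only so the demo terminates when the
 * restart counter is disabled. */
static int run_softirq_loop(bool use_restart_counter)
{
    const unsigned long end = fake_jiffies + MAX_SOFTIRQ_TIME;
    const int safety_cap = 1000;
    int max_restart = MAX_SOFTIRQ_RESTART;
    int passes = 0;

    for (;;) {
        passes++;                       /* one pass over pending work */

        if (!pending_after_pass())
            break;                      /* nothing left to do */

        /* Time-based bail-out: never trips while the clock is frozen. */
        bool time_ok = fake_jiffies < end;

        /* Count-based bail-out: trips regardless of the clock. */
        bool restart_ok = !use_restart_counter || --max_restart > 0;

        if (time_ok && restart_ok && passes < safety_cap)
            continue;                   /* restart processing */
        break;
    }
    return passes;
}
```

In this model, run_softirq_loop(true) stops after 10 passes even though the
clock never moves, while run_softirq_loop(false) only stops because of the
artificial safety_cap.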
Thanks to Tejun Heo and Rusty Russell and others for
helping me track this down.
This was introduced in 3.9 by commit: c10d73671ad30f5469
(softirq: reduce latencies).
It may be worth looking into ath9k to see if it has issues with
its irq handler at a later date.
The hang stack traces look something like this:

Signed-off-by: Ben Greear <greearb@xxxxxxxxxxxxxxx>
Acked-by: Tejun Heo <tj@xxxxxxxxxx>
Linus, while this doesn't fix the root cause of the problem - softirq
runaway - I still think this is a worthwhile protection to have. Ben
is in the process of finding out why the softirq runaway happens in
the first place. We probably want to add a Cc: stable@xxxxxxxxxxxxxxx
tag.