Re: Bisected KVM hang on x86-32 between v3.12 and v3.13
From: Peter Zijlstra
Date: Mon Apr 07 2014 - 11:04:04 EST

On Sun, Apr 06, 2014 at 05:19:27PM +0200, Michele Ballabio wrote:
> Toralf Förster reported this in
> http://article.gmane.org/gmane.linux.kernel/1662567
> http://article.gmane.org/gmane.linux.kernel/1658422
> http://article.gmane.org/gmane.linux.kernel/1657962
>
> "The issue happens here at a 32 bit stable Gentoo Linux if
> I try to start a KVM image. Kernels 3.12.X works fine,
> kernel >= v3.13 will hang shortly after I started the image
> with the virtual-manager. The last syslog messages are
> something like:
> Feb 28 16:22:00 n22 kernel: INFO: rcu_sched detected stalls
> on CPUs/tasks: {} (detected by 2, t=60002 jiffies,
> g=14689, c=14688, q=21051)
> Feb 28 16:22:00 n22 kernel: INFO: Stall ended before state
> dump start"
>
> He correctly pointed out that the bisection blamed the merge
> commit 37bf06375c90a42fe07b9bebdb07bc316ae5a0ce
> "Merge tag 'v3.12-rc4' into sched/core".
>
> This bug is obviously caused by at least two patches, one
> on each side of the merge, that only when combined together
> (at that merge point) cause the bug in kvm. By rebasing
> the "sched/core" branch on "master" before the merge and
> going on with the bisection, I found commit
> 3e8e42c69bb7d9fc12ebc23ff308e8523a2a59a0
> "sched: Revert need_resched() to look at TIF_NEED_RESCHED"
> as one of the causes. The other patch that contributes to the
> bug is commit ded797547548a5b8e7b92383a41e4c0e6b0ecb7f
> "irq: Force hardirq exit's softirq processing on its own stack".
>
> Reverting either one of them solves the problem reported with kvm,
> but revert is probably not the correct answer.
>
> I wonder if the solution is as simple as this:
>
> --->8---
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 0af5250..f3b985d 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -126,6 +126,7 @@ config X86
> select RTC_LIB
> select HAVE_DEBUG_STACKOVERFLOW
> select HAVE_IRQ_EXIT_ON_IRQ_STACK if X86_64
> + select HAVE_IRQ_EXIT_ON_IRQ_STACK if X86_32
> select HAVE_CC_STACKPROTECTOR

Ohh ahh.. shiney!

So what I suspect at this point is that because i386 and x86_64 have a
difference in current_thread_info() (i386 is stack based), we end up
setting the TIF_NEED_RESCHED bit on the wrong stack.
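
A minimal userspace sketch of the suspected failure mode follows. It is not kernel code: the 8 KiB THREAD_SIZE, the bit number used for TIF_NEED_RESCHED, and the helper names thread_info_from_sp() and resched_seen_after_irq_stack_wakeup() are assumptions made for illustration. It only shows that when thread_info is located by masking the stack pointer, a flag set while running on a separate irq stack lands in that stack's thread_info and is invisible from the task's stack.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define THREAD_SIZE      8192UL  /* assumption: 8 KiB, THREAD_SIZE-aligned stacks */
#define TIF_NEED_RESCHED 3       /* illustrative bit number */

struct thread_info {
    unsigned long flags;
};

/* Stack-based lookup: thread_info lives at the lowest address of the stack. */
static struct thread_info *thread_info_from_sp(uintptr_t sp)
{
    return (struct thread_info *)(sp & ~(uintptr_t)(THREAD_SIZE - 1));
}

/* Returns 1 if the task sees NEED_RESCHED, 0 if the flag was lost. */
static int resched_seen_after_irq_stack_wakeup(void)
{
    unsigned char *task_stack = aligned_alloc(THREAD_SIZE, THREAD_SIZE);
    unsigned char *irq_stack = aligned_alloc(THREAD_SIZE, THREAD_SIZE);
    assert(task_stack && irq_stack);

    /* Pretend stack pointers somewhere inside each stack. */
    uintptr_t task_sp = (uintptr_t)task_stack + THREAD_SIZE - 64;
    uintptr_t irq_sp = (uintptr_t)irq_stack + THREAD_SIZE - 64;

    thread_info_from_sp(task_sp)->flags = 0;
    thread_info_from_sp(irq_sp)->flags = 0;

    /* A wakeup that runs on the irq stack sets the flag via the sp lookup. */
    thread_info_from_sp(irq_sp)->flags |= 1UL << TIF_NEED_RESCHED;

    /* Back on the task stack, the check reads a different thread_info. */
    int seen = (thread_info_from_sp(task_sp)->flags >> TIF_NEED_RESCHED) & 1;

    free(irq_stack);
    free(task_stack);
    return seen;
}
```

In this model the function returns 0, i.e. the reschedule request is never seen by the task. Whether the real i386 path behaves this way is exactly what is still being investigated here.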

Now I have some vague memories of propagating the TIF flags on stack
switch, but I cannot remember what arch we did that for. Let me stare at
this a little more.

Also, IFF this is the case, then the fingered patch above (and your
suggested 'fix') aren't the real culprit/cure but simply make it
more/less likely to happen.

Now, Steve had a patch somewhere that would make i386 use per-cpu
variables for current_thread_info(), just like x86_64 already does, I
think. Let me go find that too.
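
For contrast, here is a rough sketch of the per-cpu idea, again as a userspace model rather than the actual patch: cpu_current_ti[], switch_to_task() and current_thread_info_percpu() are hypothetical names, and a plain array stands in for a real per-cpu variable. The point is only that the lookup no longer depends on which stack the CPU happens to be running on.

```c
#include <stddef.h>

#define NR_CPUS          4   /* illustrative */
#define TIF_NEED_RESCHED 3   /* illustrative bit number */

struct thread_info {
    unsigned long flags;
};

/* Hypothetical stand-in for a per-cpu "current thread_info" pointer. */
static struct thread_info *cpu_current_ti[NR_CPUS];

/* Updated only on context switch, never on irq stack switch. */
static void switch_to_task(int cpu, struct thread_info *next)
{
    cpu_current_ti[cpu] = next;
}

static struct thread_info *current_thread_info_percpu(int cpu)
{
    return cpu_current_ti[cpu];
}

/* Returns 1 if a flag set "from irq context" is visible to the task. */
static int percpu_resched_seen(void)
{
    struct thread_info task_ti = { 0 };
    int cpu = 0;

    switch_to_task(cpu, &task_ti);

    /* Irq entry may switch stacks, but the per-cpu pointer is unchanged,
     * so the flag is set on the task's own thread_info. */
    current_thread_info_percpu(cpu)->flags |= 1UL << TIF_NEED_RESCHED;

    return (task_ti.flags >> TIF_NEED_RESCHED) & 1;
}
```

In this model the flag is visible to the task (the function returns 1), which is why a per-cpu lookup would sidestep the wrong-stack problem if that turns out to be the cause.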