Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

From: Hidetoshi Seto
Date: Mon Jun 22 2009 - 23:41:20 EST


Maciej Rutecki wrote:
> 2009/6/22 Andi Kleen <ak@xxxxxxxxxxxxxxx>:
>
>> Here's a debug patch for the poller: http://firstfloor.org/ak/mcp-debug
>> Can you apply that and try again and send me the output?
>>
>
> Dmesg after resume:
> http://unixy.pl/maciek/download/kernel/2.6.30-git17/pc/dmesg-2.6.30-git17-patch.txt
>
> System hangs when uptime is roughly 5-6 minutes (when I don't change
> check_interval). netconsole doesn't show anything.
>

The dmesg shows that mce_init() and mce_cpu_features() are called
twice on cpu0 within a short time:

[ 82.989005] mcp on cpu 0 flags 2 banks ecc39e70
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:502
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:506
[ 82.989005] bank 0
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] bank 1
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] bank 2
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] bank 3
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] bank 4
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] bank 5
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] mcp on cpu 0 finished
[ 82.989005] CPU0: Thermal LVT vector (0xfa) already installed
[ 82.989005] PM: Restoring platform NVS memory
[ 82.989005] mcp on cpu 0 flags 2 banks ecc39e70
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:502
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:506
[ 82.989005] bank 0
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] bank 1
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] bank 2
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] bank 3
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] bank 4
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] bank 5
[ 82.989005] [0] arch/x86/kernel/cpu/mcheck/mce.c:518
[ 82.989005] mcp on cpu 0 finished
[ 82.989005] CPU0: Thermal LVT vector (0xfa) already installed

mce_cpu_features() (which prints "Thermal ...") is always paired with
mce_init(), and is called only from mcheck_init() and mce_resume().

One of the two calls above would be from mce_resume(). If the other was
from mcheck_init(), then setup_timer() in mce_init_timer() will
reinitialize the timer while it is still pending and break it...
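
To illustrate what I mean by "break", here is a rough user-space sketch,
not kernel code. It assumes that reinitializing a timer clears its
"pending" marker without unlinking it from the queue it is on. All toy_*
names are made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space stand-in for a timer; NOT the kernel's struct timer_list. */
struct toy_timer {
	struct toy_timer *next;	/* NULL means "not pending" in this model */
	struct toy_timer *prev;
};

/* Toy pending queue: a circular doubly linked list with a sentinel head. */
struct toy_queue {
	struct toy_timer head;
};

static void toy_queue_init(struct toy_queue *q)
{
	q->head.next = &q->head;
	q->head.prev = &q->head;
}

/* Models (re)initialization: clears the links but does not unlink. */
static void toy_setup_timer(struct toy_timer *t)
{
	t->next = NULL;
	t->prev = NULL;
}

static bool toy_timer_pending(const struct toy_timer *t)
{
	return t->next != NULL;
}

/* Append the timer to the tail of the pending queue. */
static void toy_add_timer(struct toy_queue *q, struct toy_timer *t)
{
	t->prev = q->head.prev;
	t->next = &q->head;
	q->head.prev->next = t;
	q->head.prev = t;
}

/* True if the queue still links to t, whatever t itself believes. */
static bool toy_queue_links_to(const struct toy_queue *q,
			       const struct toy_timer *t)
{
	return q->head.next == t || q->head.prev == t;
}

/*
 * Replay the suspected sequence: init, arm, then init again while armed.
 * Returns true when the timer claims to be idle but is still queued.
 */
static bool toy_double_init_leaves_stale_link(void)
{
	struct toy_queue q;
	struct toy_timer t;

	toy_queue_init(&q);
	toy_setup_timer(&t);	/* first init (e.g. at boot) */
	toy_add_timer(&q, &t);	/* timer is now pending */
	toy_setup_timer(&t);	/* second init while still pending */

	return !toy_timer_pending(&t) && toy_queue_links_to(&q, &t);
}
```

In this model the queue still points at the timer after the second init,
while the timer itself reports that it is not pending, so whoever walks
the queue next runs into an entry with cleared links.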

[arch/x86/power/cpu.c]
> static void __restore_processor_state(struct saved_context *ctxt)
> {
> 	:
> #ifdef CONFIG_X86_32
> 	mcheck_init(&boot_cpu_data);
> #endif
> }

Hmm?

Maciej, could you try this patch?

Thanks,
H.Seto

===
[PATCH] x86: Fix mce resume on 32bit

Calling mcheck_init() on resume is required only with CONFIG_X86_OLD_MCE=y.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@xxxxxxxxxxxxxx>
---
arch/x86/power/cpu.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index d277ef1..b3d20b9 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -244,7 +244,7 @@ static void __restore_processor_state(struct saved_context *ctxt)
 	do_fpu_end();
 	mtrr_ap_init();
 
-#ifdef CONFIG_X86_32
+#ifdef CONFIG_X86_OLD_MCE
 	mcheck_init(&boot_cpu_data);
 #endif
 }
--
1.6.3
