Re: 2.6.30-git(16 and 17) system hangs after resume from suspend to disk, mce related?

From: Andi Kleen
Date: Mon Jun 22 2009 - 02:43:46 EST

Hidetoshi Seto wrote:
Maciej Rutecki wrote:
Also, "a few minutes" suggests something might be going wrong
with the poll handler. Does the problem still happen
if you use CONFIG_X86_NEW_MCE again, but before
resume do

echo 0 > /sys/devices/system/machinecheck/machinecheck0/check_interval

On the other hand you should get a crash very fast with

echo 1 > /sys/devices/system/machinecheck/machinecheck0/check_interval

I didn't follow the instructions above, but I found something else. After
a normal boot I tried:

echo 1 > /sys/devices/system/machinecheck/machinecheck0/check_interval

I found this in dmesg:

[ 141.704025] ------------[ cut here ]------------
[ 141.704039] WARNING: at arch/x86/kernel/cpu/mcheck/mce.c:1102

I see. At least this warning will be cleared by the following patch.
WARN_ON(smp_processor_id() != data);

But I'm not sure whether this can cause system hangs or not.

It might actually. If two different handlers run on the same CPU
they could re-add a timer twice, which might cause loops in the timer
list etc.
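
To illustrate the kind of corruption meant here, below is a rough userspace sketch. It is not the kernel timer code: struct node, list_add_unchecked() and the other helper names are made up for this example, and it only assumes that a pending timer sits on a circular doubly linked list. Queueing the same entry twice leaves it pointing at itself, so a walk of the list never gets back to the head.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for a circular doubly linked list (illustrative only). */
struct node {
    struct node *next, *prev;
};

static void list_init(struct node *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert n right after head, without checking whether n is already queued. */
static void list_add_unchecked(struct node *head, struct node *n)
{
    struct node *first = head->next;

    first->prev = n;
    n->next = first;
    n->prev = head;
    head->next = n;
}

/* Walk at most limit entries; true if the walk gets back to head. */
static bool list_walk_terminates(const struct node *head, size_t limit)
{
    const struct node *pos = head->next;

    while (limit-- > 0) {
        if (pos == head)
            return true;
        pos = pos->next;
    }
    return false;
}

/* Hypothetical scenario: two handlers both queue the same timer entry. */
static bool double_add_corrupts_list(void)
{
    struct node head, timer;

    list_init(&head);
    list_add_unchecked(&head, &timer);        /* first handler */
    assert(list_walk_terminates(&head, 16));  /* list is still sane */
    list_add_unchecked(&head, &timer);        /* second handler, same entry */

    /* timer.next now points at timer itself, so the walk never ends. */
    return !list_walk_terminates(&head, 16);
}
```

Calling double_add_corrupts_list() returns true in this model. The bounded walk stands in for whatever code iterates the list; without the bound it would spin forever, which would look like a hang.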

Maciej, can you test Seto-san's patch please?

BTW this is probably related to

commit eea08f32adb3f97553d49a4f79a119833036000a
Author: Arun R Bharadwaj <arun@xxxxxxxxxxxxxxxxxx>
Date: Thu Apr 16 12:16:41 2009 +0530

timers: Logic to move non pinned timers

It might also be useful to test whether reverting that patch makes
the problem go away. But with this patch we need the add_timer_on change.

