Re: [patch 2/2] x86: mce: Implement cmci poll mode for intelmachines

From: Thomas Gleixner
Date: Wed Jun 06 2012 - 06:23:37 EST


On Wed, 6 Jun 2012, Thomas Gleixner wrote:

> On Wed, 6 Jun 2012, Chen Gong wrote:
> > ä 2012/6/5 21:35, Thomas Gleixner åé:
> > I add some print in timer callback, it shows:
> >
> > smp_processor_id() = 0, mce_timer_fn data(CPU id) = 10
> > timer->function = ffffffff8102c200, timer pending = 1, CPU = 0
> > (add_timer_on, BUG!!!)
>
> Sure. That's not a surprise. The timer function for cpu 10 is called
> on cpu 0. And the timer function does:
>
> struct timer_list *t = &__get_cpu_var(mce_timer);
>
> which gets a pointer to the timer of cpu0. And that timer is
> pending. So yes, it's exploding for a good reason.
>
> Though, this does not tell us how the timer of cpu10 gets on cpu0.
>
> Did you do any cpu hotplug operations ?

There's a problem in the hotplug code.

case CPU_DOWN_PREPARE:
case CPU_DOWN_PREPARE_FROZEN:
del_timer_sync(t);
smp_call_function_single(cpu, mce_disable_cpu, &action, 1);
break;

We delete the timer before we disable mce and cmci. So if the cmci
interrupt kicks the timer after del_timer_sync() and before
mce_disable_cpu() is called on the other core, then the timer is still
enqueued when the cpu goes down. After it's dead the timer is migrated
and then the above scenario happens.

Can you try the following just for a quick test ?

case CPU_DOWN_PREPARE:
case CPU_DOWN_PREPARE_FROZEN:
del_timer_sync(t);
smp_call_function_single(cpu, mce_disable_cpu, &action, 1);
+ del_timer_sync(t);
break;

That's not a proper solution, for a proper solution the hotplug code
of mce needs an overhaul. It's patently ugly.

Thanks,

tglx