Re: x86/mce: machine check warning during poweroff

From: Srivatsa S. Bhat
Date: Wed Jan 18 2012 - 08:16:10 EST


On 01/18/2012 08:47 AM, Suresh Siddha wrote:

> On Tue, 2012-01-17 at 15:22 +0530, Srivatsa S. Bhat wrote:
>> Thanks for the patch, but unfortunately it doesn't fix the problem!
>> Exactly the same stack traces are seen during a CPU Hotplug stress test.
>> (I didn't even have to stress it - it is so fragile that just a script
>> to offline all cpus except the boot cpu was good enough to reproduce the
>> problem easily.)
>
> hmm, that's weird. with the patch, sched_ilb_notifier() should have
> cleared the cpu going offline from the nohz.idle_cpus_mask. And this
> should have happened after that cpu is removed from active mask. So
> no-one else should add that cpu back to the nohz.idle_cpus_mask and this
> should prevent the issue from happening.
>
> I could reproduce the problem easily without the patch but when I
> applied the patch I couldn't recreate the issue. Srivatsa, can you
> please re-check the kernel you tested indeed has the fix?
>
> Re-reviewing the code/patch also doesn't give me a hint.
>
>> I have a few questions regarding the synchronization with CPU Hotplug.
>> What guarantees that the code which selects and IPIs the new ilb is totally
>> race-free with respect to CPU hotplug and we will never IPI an offline CPU?
>
> So, nohz_balancer_kick() gets called only with interrupts disabled.
> During that time (from selecting the ilb_cpu to sending the IPI), no cpu
> can go offline, as the offline happens from the stop-machine process
> context with interrupts disabled.
>
> Only thing we need to make sure is the offlined cpu shouldn't be part of
> the nohz.idle_cpus_mask and for post 3.2 code, posted patch ensures
> that.
>
> For 3.2 and before, when a cpu exits tickless idle, it gets removed from
> the nohz.idle_cpus_mask (and also from the nohz.load_balancer). And if
> the cpu is not in the active mask (while going offline), subsequent
> calls to select_nohz_load_balancer() ensure that the cpu going down
> doesn't update the nohz structures. So I thought 3.2 shouldn't exhibit
> this problem.
>
>
>> (As demonstrated above, this issue is in 3.2-rc7
>> as well.)
>
> hmm, don't think we ran into this before 3.2. So, what am I missing from
> the above? I will try to reproduce it on 3.2 too.
>


I tested again on 3.2. I didn't hit those warnings (IPI to offline cpus).
It happens only in the post-3.2 kernel.
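
For reference, the invariant discussed above (an offlined cpu must not
remain in nohz.idle_cpus_mask, or it can be picked as the ilb and sent an
IPI) can be sketched as a toy model. This is only an illustration in
Python, not kernel code: the class, its methods, and the "lowest-numbered
idle cpu" selection rule are all assumptions made for the sketch.

```python
# Toy model of the nohz idle mask vs. CPU hotplug (NOT kernel code).
# All names here are hypothetical and exist only for illustration.

class NohzModel:
    def __init__(self, ncpus):
        self.online = set(range(ncpus))   # cpus that can receive an IPI
        self.active = set(range(ncpus))   # cpus still in the active mask
        self.idle_cpus_mask = set()       # cpus in tickless idle

    def enter_tickless_idle(self, cpu):
        # A cpu already removed from the active mask must not add itself back.
        if cpu in self.active:
            self.idle_cpus_mask.add(cpu)

    def exit_tickless_idle(self, cpu):
        self.idle_cpus_mask.discard(cpu)

    def offline(self, cpu, clear_mask=True):
        # Order matters: leave the active mask first, then clear the idle mask.
        self.active.discard(cpu)
        if clear_mask:
            self.idle_cpus_mask.discard(cpu)
        self.online.discard(cpu)

    def kick(self):
        # Assumed selection rule: pick the lowest-numbered idle cpu as the ilb.
        if not self.idle_cpus_mask:
            return None
        return min(self.idle_cpus_mask)


def ipi_target_is_online(model):
    """True if the cpu that would be kicked is online (or nobody is kicked)."""
    ilb = model.kick()
    return ilb is None or ilb in model.online
```

In this model, calling offline(cpu, clear_mask=False) on an idle cpu leaves
it in the mask, so a later kick() selects an offline cpu. With the mask
cleared after the cpu leaves the active mask, that cannot happen.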

Regards,
Srivatsa S. Bhat
IBM Linux Technology Center
