RE: [PATCH v2] x86/mce: Distribute the clear operation of mces_seen to Per-CPU rather than only monarch CPU

From: Luck, Tony
Date: Wed May 21 2014 - 17:09:52 EST


>> mce_reign, which is called only by the monarch CPU, can be used to make
>> the system panic as quickly as possible if there is a truly data-corrupting
>> error. But the monarch CPU doesn't have to clear mces_seen on behalf of
>> all the other CPUs. One advantage of per-CPU data is the isolation of
>> error propagation, so why don't we clear mces_seen per CPU?
>
> What kind of error propagation are you expecting or concerned about here?
> Could you explain the problem in more detail?

Please do give us more detail on the scenario that you see that would
make your new version behave better.

I'm sure the current code has no races w.r.t. clearing mces_seen. The
monarch clears them all in mce_reign() before clearing mce_executing
at the foot of mce_end() and allowing the others to run again.
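
A minimal userspace sketch of that ordering, in case it helps. This is
a single-threaded model, not the kernel code: NCPUS, struct mce_model
and all the helper names are made up for illustration, and a plain int
stands in for the atomic mce_executing counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NCPUS 4	/* hypothetical CPU count for the model */

struct mce_model { unsigned long long status; };	/* stand-in for struct mce */

static struct mce_model mces_seen[NCPUS];	/* models the per-CPU mces_seen */
static int mce_executing;			/* models the atomic counter */

/* Each CPU records what it saw and joins the rendezvous. */
static void cpu_scan(int cpu, unsigned long long status)
{
	mces_seen[cpu].status = status;
	mce_executing++;
}

/* Models the tail of mce_reign(): the monarch wipes every CPU's record. */
static void monarch_clear_all(void)
{
	memset(mces_seen, 0, sizeof(mces_seen));
}

/* Models the monarch path through mce_end(): clear first, release second. */
static void monarch_end(void)
{
	monarch_clear_all();
	mce_executing = 0;	/* only now may the subjects leave the spinloop */
}

/* A subject may leave the spinloop only once mce_executing is zero. */
static bool subject_released(void)
{
	return mce_executing == 0;
}

/* True if any CPU still holds a record from this machine check. */
static bool any_stale(void)
{
	for (int cpu = 0; cpu < NCPUS; cpu++)
		if (mces_seen[cpu].status)
			return true;
	return false;
}

/* By the time a subject is released, its own record is already clean. */
static void subject_leave(int cpu)
{
	assert(subject_released());
	assert(mces_seen[cpu].status == 0);
}
```

In this model no subject can observe a non-zero mces_seen after it is
released, because the clear happens strictly before the release.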

Your code has the monarch release all the other cpus from the spinloop
in mce_end() so they will all rush together through the final lines of
do_machine_check(). Some of them will have work to do if they saw
errors - they may have to send signals, or log the error. Others can
fly directly to the end of do_machine_check() and clear MCG_STATUS
and return to executing whatever code was interrupted.

So it is possible that some processors will be out doing things that can
generate another machine check before others have finished their
tasks and reached the point where they clear mces_seen. (*)

-Tony

(*) maybe that doesn't matter because they haven't zeroed MCG_STATUS
yet - so this second machine check will force those cpus to shut down. See
the MCIP description in section "15.3.1.2 IA32_MCG_STATUS MSR" of the Intel
software developer's manual.
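
A rough model of the MCIP behaviour described in that section, as a
sketch only. The bit positions follow the SDM (RIPV bit 0, EIPV bit 1,
MCIP bit 2); struct cpu_model, enum mc_outcome and the two helpers are
hypothetical names, and handler_exit() stands in for the handler
writing 0 to MCG_STATUS at the end of do_machine_check().

```c
#include <assert.h>
#include <stdint.h>

#define MCG_STATUS_RIPV	(1ULL << 0)	/* restart IP valid */
#define MCG_STATUS_EIPV	(1ULL << 1)	/* error IP valid */
#define MCG_STATUS_MCIP	(1ULL << 2)	/* machine check in progress */

enum mc_outcome { MC_DELIVERED, MC_SHUTDOWN };

/* Model of one logical CPU's IA32_MCG_STATUS register. */
struct cpu_model { uint64_t mcg_status; };

/* Hardware side: a machine-check event arrives on this CPU. */
static enum mc_outcome raise_machine_check(struct cpu_model *c)
{
	if (c->mcg_status & MCG_STATUS_MCIP)
		return MC_SHUTDOWN;	/* second event while MCIP is still set */
	c->mcg_status |= MCG_STATUS_MCIP;
	return MC_DELIVERED;
}

/* Handler side: zero MCG_STATUS on the way out of the handler. */
static void handler_exit(struct cpu_model *c)
{
	assert(c->mcg_status & MCG_STATUS_MCIP);	/* only valid inside a #MC */
	c->mcg_status = 0;
}
```

A cpu that has not yet run handler_exit() gets MC_SHUTDOWN on the next
event in this model, which is the case the footnote is about.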