Re: [PATCH v2] x86/mce: Distribute the clear operation of mces_seen to Per-CPU rather than only monarch CPU

From: Tony Luck
Date: Fri May 23 2014 - 18:41:06 EST


On Fri, May 23, 2014 at 4:57 AM, Chen Yucong <slaoub@xxxxxxxxx> wrote:
> If (mca_cfg.tolerant == 2 || mca_cfg.tolerant == 3), what can you do for
> it?

Maybe we need to look again at the effects of "tolerant" - and maybe
specify what happens at various levels. There are some obvious
silly bits of code (picking one that is my fault):

	if (cfg->tolerant < 3) {
		if (no_way_out)
			mce_panic("Fatal machine check on current CPU", &m, msg);
		if (worst == MCE_AR_SEVERITY) {
			/* schedule action before return to userland */
			mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
			set_thread_flag(TIF_MCE_NOTIFY);
		} else if (kill_it) {
			force_sig(SIGBUS, current);
		}
	}

Why is the MCE_AR_SEVERITY recovery code not even attempted
if tolerant is >=3? That block of code dates back to before there were
any recoverable cases ... so the insane option of just ignoring the error
and hoping that the end result wasn't too bad made some sort of sense
when compared against a machine crash and not getting any answer at
all.
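
To make the effect concrete, here is a minimal user-space model of the
control flow in the block quoted above. It is only a sketch: the enum
names and quoted_block_action() are hypothetical stand-ins, not kernel
symbols, and the real code calls mce_panic(), mce_save_info() and
force_sig() instead of returning a value.

```c
#include <stdbool.h>

/* Hypothetical labels for what the quoted block ends up doing. */
enum mce_action {
	ACT_NONE,              /* nothing in the block runs */
	ACT_PANIC,             /* stands in for mce_panic() */
	ACT_SCHEDULE_RECOVERY, /* stands in for mce_save_info() + TIF_MCE_NOTIFY */
	ACT_SIGBUS,            /* stands in for force_sig(SIGBUS, current) */
};

/* Stand-in for the kernel's severity value; only "AR or not" matters here. */
enum severity { SEV_OTHER, SEV_AR };

/*
 * Mirrors the shape of the quoted block. The panic case returns early
 * because this model treats the panic as final.
 */
static enum mce_action quoted_block_action(int tolerant, bool no_way_out,
					   enum severity worst, bool kill_it)
{
	if (tolerant < 3) {
		if (no_way_out)
			return ACT_PANIC;
		if (worst == SEV_AR)
			return ACT_SCHEDULE_RECOVERY;
		if (kill_it)
			return ACT_SIGBUS;
	}
	/* tolerant >= 3: every branch above is skipped, recovery included. */
	return ACT_NONE;
}
```

With tolerant set to 1, an action-required error gives
ACT_SCHEDULE_RECOVERY. With tolerant set to 3, the same error gives
ACT_NONE, which is the case being questioned here.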

Or one that Andi pointed out years ago (and had a fix in a tree for):

	if (order == 1) {
		/* CHECKME: Can this race with a parallel hotplug? */
		int cpus = num_online_cpus();

		/*
		 * Monarch: Wait for everyone to go through their scanning
		 * loops.
		 */
		while (atomic_read(&mce_executing) <= cpus) {

What if some cpus were offline when this machine check arrived?
Our "offline" code doesn't do anything to the h/w to prevent those
cpus from joining in the machine check fun. So we'll see more than
num_online_cpus() processors arrive to process the machine check.
Andi's fix was in the start of do_machine_check() and just had each
cpu that showed up check whether it was listed as "online" by Linux.
If not, it just cleared MCG_STATUS and returned. I didn't apply it
because I thought we needed to be a bit more robust (what if the offline
cpu actually did have a problem? ... we should at least check that
MCG_STATUS.RIPV=1 before rashly returning ... perhaps even more
tests are needed if the cpu had never been online at all).
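
The entry check described above, with the extra RIPV test, might look
roughly like the following user-space sketch. This is not Andi's patch
and not kernel code: classify_mce_entry() and the enum are hypothetical
names, and the "escalate" outcome is left abstract because what to do in
that case is exactly the open question. The only fact taken from the
hardware definition is that RIPV is bit 0 of MCG_STATUS.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of MCG_STATUS: restart IP valid. */
#define MCG_STATUS_RIPV (1ULL << 0)

/* Hypothetical outcomes for a cpu arriving in the machine check handler. */
enum mce_entry_action {
	MCE_ENTRY_PROCEED,          /* online cpu: normal handling */
	MCE_ENTRY_CLEAR_AND_RETURN, /* offline cpu, RIPV=1: clear MCG_STATUS, return */
	MCE_ENTRY_ESCALATE,         /* offline cpu, RIPV=0: do not return quietly */
};

/*
 * Decide what a cpu should do on entry. cpu_online models whether Linux
 * lists this cpu as online; mcgstatus models the value read from
 * MCG_STATUS on that cpu.
 */
static enum mce_entry_action classify_mce_entry(bool cpu_online,
						uint64_t mcgstatus)
{
	if (cpu_online)
		return MCE_ENTRY_PROCEED;

	/* Offline cpu: only leave early if execution can safely resume. */
	if (mcgstatus & MCG_STATUS_RIPV)
		return MCE_ENTRY_CLEAR_AND_RETURN;

	return MCE_ENTRY_ESCALATE;
}
```

Any further tests for a cpu that was never online at all would be
additional conditions ahead of the MCE_ENTRY_CLEAR_AND_RETURN case.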

So I'm happy that you are taking an interest in machine check code.
I think there are places where it can be made a lot better. I don't
think that moving where mces_seen gets cleared is one of those
places.

-Tony