Re: [PATCH 0/3] Fix MCE handling for AMD multi-node processors

From: Borislav Petkov
Date: Wed Jan 07 2015 - 12:07:09 EST


On Tue, Jan 06, 2015 at 05:54:15PM -0600, Aravind Gopalakrishnan wrote:
> Hi Boris,
>
> It seems my earlier understanding of the hardware behavior was not
> completely right. Here are some clarifications I received after some
> internal discussion:
>
> When D18F3x44[NBMstToMstCpuEn] is set, the interrupt is also routed to
> the NBC.

Good :)

> This was not immediately clear to me from the description of the field
> in the BKDG. The BKDG states that errors are reported to the NBC and
> also that the status, addr and ctl MSRs for MC4 are only accessible
> from the NBC. I took this to mean that the error info is written to the
> NBC MSRs while the #MC could be generated from a non-NBC core.
>
> Now, given that setting NBMstToMstCpuEn ensures #MC is generated only
> on the NBC for MC4 errors, we don't have a problem to solve in the #MC
> handler code. So we can discard patch 2 of the series.
>
> But we still need to change the error injection interfaces in
> mce_amd_inj: mce_amd_inj triggers a #MC on the CPU number that the user
> specifies in debugfs. For any error other than MC4 errors, this is fine.
> But we should really be triggering #MC only on the NBC for MC4 errors.

Why?

As you said yourself, the errors get reported on the NBC. Where they get
*triggered* is a different story.

We do injection as it is described in "2.15.2 Error Injection and
Simulation" in the F15h BKDG, for example. Reporting of the thusly
injected bank4 error goes to the NBC.
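
To make the trigger-vs-report distinction concrete, here is a toy model.
It is not kernel code and not the mce_amd_inj implementation: the
8-cores-per-node topology, the assumption that the NBC is the
lowest-numbered core of a node, and the helper names node_base_core()
and reporting_cpu() are all made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical topology, for illustration only. */
enum {
	CORES_PER_NODE = 8,	/* assumed, not taken from the BKDG */
	NB_BANK        = 4,	/* MC4 is the northbridge bank */
};

/*
 * Assumption: the node base core (NBC) is the lowest-numbered
 * core on the node that @cpu belongs to.
 */
static int node_base_core(int cpu)
{
	return cpu - (cpu % CORES_PER_NODE);
}

/*
 * Return the CPU on which an error injected on @trigger_cpu gets
 * reported. With the "report to master CPU" bit modelled by
 * @mst_cpu_en, bank 4 errors land on the NBC no matter where the
 * injection was triggered. All other banks are reported on the
 * triggering CPU itself.
 */
static int reporting_cpu(int trigger_cpu, int bank, bool mst_cpu_en)
{
	if (bank == NB_BANK && mst_cpu_en)
		return node_base_core(trigger_cpu);

	return trigger_cpu;
}
```

In this model, reporting_cpu(11, 4, true) is 8, the NBC of the second
node, while reporting_cpu(11, 0, true) stays at 11. That is the point:
where the user triggers the injection does not change where a bank4
error ends up being reported.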

I don't see the need to fix anything in the code as it is.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.