Re: [PATCH -v3 3/4] MCE, CE: Wire in the CE collector

From: Max Asbock
Date: Mon Jul 07 2014 - 13:00:23 EST


On Tue, 2014-07-01 at 21:23 +0200, Borislav Petkov wrote:
> From: Borislav Petkov <bp@xxxxxxx>
>
> Add the CE collector to the polling path which collects the correctable
> errors. Collect only DRAM ECC errors for now.
>
> Signed-off-by: Borislav Petkov <bp@xxxxxxx>
> ---
> arch/x86/kernel/cpu/mcheck/mce.c | 84 ++++++++++++++++++++++++++++++++++++----
> 1 file changed, 76 insertions(+), 8 deletions(-)
>

> +
> +static void __log_ce(struct mce *m, enum mcp_flags flags)
> +{
> + /*
> + * Don't get the IP here because it's unlikely to have anything to do
> + * with the actual error location.
> + */

The above comment doesn't belong here. This function is about how to
dispose of corrected errors, not what data should be collected.
(Besides this comment is obsolete as the IP is always collected in
mce_gather_info()).

> + if ((flags & MCP_DONTLOG) || mca_cfg.dont_log_ce)
> + return;
> +

> + if (dram_ce_error(m)) {
> + /*
> + * In the cases where we don't have a valid address after all,
> + * do not collect but log.
> + */
> + if (!(m->status & MCI_STATUS_ADDRV))
> + goto log;
> +
> + mce_ring_add(&__get_cpu_var(ce_ring), m->addr >> PAGE_SHIFT);
> + return;
> + }
> +
> +log:
> + mce_log(m);
> +}
The above code is a bit convoluted, it amounts to:

if (we have a corrected dram error && we have an address for it)
mce_ring_add()
else
mcelog()

Is that the intention? This might be problematic for downstream
consumers of the errors such as the EDAC drivers which keep counts of
errors. If errors are silently removed from the stream these counts will
be bogus. Somebody might wonder why a page was off-lined while the EDAC
driver reports zero corrected DRAM error counts.

- Max


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/