Re: [PATCH 3/6] x86/mce: Add support for new MCA_SYND register
From: Borislav Petkov
Date: Fri Jul 08 2016 - 06:48:40 EST
On Fri, Jul 08, 2016 at 12:26:48PM +0200, Ingo Molnar wrote:
> So is 'ECC syndrome' a fancy word and a complicated process for
> identifying what data got corrupted, in a more accurate fashion than
> what we had before?
The syndrome has always been there - ever since K8 at least. This patch
is simply adding the change that on SMCA systems it should be read from
a different MSR.
The syndrome is part of the magic math behind Error Correction Codes
which can be used to point to which bits in the word in that memory
address were flipped.
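To make that less magic, here is a toy sketch in C. It uses a Hamming(7,4) code purely for illustration; real DRAM ECC uses much wider codes, and the function names below are made up for this mail, not taken from the kernel. The point is only that a non-zero syndrome names the flipped bit:

```c
#include <stdint.h>

/*
 * Toy Hamming(7,4) code, for illustration only.
 * Codeword bits are numbered 1..7 and live in bits 1..7 of a uint8_t
 * (bit 0 is unused). Parity bits sit at positions 1, 2 and 4.
 */
static uint8_t hamming74_syndrome(uint8_t cw)
{
	uint8_t synd = 0;
	int pos;

	/* XOR together the positions of all set bits. */
	for (pos = 1; pos <= 7; pos++)
		if (cw & (1u << pos))
			synd ^= pos;

	/* 0 means no error, anything else is the flipped bit's position. */
	return synd;
}

static uint8_t hamming74_encode(uint8_t data)
{
	uint8_t cw = 0;
	uint8_t synd;

	/* The four data bits go to positions 3, 5, 6 and 7. */
	if (data & 1) cw |= 1u << 3;
	if (data & 2) cw |= 1u << 5;
	if (data & 4) cw |= 1u << 6;
	if (data & 8) cw |= 1u << 7;

	/* Set the parity bits so that the codeword's syndrome becomes 0. */
	synd = hamming74_syndrome(cw);
	if (synd & 1) cw |= 1u << 1;
	if (synd & 2) cw |= 1u << 2;
	if (synd & 4) cw |= 1u << 4;

	return cw;
}
```

Encode a nibble, flip any single bit of the codeword, and hamming74_syndrome() returns the position of that bit.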
OOOOh wait a minute!
I'm just getting the sickest idea:
@Yazen, is the syndrome max 16 bits on SMCA? Because if so - and I
would bet good money it is so - then we can stuff it into its old place
in the MCI_STATUS register part of struct mce, i.e. mce->status.
And then you won't need to touch the tracepoint and any of that.
Because you do:
	rdmsrl(MSR_AMD64_SMCA_MCx_SYND(bank), m.synd);
and I'll venture a good guess that not all 64 bits of that MSR are the
syndrome.
Right?
If I'm right, all those patches adding syndrome support need to be
reworked.
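Roughly what I have in mind, as an untested sketch. The helper names are hypothetical, and the bit positions are an assumption modeled on the legacy DRAM ECC layout (syndrome[7:0] in status bits 54:47, syndrome[15:8] in bits 31:24); they need to be checked against the BKDG before anyone relies on them:

```c
#include <stdint.h>

/* Assumed legacy layout, verify against the BKDG for the family in question. */
#define SYND_LO_SHIFT	47	/* syndrome[7:0]  -> status[54:47] */
#define SYND_HI_SHIFT	24	/* syndrome[15:8] -> status[31:24] */

/* Hypothetical helper: place a 16-bit syndrome into a status value. */
static uint64_t status_set_syndrome(uint64_t status, uint16_t synd)
{
	/* Clear both syndrome fields, leave all other status bits alone. */
	status &= ~((0xffULL << SYND_LO_SHIFT) | (0xffULL << SYND_HI_SHIFT));

	status |= (uint64_t)(synd & 0xff) << SYND_LO_SHIFT;
	status |= (uint64_t)(synd >> 8) << SYND_HI_SHIFT;

	return status;
}

/* Hypothetical helper: read the 16-bit syndrome back out. */
static uint16_t status_get_syndrome(uint64_t status)
{
	return ((status >> SYND_LO_SHIFT) & 0xff) |
	       (((status >> SYND_HI_SHIFT) & 0xff) << 8);
}
```

If the SMCA syndrome really fits in 16 bits, consumers of mce->status would keep working unchanged and struct mce would not need a new field.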
> Because previously we already had a memory address of the memory
> corruption, right?
We've always had the address and the syndrome. The syndrome is in
MCI_STATUS on older machines.
> What is the typical 'scope' of that memory corruption address - a
> cache line, a machine word, a byte or maybe a variable unit that is
> memory hardware dependent?
Typically 128 bits, as the example above shows. The syndrome covers the
whole 128 bits. AFAIR(!), DRAM accesses are always done in 128-bit words
even if less is being read. All nicely hidden by the DRAM controller.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.