RE: [PATCH] x86/MCE, EDAC/mce_amd: Save all aux registers on SMCA systems

From: Ghannam, Yazen
Date: Fri Apr 20 2018 - 09:05:24 EST


> -----Original Message-----
> From: Borislav Petkov <bp@xxxxxxxxx>
> Sent: Wednesday, April 18, 2018 1:14 PM
> To: Ghannam, Yazen <Yazen.Ghannam@xxxxxxx>
> Cc: linux-edac@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> tony.luck@xxxxxxxxx; x86@xxxxxxxxxx
> Subject: Re: [PATCH] x86/MCE, EDAC/mce_amd: Save all aux registers on
> SMCA systems
>
> On Tue, Apr 17, 2018 at 06:30:34PM +0000, Ghannam, Yazen wrote:
> > We could but it's an issue of documentation and testing the older systems.
> >
> > My first pass at this was to unconditionally read the registers because my
> > understanding was that registers that aren't accessible would be
> > read-as-zero. I thought this was a common MCA implementation. But Tony
> > pointed out that this isn't the case on Intel systems. This is the case on
> > recent AMD systems. But I don't know if it's the case on older systems
> > which may or may not have followed the Intel implementation more closely.
>
> So if our worry is the #GPs, we can always use the rdmsr*_safe()
> variants and look at the return value. And dump an invalid value like
> 0xdeadbeef or so, if the read failed.
>
> But if any bit of info we get this way helps us debug an MCE,
> we're already golden!
>

Okay, I can do that. What about using mce_rdmsrl()? If the read fails, the
value gets set to 0 and the user gets a single warning. This may be clearer to
the user. Also, it shouldn't affect code that checks for non-zero values, like
in __print_mce().
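
To make the trade-off concrete, here is a rough userspace sketch of the two
options. It is only a model: fake_rdmsr_safe(), the MSR numbers, and the
helper names are made up for illustration and are not kernel APIs. The only
things taken from the thread are the 0xdeadbeef marker, the "set to 0 and warn
once" behavior, and the idea that some code only prints non-zero values.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define POISON_VALUE 0xdeadbeefULL	/* marker value suggested in the thread */

/* Hypothetical stand-in for an MSR; 'readable' models whether a read faults. */
struct fake_msr {
	uint32_t addr;
	bool readable;
	uint64_t value;
};

static const struct fake_msr fake_msrs[] = {
	{ 0x1000, true,  0x1234 },	/* made-up register that exists */
	{ 0x1001, false, 0 },		/* made-up register whose read would #GP */
};

/* Models a "safe" read: 0 on success, non-zero if the read would fault. */
static int fake_rdmsr_safe(uint32_t addr, uint64_t *val)
{
	for (size_t i = 0; i < sizeof(fake_msrs) / sizeof(fake_msrs[0]); i++) {
		if (fake_msrs[i].addr == addr && fake_msrs[i].readable) {
			*val = fake_msrs[i].value;
			return 0;
		}
	}
	return -1;
}

/* Option A: check the return value and report a poison marker on failure. */
static uint64_t read_aux_poison(uint32_t addr)
{
	uint64_t val;

	if (fake_rdmsr_safe(addr, &val))
		return POISON_VALUE;
	return val;
}

static int warnings_printed;

/* Option B: report 0 on failure and warn only the first time. */
static uint64_t read_aux_zero(uint32_t addr)
{
	uint64_t val;

	if (fake_rdmsr_safe(addr, &val)) {
		if (!warnings_printed) {
			fprintf(stderr, "unable to read MSR 0x%x\n", (unsigned int)addr);
			warnings_printed++;
		}
		return 0;
	}
	return val;
}

/* Models a consumer that only prints a register when it is non-zero. */
static bool would_print(uint64_t val)
{
	return val != 0;
}
```

With option A, would_print() is true for the marker, so a failed read shows up
in the output as if it were data. With option B, a failed read looks the same
as a register that reads as zero, and existing non-zero checks keep working.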

> > For example,
> >
> > Deferred error occurs:
> > - MCA_{STATUS,ADDR,DESTAT,DEADDR} all have valid data.
> >
> > MCE occurs:
> > - MCA_{STATUS,ADDR} are overwritten with non-zero data.
> > - MCE handler clears MCA_STATUS. MCA_ADDR is non-zero.
> >
> > DFR handler finds MCA_STATUS[Deferred] is clear, so it saves
> > MCA_DESTAT and MCA_DEADDR which is 0.
> >
> > If !m->addr (which has MCA_DEADDR), then we read MCA_ADDR
> > which has the address from the MCE.
>
> The code could use a shorter version of this as a comment to state why
> we're doing it. Because it is not obvious.
>

Yes, will do.
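
For reference, a minimal userspace model of the sequence quoted above. The
struct and function names are hypothetical, the Deferred bit position is
assumed to match the kernel's MCI_STATUS_DEFERRED (bit 44), and the fallback
is assumed to read the address register. It is a sketch of the reasoning, not
the patch itself.

```c
#include <stdint.h>

/* Assumed position of the Deferred bit in MCA_STATUS. */
#define STATUS_DEFERRED (1ULL << 44)

/* Hypothetical snapshot of one bank's registers. */
struct fake_bank {
	uint64_t status;	/* MCA_STATUS */
	uint64_t addr;		/* MCA_ADDR   */
	uint64_t destat;	/* MCA_DESTAT */
	uint64_t deaddr;	/* MCA_DEADDR */
};

/* Hypothetical record of what the deferred error handler saves. */
struct logged_error {
	uint64_t status;
	uint64_t addr;
};

static struct logged_error log_deferred(const struct fake_bank *b)
{
	struct logged_error e;

	if (b->status & STATUS_DEFERRED) {
		/* MCA_STATUS still describes the deferred error. */
		e.status = b->status;
		e.addr = b->addr;
	} else {
		/* MCA_STATUS was overwritten or cleared: use the DE* copies. */
		e.status = b->destat;
		e.addr = b->deaddr;
	}

	/* If the saved address is 0, fall back to whatever MCA_ADDR holds. */
	if (!e.addr)
		e.addr = b->addr;

	return e;
}
```

In the scenario above, the bank would have status == 0, a non-zero addr left
by the MCE, a valid destat, and deaddr == 0, so the logged address ends up
being the value from MCA_ADDR.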

Thanks,
Yazen