RE: [PATCH 1/2] Revert "x86/mce/AMD: Collect error info even if valid bits are not set"

From: Ghannam, Yazen
Date: Mon Mar 26 2018 - 15:59:01 EST


> -----Original Message-----
> From: linux-edac-owner@xxxxxxxxxxxxxxx <linux-edac-owner@xxxxxxxxxxxxxxx>
> On Behalf Of Borislav Petkov
> Sent: Monday, March 26, 2018 3:31 PM
> To: Ghannam, Yazen <Yazen.Ghannam@xxxxxxx>
> Cc: linux-edac@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> tony.luck@xxxxxxxxx; x86@xxxxxxxxxx
> Subject: Re: [PATCH 1/2] Revert "x86/mce/AMD: Collect error info even if
> valid bits are not set"
>
> On Mon, Mar 26, 2018 at 02:15:25PM -0500, Yazen Ghannam wrote:
> > From: Yazen Ghannam <yazen.ghannam@xxxxxxx>
> >
> > This reverts commit 4b1e84276a6172980c5bf39aa091ba13e90d6dad.
> >
> > Software uses the valid bits to decide if the values can be used for
> > further processing or other actions. So setting the valid bits will have
> > software act on values that it shouldn't be acting on.
> >
> > The recommendation to save all the register values does not mean that
> > the values are always valid.
>
> So what does that
>
> "Error handlers should save the values in MCA_ADDR, MCA_MISC0,
> and MCA_SYND even if MCA_STATUS[AddrV], MCA_STATUS[MiscV], and
> MCA_STATUS[SyndV] are zero."
>
> *actually* mean then?
>
> It is still in the PPR.
>

It means we should always save as much of the error state as we can, even
if we can't act on it. Basically, we never want to lose information in
case of some unforeseen issue in the reporting mechanisms or elsewhere.
No known issue requires this at the moment, but I think the design folks
are being conservative in ensuring that all possible data is collected.

So at a minimum, we should always save and report as much as we can.
But we don't try any recovery actions unless we're sure the data is valid.
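
To illustrate the split between saving and acting, here is a minimal
sketch. It is not the actual kernel code: the struct and both function
names are hypothetical, and the MCA_STATUS bit positions are written from
memory of the kernel headers, so treat them as assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* MCA_STATUS valid bits (assumed positions, for illustration only). */
#define MCI_STATUS_VAL   (1ULL << 63)
#define MCI_STATUS_MISCV (1ULL << 59)
#define MCI_STATUS_ADDRV (1ULL << 58)
#define MCI_STATUS_SYNDV (1ULL << 53)

/* Hypothetical record of everything read from one MCA bank. */
struct mca_snapshot {
	uint64_t status;
	uint64_t addr;
	uint64_t misc;
	uint64_t synd;
};

/*
 * Save every register value unconditionally. The status word is stored
 * exactly as read: no valid bit is set or cleared here.
 */
static struct mca_snapshot mca_save(uint64_t status, uint64_t addr,
				    uint64_t misc, uint64_t synd)
{
	struct mca_snapshot s = { status, addr, misc, synd };

	return s;
}

/*
 * Gate for recovery actions: the saved address may only be acted on
 * when hardware marked both the bank and the address as valid.
 */
static bool mca_addr_usable(const struct mca_snapshot *s)
{
	return (s->status & MCI_STATUS_VAL) &&
	       (s->status & MCI_STATUS_ADDRV);
}
```

With this shape, a record whose AddrV bit is clear still carries the raw
MCA_ADDR value for reporting, but mca_addr_usable() returns false, so no
recovery path would consume it.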

Thanks,
Yazen