Re: [PATCH v5 2/2] x86/mce: Add support for Extended Physical Address MCA changes

From: Yazen Ghannam
Date: Thu Jun 09 2022 - 15:19:46 EST


On Fri, Apr 15, 2022 at 09:37:57AM -0700, Luck, Tony wrote:
> On Fri, Apr 15, 2022 at 02:56:54PM +0000, Yazen Ghannam wrote:
> > 3) OS, or optionally BIOS, polls MCA banks and logs any valid errors.
> > a) Since MCi_CTL, etc. are cleared due to reset, any errors detected are
> > from before the reset.
>
> On Intel not quite any error. H/w can still log to a bank but MCi_STATUS.EN bit
> will be zero. We've also had some BIOS code that did things that logged errors
> and then left them for the OS to find during boot.
>
> But this sequence does give more confidence that errors found in banks duing
> boot are "old".
>
> > I agree. The Intel SDM and AMD APM have the following procedure, in summary.
> >
> > 1) Set MCG_CTL
> > 2) Set MCi_CTL for all banks
> > 3) Read MCi_STATUS and log valid errors.
> > 4) Clear MCi_STATUS
> > 5) Set CR4.MCE
>
> Yes. That's what the pseudo-code in Intel SDM Example 15-1 says :-(
> >
> > I don't know of a reason why STATUS needs to be cleared after MCi_CTL is set.
> > The only thing I can think of is that enabling MCi_CTL may cause spurious info
> > logged in MCi_STATUS, and that needs to be cleared out. I'm asking AMD folks
> > about it.
> >
> > Of course, this contradicts the flow I outlined above, and also the flow given
> > in the AMD Processor Programming Reference (PPR). I wonder if the
> > architectural documents have gotten stale compared to current guidelines. I'm
> > asking about this too.
>
> I will ask architects about this sequence too.
>

Hi everyone,
It looks like the discrepancy between the Linux code and the x86 documents
isn't a major concern for AMD systems. However, it is highly recommended that
the banks are polled before enabling MCA to find any errors from before OS
boot. It is possible that BIOS may enable MCA before the OS on some systems,
but this isn't always the case.

Tony,
Did you get any feedback regarding the sequence above?

Also, please see the patch below which is based on Boris' patch from earlier
in this thread.

Thanks,
Yazen

-------