RE: [PATCH v2 1/2] x86/mce/AMD: Redo use of SMCA MCA_DE{STAT,ADDR} registers

From: Ghannam, Yazen
Date: Fri Apr 07 2017 - 16:37:20 EST


> -----Original Message-----
> From: Borislav Petkov [mailto:bp@xxxxxxxxx]
> Sent: Wednesday, April 05, 2017 4:05 PM
> To: Ghannam, Yazen <Yazen.Ghannam@xxxxxxx>
> Cc: linux-edac@xxxxxxxxxxxxxxx; Tony Luck <tony.luck@xxxxxxxxx>;
> x86@xxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH v2 1/2] x86/mce/AMD: Redo use of SMCA
> MCA_DE{STAT,ADDR} registers
>
> On Wed, Apr 05, 2017 at 07:29:48PM +0000, Ghannam, Yazen wrote:
> > If it's set, then I expect a Deferred error in MCA_STATUS since any
> > Correctable Errors will be overwritten. Multiple bank types can
> > generate Deferred errors so there may also be cases where for some
> > types a valid Uncorrectable error happens and overwrites the Deferred
> > error before we can handle it. In which case we lose the Deferred error if
> we don't check MCA_DESTAT.
>
> So if we have an UE, wouldn't that raise an #MC? I guess in such cases we
> should concentrate only on the deferred errors and let the #MC handler deal
> with them. As we do now.
>

Right.

> > If it's not set, then it's possible to have a valid Correctable error
> > in MCA_STATUS while the valid Deferred error is in MCA_DESTAT.
>
> What's logging the CE? We probably should log it too before something
> overwrites it.
>

CEs will get picked up by polling or if we hit a threshold.

> Anyway, ok, I think I know what needs to happen now:
>
> amd_deferred_error_interrupt:
>
> if (__log_error_deferred(bank))
> return;
>
> This one read MC?_STATUS and does the logging for when the deferred error
> is in the normal MSRs. It returns true if it succeeded. It reads and hands down
> both MC?_STATUS and MC?_ADDR to __log_error() so that it doesn't have to
> read MC?_STATUS twice.
>

Okay. But do we need a return value? I'm thinking we can go through all banks
and log any and all Deferred errors rather than just the first one we find. I asked
and this is the preferred method like how we do in the #MC handler. The same
applies to the thresholding interrupt handler.

> If __log_error_deferred() has read a different type of error, we still log it? I'm
> not sure about this. I guess we can ignore that case for now.
>

Yeah, we should ignore other error types. They'll get picked up by other methods.

> Then:
>
> if (mca_flags.smca)
> __log_error_deferred_smca(bank));
>
> which handles the SMCA case. It too reads MSR_AMD64_SMCA_MCx_DESTAT
> and MSR_AMD64_SMCA_MCx_DEADDR and hands them down to
> __log_error() for logging.
>

Can this just be included in the __log_error_deferred()? I'm thinking something
similar to one of your earlier suggestions. Like this:

static void
__log_error_deferred(unsigned int bank)
{
u64 status, addr;
bool logged = false;

rdmsrl(msr_ops.status, status);

if (is_deferred_error(status)) {
rdmsrl(msr_ops.addr, addr);

__log_error(bank, status, addr, 0);

wrmsrl(msr_ops.status, 0);

logged = true;
}

if (!mca_flags.smca)
return;

if (logged) {
wrmsrl(MSR_AMD64_SMCA_MCx_DESTAT, 0);
return;
}

rdmsrl(MSR_AMD64_SMCA_MCx_DESTAT(bank), status);

if (is_deferred_error(status)) {
rdmsrl(MSR_AMD64_SMCA_MCx_DEADDR, addr);

__log_error(bank, status, addr, 0);

wrmsrl(MSR_AMD64_SMCA_MCx_DESTAT, 0);
}
}

> For the __log_error() call in amd_threshold_interrupt(), you define a
> log_error() wrapper which reads the default MSRs and hands them down to
> __log_error().
>

Okay.

> So __log_error() always gets STATUS and ADDR MSR values and it doesn't
> need to read them from the MSRs but only log them.
>

Okay.

Thanks,
Yazen