Re: [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info Structure

From: Borislav Petkov
Date: Tue Feb 27 2018 - 17:11:11 EST


On Tue, Feb 27, 2018 at 09:32:18PM +0000, Ghannam, Yazen wrote:
> Not much more readable. It's still vague and confusing to a user and devoid
> of any real info so an expert can't help. And now the information is printed
> arbitrarily, so someone needs to read the source to figure out what it really
> means.

WTF? You need to read the source to figure out what the error is? So
"Corrected Processor Error" is confusing? I think you've been clearly
staring at the spec for waaay too long.

Also, read my mail again:

"Now, I admit that my vesion of the record is not enough to debug it
but it needs to contain only information which is clear and humanly
readable to debug. You can always dump the raw data underneath from the
tracepoint but make the beginning human readable."

> Maybe. But these records are generated by Platform Firmware. Why would
> FW report the error knowing the system is about to die?

WTF more! Dude, are you kidding me?

So the firmware should not report the error if it knows the system is
about to die?!?!

Now you're just making up insane counterarguments, just because.

> Your example still says "Hardware Error"

Oh my, that's the *prefix*.

> and odds are general users won't understand what the error type means
> or what a corrected error is. So it's not much better.

Yeah, and the next thing you'll say is that users won't understand what
"error" means, right?

Geez.

> Exactly! The more info available (usually) the more quickly it can be
> diagnosed.

You still don't understand what I'm trying to explain to you.

It is *not* about diagnosing it - in order to do that you need to
involve people to diagnose it. It is about making the error record as
human readable as possible so that you don't *have* to involve people to
diagnose it in the first place and the user can say, ah ok, corrected
error, no need to do anything.

Or "System Fatal error" - I better replace that part.

> Hardware errors are generally rare and hard to reproduce.

Except when they're not like DRAM ECC floods which are pretty easy to
reproduce.

> So when one does occur we should capture the data and provide it.

Did I say you should not do that?!

I said: make it as human readable as possible and dump the gory hex crap
after it.

> Here are a couple of scenarios based on similar experiences I've had:

Now play that same scenario with the following record format:

[ human readable error record ]
[ full raw error dump ]

> I'll send a V3 set with the following changes:
> 1) Fix table numbering in commit messages.
> 2) Remove "Validation Bits" lines.
> 3) Only print error type GUID for unknown types.
>
> I think this set should focus on printing the x86 CPER based on the UEFI
> spec and the convention of the other CPER code. CPERs are generated
> by Platform Firmware. So errors are explicitly intended to be viewed by
> the user and all info should be displayed.

You should look up from the spec and realize that real life is much
different.

> I *have* been thinking that it would be nice to take the CPER and pipe it
> through the MCA decoding in arch/x86 and EDAC. This would be really
> nice for when the CPER includes the MCA registers in the Context info.
> So we'd get our usual MCA decoding instead of a binary blob of registers.

That would definitely be a step in the right direction.

> I was thinking that the MCA decoding would be in addition to this. But
> based on Boris's comments, maybe we can make it a default selection.
> For example, if MCA/EDAC decoding is available, use it. Otherwise, print
> the CPER fields in a generic way like we do for the other CPER types.

Yes.

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix ImendÃrffer, Jane Smithard, Graham Norton, HRB 21284 (AG NÃrnberg)
--