RE: [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info Structure

From: Ghannam, Yazen
Date: Tue Feb 27 2018 - 13:40:23 EST


> -----Original Message-----
> From: Borislav Petkov [mailto:bp@xxxxxxx]
> Sent: Tuesday, February 27, 2018 1:03 PM
> To: Ghannam, Yazen <Yazen.Ghannam@xxxxxxx>
> Cc: linux-efi@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> ard.biesheuvel@xxxxxxxxxx; x86@xxxxxxxxxx
> Subject: Re: [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info
> Structure
>
> On Tue, Feb 27, 2018 at 05:46:54PM +0000, Ghannam, Yazen wrote:
> > I think there's value in following the conventions in a subsystem.
>
> "conventions in a subsystem" my ass. That's brainless copy-pasting.
>
> It was added by
>
> f6edea77c8c8 ("ACPI, APEI, CPER: Cleanup CPER memory error output
> format")
>
> and then replicated everywhere.
>
> It is simply a dumb way of writing:
>
> snprintf(newpfx, sizeof(newpfx), "%s ", pfx);
>

There's a space after the %s, right? I missed it at first glance, though maybe
my LASIK didn't take very well.

>
> > I can change this if you give a reason besides "it's dumb".
>
> Two can play that game: you get to keep it if you give a good reason
> why.
>

Code readability.

...
> > And the raw value should still be printed because
> > 1) It may represent a type that we can't decode. Maybe a type that's not
> part of
> > the spec.
>
> If we can't decode it, *then* you dump it:
>
> "Unrecognized type: 0x%llx ..."
>
> > 2) It's good to have the raw value for reference. We do this with
> MCA_STATUS
> > where we print the raw value followed by the decoding.
>
> 1. No one stares at the raw value if the bits are decoded
> 2. MCA_STATUS is one register - this error record is huge.
>

1) No one except debug and HW design folks, who will eventually get a
report from a user.
2) You're right, the record is huge. Printing out the Validation Bits, GUID,
and raw Check Info will be an extra 5 lines.

I posted an example at the end.

We could get rid of the GUID for known types like you suggest. Also, we
can drop the Validation Bits for the Check Info since that's already part of
it.

Thanks,
Yazen

[ 1.990948] [Hardware Error]: Error 1, type: corrected
[ 1.995789] [Hardware Error]: fru_text: ProcessorError
[ 2.000632] [Hardware Error]: section_type: IA32/X64 processor error
[ 2.005796] [Hardware Error]: Validation Bits: 0x0000000000000207
[ 2.010953] [Hardware Error]: Local APIC_ID: 0x0
[ 2.015991] [Hardware Error]: CPUID Info:
[ 2.020747] [Hardware Error]: 00000000: 00800f12 00000000 00400800 00000000
[ 2.025595] [Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
[ 2.030423] [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
[ 2.035198] [Hardware Error]: Error Information Structure 0:
[ 2.039961] [Hardware Error]: Error Structure Type: a55701f5-e3ef-43de-ac72-249b573fad2c
[ 2.049608] [Hardware Error]: Error Structure Type: cache error
[ 2.054344] [Hardware Error]: Validation Bits: 0x0000000000000001
[ 2.059046] [Hardware Error]: Check Information: 0x0000000020540087
[ 2.063625] [Hardware Error]: Validation Bits: 0x0087
[ 2.068032] [Hardware Error]: Transaction Type: 0, Instruction
[ 2.072423] [Hardware Error]: Operation: 5, instruction fetch
[ 2.076776] [Hardware Error]: Level: 1
[ 2.081073] [Hardware Error]: Overflow: true
[ 2.085360] [Hardware Error]: Context Information Structure 0:
[ 2.089661] [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
[ 2.098487] [Hardware Error]: Register Array Size: 0x0050
[ 2.103113] [Hardware Error]: MSR Address: 0xc0002011
[ 2.107742] [Hardware Error]: Register Array:
[ 2.112270] [Hardware Error]: 00000000: d8200000000a0151 0000000000000000
[ 2.116845] [Hardware Error]: 00000010: d010000000000000 0000000300000031
[ 2.121228] [Hardware Error]: 00000020: 000100b000000000 000000004a000000
[ 2.125514] [Hardware Error]: 00000030: 0000000000000000 0000000000000000
[ 2.129747] [Hardware Error]: 00000040: 0000000000000000 0000000000000000