Re: [PATCH 1/3] mce: Add a msg string to the MCE tracepoint

From: Mauro Carvalho Chehab
Date: Wed Feb 29 2012 - 12:46:03 EST


On 29-02-2012 14:16, Borislav Petkov wrote:
> On Wed, Feb 29, 2012 at 04:58:09PM +0000, Luck, Tony wrote:
>>> - severity: No real need for it. If the error is severe enough, the
>>> kernel handles automatically, i.e. memory poisoning and recovery. In all
>>> the other cases it is not severe enough.
>>
>> We'll never see fatal errors via the perf/tracepoint (no way the RAS daemon
>> will run to pull them).

With the current approach, that's true.

I remember you mentioned an idea of storing fatal errors in APEI
non-volatile memory so that they can be sent to userspace after the
machine reboots. If this is implemented some day, those types of errors
could also be reported via trace, depending on how such a feature is
implemented; one possibility would be to simply store the contents of
the last dmesg there.

>> But we will see both corrected error chatter and
>> recovered uncorrectable errors. I would be able to tell these apart.
>> Corrected errors in small doses are normal and don't require any
>> action beyond logging so you can see whether there are enough to cross
>> a threshold and cause alarm. Recovered uncorrectable errors are going
>> to be much rarer, and I think deserve closer scrutiny - even when there
>> is just one of them.
>> If you drop the severity field, is there some other way to make this
>> distinction?
>
> Err, MCi_STATUS bits like bit 55 (Action Required) and 56 (Signaled #MC)
> in your case...?

That would force all userspace tools that handle such errors to have
some MCA-specific logic inside, which is one of the things we're trying
to avoid. Also, non-MCA drivers will have severities that aren't present
in the MCA status.
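
To make the point concrete, here is a rough sketch of the kind of
MCA-specific decoding each tool would need to carry if severity had to
be derived from MCi_STATUS. The bit positions (VAL 63, UC 61, S 56,
AR 55) follow the Intel SDM; the enum and the function name are made up
for illustration and are not an existing kernel or userspace API. The
sketch also ignores PCC, EN and whether the CPU supports software error
recovery at all, so a real decoder would need more than this.

```c
#include <stdint.h>

/* Bit positions in IA32_MCi_STATUS, per the Intel SDM. */
#define MCI_STATUS_VAL (1ULL << 63) /* register contents are valid */
#define MCI_STATUS_UC  (1ULL << 61) /* uncorrected error */
#define MCI_STATUS_S   (1ULL << 56) /* signaled via #MC */
#define MCI_STATUS_AR  (1ULL << 55) /* action required */

/* Hypothetical severity classes, for illustration only. */
enum sketch_severity {
	SEV_NONE,                  /* no valid error logged */
	SEV_CORRECTED,             /* corrected error */
	SEV_UNCORRECTED_NO_ACTION, /* uncorrected, not signaled */
	SEV_ACTION_OPTIONAL,       /* signaled, action optional */
	SEV_ACTION_REQUIRED,       /* signaled, action required */
};

/* Map a raw MCi_STATUS value to one of the classes above. */
enum sketch_severity sketch_mce_severity(uint64_t status)
{
	if (!(status & MCI_STATUS_VAL))
		return SEV_NONE;
	if (!(status & MCI_STATUS_UC))
		return SEV_CORRECTED;
	if (!(status & MCI_STATUS_S))
		return SEV_UNCORRECTED_NO_ACTION;
	return (status & MCI_STATUS_AR) ? SEV_ACTION_REQUIRED
					: SEV_ACTION_OPTIONAL;
}
```

A non-MCA driver has no such register to decode, which is why a severity
field carried in the event itself is easier for a generic tool to consume.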

Assuming that the same tool can work with both MCA and non-MCA drivers,
for API consistency, we should try to use the same way to describe
severity (and label/location) on both MCA and non-MCA cases.

With regard to Intel, as far as I know, there is some CPU-family-specific
behavior for recovered uncorrectable errors.

>>> - silkscreen_label: <sarcasm> yeah, I'm getting a, say, a Data
>>> Cache error during an L1 linefill from L2, what the f*ck does the
>>> silkscreen label mean for such an error?! Well, nobody knows wtf it
>>> means!</sarcasm>
>>
>> Cache error should point to a cpu socket - I'd like to have a silk
>> screen label for that (are they numbered "0, 1, 2 ..." on the motherboard
>> or "1, 2, 3 ..."?) No idea where we'd get that information from. dmidecode
>> shows "Socket Designation: CPU 1" (and "2") for my current Sandy Bridge
>> system. I'd have to pull the system apart to see if those are helpful
>> in identifying which physical cpu is which.
>
> First of all, silkscreen label denotes DIMM slots in this context
> AFAICT.

No. I'm referring to the silkscreen label (and the location) of the
affected component, not just to DIMMs.

> Concerning CPU sockets, I'm not aware of a method to read out
> the silkscreen labels at the CPU sockets, are you? Or am I missing
> something?

The same strategy used by EDAC can be used there: add a 'label' node to
/sys/devices/system/cpu/cpu?/ so that userspace can fill it in or
override it.

> IOW, we want to assume that cores 0, 1, 2 ... k-1 are on node 0; k, k+1
> ... 2k-1 belong to node 1, etc., where k is the number of cores on a
> socket and thus we have a regular core enumeration on the box.

Initially, the RAS code could fill the 'label' using the above criteria,
while allowing a userspace tool to get the labels from dmidecode and
set them there.
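
As a minimal sketch of that initial fill, assuming the regular core
enumeration described above (k cores per socket): the helper name and
the "CPU <n>" label format are hypothetical, and counting sockets from
0 is only a guess, since the board may well number them from 1.

```c
#include <stdio.h>

/*
 * Hypothetical helper: build a default label for a core, assuming cores
 * are enumerated regularly with cores_per_socket cores on each socket
 * (cores 0..k-1 on socket 0, k..2k-1 on socket 1, and so on).
 *
 * Returns 0 on success, -1 on bad input or if the label was truncated.
 */
int sketch_default_cpu_label(unsigned int cpu, unsigned int cores_per_socket,
			     char *buf, size_t len)
{
	int n;

	if (!cores_per_socket || !buf || !len)
		return -1;

	/* Integer division gives the socket index under this assumption. */
	n = snprintf(buf, len, "CPU %u", cpu / cores_per_socket);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A userspace tool could later replace such a default with whatever string
dmidecode reports as the "Socket Designation" on that machine.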

Regards,
Mauro