Re: [PATCH v3 00/31] Hardware Events Report Mecanism (HERM)

From: Mauro Carvalho Chehab
Date: Fri Feb 10 2012 - 11:39:54 EST


Em 10-02-2012 11:26, Borislav Petkov escreveu:
> On Thu, Feb 09, 2012 at 10:00:59PM -0200, Mauro Carvalho Chehab wrote:
>> The old sysfs nodes are still supported. Latter patches will allow
>> disabling the old sysfs nodes.
>
> I wouldn't remove those easily since they're documented as an EDAC
> interface in <Documentation/edac.txt> and, as such, are probably used by
> people.

Yes, deprecating it will take some kernel releases. The main client for it
is the edac-utils, but sysadmins may have their own scripts.

IMO, we should provide a Kconfig to allow disabling the legacy sysfs, but
keep it enabled by default. With time, we may remove it together with
the backport code.

>
>> All errors currently generate the printk events, as before, but they'll
>> also generate perf events like:
>
> This format needs a bit more massaging:
>
>>
>> bash-1680 [001] 152.349448: mc_error: [Hardware Error]:
>> mce#0: Uncorrected error FAKE ERROR on label "mc#0channel#2slot#2 "
>> (channel 2 slot 2 page 0x0 offset 0x0 grain 0 for EDAC testing only)
>
> I don't see why the process and PID are relevant to the error reported
> so it should probably go. I dunno whether this can easily be done with
> the current ftrace code...

Yes, those data is not relevant. I dunno how this is implemented there.
Maybe someone more experienced with trace internal implementation
could help us on this matter.

>
>> kworker/u:5-198 [006] 1341.771535: mc_error_mce: mce#0: Corrected
>> error memory read error on label "CPU_SrcID#0_Channel#3_DIMM#0 "
>> (channel 0 slot 0 page 0x3a2db9 offset 0x7ac grain 32 syndrome
>> 0x0 CPU: 0, MCGc/s: 1000c14/0, MC5: 8c00004000010090, ADDR/MISC:
>> 00000003a2db97ac/00000020404c4c86, RIP: 00:<0000000000000000>, TSC: 0,
>> PROCESSOR: 0:206d6, TIME: 1328829250, SOCKET: 0, APIC: 0 1 error(s):
>> Unknown: Err=0001:0090 socket=0 channel=2/mask=4 rank=1)
>
> This is too much, you probably only want to say:
>
> Corrected DRAM read error on DIMM "CPU..."
>
> The channel, slot, page etc should be only Kconfigurable for people who
> really need it.

Not sure if it is a good idea to remove those info via Kconfig, as this
would mean that the userspace parsers will need to be prepared to work
with both ways. It is probably better/easier to pass everything to
userspace, and add a filter there.

>> kworker/u:5-198 [006] 1341.792536: mc_error_mce: mce#0: Corrected
>> error Can't discover the memory rank for ch addr 0x60f2a6d76 on
>> label "any memory" ( page 0x0 offset 0x0 grain 32 syndrome 0x0
>> CPU: 0, MCGc/s: 1000c14/0, MC5: 8c00004000010090, ADDR/MISC:
>> 0000000c1e54dab6/00000020404c4c86, RIP: 00:<0000000000000000>, TSC: 0,
>> PROCESSOR: 0:206d6, TIME: 1328829250, SOCKET: 0, APIC: 0 )
>
> I guess we can report EDAC failures to map the DIMM properly like this.

Hmm... Yes, that may work, but changing all the drivers to provide the
information on that way is not simple.

There are two different situations here:

1) the driver knows that it was not able to detect the memory rank.
all it needs to do is to fill the message like above. No issue at all.

2) the driver thinks that the information were decoded, but there's
a bug there. This is what the "out-of-range" error message tracks.

Changing all drivers to provide a message like above for (2) requires
lots of changes and tests, on each driver, and will provide a very
little benefit: if such error ever happens, the EDAC driver needs a
fix, and the parsed information is not reliable (the mce one can still
be used). In such case, a bug against the driver should be filled.

There's no way for doing properly at core level, as the way to decode
such out-of-range bugs is memory-architecture dependent. So, something
like:
"Can't discover the memory rank for ch addr 0x60f2a6d76"

Doesn't make much sense for FB-DIMM memory driver.

On such drivers, identifying the channel and the slot is easy: all it is
needed is to check what AMB were selected. However, identifying the
channel addr is not relevant on this case and some drivers aren't even
able to get such info, as the memory controller may be interlacing the
memories between different channels.

Also, an out of range information would mean that the driver maintainer
will need to fix something there. A message like:

kernel: EDAC MC0: UE - no information available: INTERNAL ERROR
kernel: EDAC MC0: INTERNAL ERROR: channel-b out of range (4 >= 4)

(this is a real case example)

Is clearer for a driver maintainer than:

Can't discover the memory rank for ch addr 0x60f2a6d76

As it points exactly what's wrong at the parser.

(in this specific example, the driver were able to get the proper error,
but the FBDIMM error call it were using required two channels, due to one
of the EDAC internal constraints). As most FB-DIMM drivers weren't able to
attribute an UE error to a single channel, the typical call were to pass
channel, channel + 1. In this specific MC, this is wrong.

>> New sysfs nodes are now provided, to match the real memory architecture.
>
> ... and we need those because...?

Because manufacturers wanting to map their motherboards into EDAC is finding
a very bad time, as there's no easy way to map what they have with a random
driver-specific fake csrows/channel information. Anywone wanting to do such
mapping right now would need to read and understand the edac driver. The
only driver where such mapping is easy seems to be amd64_edac, as it doesn't
support FB-DIMMs, nor the memory controller abstracts csrows information or
provides more than 2 channels.

For example, on the above driver, there's no "channel-b". The error were
on branch 1, channel 1 of branch 1 (the third channel of the memory controller).
The only way to discover it is after carefully analyzing the driver. So, anyone
trying to map what the motherboard's label DIMM 1D would quickly discover that
it means branch 1, channel 1, slot 1, but some drivers would map it as:

csrow 3, channel 1 (branch, channel -> csrow; slot ->channel)
others as:
csrow 7, channel 0 (branch, channel, slot -> csrow; channel always 0)
and others as:
csrow 1, channel 3. (slot -> csrow; branch, channel -> channel)

(yes, all 3 types of mapping exists at the drivers)

>
>> For example, on a Sandy Bridge-EP machine, with up to 4 channels, and up
>> to 3 DIMMs per channel:
>>
>> /sys/devices/system/edac/mc/mc0/
>> âââ ce_channel0
>> âââ ce_channel0_slot0
>> âââ ce_channel0_slot1
>> âââ ce_channel0_slot2
>> âââ ce_channel1
>> âââ ce_channel1_slot0
>> âââ ce_channel1_slot1
>> âââ ce_channel1_slot2
>> âââ ce_channel2
>> âââ ce_channel2_slot0
>> âââ ce_channel2_slot1
>> âââ ce_channel2_slot2
>> âââ ce_channel3
>> âââ ce_channel3_slot0
>> âââ ce_channel3_slot1
>> âââ ce_channel3_slot2
>> âââ ce_count
>> âââ ce_noinfo_count
>> âââ dimm0
>> â âââ dimm_dev_type
>> â âââ dimm_edac_mode
>> â âââ dimm_label
>> â âââ dimm_location
>> â âââ dimm_mem_type
>> â âââ dimm_size
>> âââ dimm1
>> â âââ dimm_dev_type
>> â âââ dimm_edac_mode
>> â âââ dimm_label
>> â âââ dimm_location
>> â âââ dimm_mem_type
>> â âââ dimm_size
>> âââ fake_inject
>> âââ ue_channel0
>> âââ ue_channel0_slot0
>> âââ ue_channel0_slot1
>> âââ ue_channel0_slot2
>> âââ ue_channel1
>> âââ ue_channel1_slot0
>> âââ ue_channel1_slot1
>> âââ ue_channel1_slot2
>> âââ ue_channel2
>> âââ ue_channel2_slot0
>> âââ ue_channel2_slot1
>> âââ ue_channel2_slot2
>> âââ ue_channel3
>> âââ ue_channel3_slot0
>> âââ ue_channel3_slot1
>> âââ ue_channel3_slot2
>> âââ ue_count
>> âââ ue_noinfo_count
>>
>> One of the above nodes allow testing the error report mechanism by
>> providing a simple driver-independent way to inject errors (fake_inject).
>> This node is enabled only when CONFIG_EDAC_DEBUG is enabled, and it
>> is limited to test the core EDAC report mechanisms, but it helps to
>> test if the tracing events are properly accredited to the right DIMMs.
>
> What happens with the inject_* sysfs nodes which are in EDAC already?

Will keep there, as-is. This is just yet-another testing feature, and won't
interfere or superseed the existing ones.

> [..]
>
>> The memory error handling function has now the capability of reporting
>> more than one dimm, when it is not possible to put the fingers into
>> a single place.
>>
>> For example:
>> # echo 1 >/sys/devices/system/edac/mc/mc0/fake_inject && dmesg |tail -1
>> [ 2878.130704] EDAC MC0: CE FAKE ERROR on mc#0channel#1slot#0 mc#0channel#1slot#1 mc#0channel#1slot#2 (channel 1 page 0x0 offset 0x0 grain 0 syndrome 0x0 for EDAC testing only)
>>
>> All dimm memories present on channel 1 are pointed as one of them were
>> responsible for the error.
>
> I don't see how this can be of any help? I like the EDAC failure message
> better: if we cannot map it properly for some reason, we tell so the
> user instead of generating some misleading data.

This is not a misleading data. Depending on how the ECC code is generated,
there's no way to point to a single dimm, because two or more memories are
used to produce the ECC data.

FB-DIMM memories can be in lockstep mode. If so, UE errors happen on a
memory pair.

If the system admin wants to quickly recover the machine, he needs to know
that replacing the 2 affected memories, the machine will work. He can later
put the affected memories on a separate hardware, using a single-channel
mode, in order to discover what's broken, but pointing to the two affected
memories helps him to recover quickly, while still allowing him to further
track where the problem is.

Btw, on Sandy Bridge, a memory can be on both lockstep and mirror mode. So,
if an UE error occurs, 4 DIMM's maybe affected.

>
>>
>> With regards to the output, the errors are now reported on a more
>> user-friendly way, e. g. the EDAC core will output:
>>
>> - the timestamp;
>> - the memory controller;
>> - if the error is corrected, uncorrected or fatal;
>> - the error message (driver specific, for example "read error", "scrubbing
>> error", etc)
>> - the affected memory labels.
>
> "labels"? See above, if we cannot report it properly, we better say so
> instead of misleading with multiple labels.

What the poor user is expected to do on such case, if it is not pointed to
some memories for him to test? Ok, we can improve the message to make it
clearer that likely just one of the pointed memories were affected, but
letting him with no glue would be a nightmare for the users.

>> Other technical details are provided, inside parenthesis, in order to
>> allow hardware manufacturers, OEM, etc to have more details on it, and
>> discover what DRAM has problems, if they want/need to.
>
> Exactly, "if they want/need to" sounds like a Kconfig option to me which
> can be turned on when needed.

I'm yet to know a real usecase where the user doesn't want that. He may not be
the final consumer of such data, but what we've seen, in practice, is that,
when people need to replace bad memory sticks, they go after the machine vendors,
asking for warranty replacement. The vendors usually request a more detailed
info than just "dimm xxx is broken". The rest of the log helps them to show
what actually happened with the memories, and the vendor to verify that the
complain is pertinent.

Anyway, as I said before, the better would be that the userspace tool that
retrieves such data to have an option to show the details or not.

>> Ah, now that the memory architecture is properly represented, the DIMM
>> labels are automatically filled by the mc_alloc function call, in order
>> to properly represent the memory architecture.
>>
>> For example, in the case of Sandy Bridge, a memory can be described as:
>> mc#0channel#1slot#0
>>
>> This matches the way the memory is known inside the technical information,
>> and, hopefully, at the OEM manuals for the motherboard.
>
> This is not always the case. You need the silkscreen labels from the
> board manufacturers as they do not always match with the DIMM topology
> from the hw vendor. OEM vendor BIOS should do this with a table of
> silkscreen labels or something.

Yes. However, as I've already explained, OEM vendors don't know what
"csrow 3, channel 2" means, as there are several different ways of mapping
channel#, slot# into csrow/channel, and there are at least 4 or 5 different
mapping logic inside the drivers.

If you take a look at the existing drivers that don't use csrow/channel,
as a general rule, each driver will to its own proprietary fake mapping,
with causes lots of problem for OEM's, as they need a hardware engineer
(and/or the hardware diagram) to get the real data, and a software engineer
to analyze the driver and map it into the EDAC's internal fake representation.

Regards,
Mauro.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/