Re: edac driver injection of uncorrected errors & utils

From: Tracy Smith
Date: Wed Nov 28 2018 - 17:14:43 EST


Nothing appears in the logs or in the edac-util output to indicate
that a multi-bit UE (uncorrected error) occurred. There is just a
crash, and even then I'm not 100% certain it was caused by multi-bit
errors without debugging the crash. It happened when writing 1 to
inject_data_lo and inject_data_hi and 0x100 to inject_ctrl.

Is there another way of creating an uncorrected error without crashing
Linux using the layerscape driver? I would like to see a UE collected
without a Linux crash because I need to validate that UEs are being
collected.

Do AMD platforms or other memory controllers crash Linux on multi-bit
errors and fail to collect uncorrected errors? This is a concern in
the field, since there would be no way of knowing that multi-bit
errors occurred and that they caused the crash.

For production and in the field, we can't have the Linux kernel or
the layerscape driver crashing on multi-bit errors without giving any
information in the kernel log on what caused the crash. First, it
could cost millions in highly critical use cases. Second, it should
be preventable.

So two concerns/questions:

1. Need a way to validate that UEs are captured without crashing the kernel
2. On multi-bit errors, need a way to catch a UE before a kernel crash
and ideally prevent the kernel from crashing on multi-bit errors
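
A minimal sketch of the check behind point 1, assuming the driver
exposes the standard EDAC sysfs counters ce_count and ue_count under
each mcN directory. The function name edac_counts and its optional
directory argument are hypothetical, added only so the function can be
pointed at a test directory instead of sysfs.

```shell
#!/bin/sh
# Print the corrected (CE) and uncorrected (UE) error counters for every
# memory controller directory found under the given base directory.
# The optional first argument is a hypothetical override for testing;
# by default the standard EDAC sysfs location is read.
edac_counts() {
    base="${1:-/sys/devices/system/edac/mc}"
    for mc in "$base"/mc*; do
        # Skip the unexpanded glob when no controller is registered.
        [ -d "$mc" ] || continue
        ce=$(cat "$mc/ce_count" 2>/dev/null || echo "n/a")
        ue=$(cat "$mc/ue_count" 2>/dev/null || echo "n/a")
        echo "$(basename "$mc"): ce_count=$ce ue_count=$ue"
    done
}
```

Running edac_counts before and after an injection would show whether
ue_count moved, provided the system survives long enough to read it.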

Any recommendations?

Scenario produced on an ARM layerscape board.

echo 1 > /sys/devices/system/edac/mc/mc0/inject_data_lo
echo 1 > /sys/devices/system/edac/mc/mc0/inject_data_hi
echo 0x100 > /sys/devices/system/edac/mc/mc0/inject_ctrl

[ 495.327720] CPU: 3 PID: 1239 Comm: sh Not tainted 4.1.35-rt41 #1
[ 495.327723] EDAC FSL_DDR MC0: Err Detect Register: 0x80000008
[ 495.327725] Hardware name: LS1043A Board (DT)
[ 495.327735] task: ffff800063dd3300 ti: ffff800073358000 task.ti: ffff800073358000
[ 495.327740] PC is at 0x42cf80
[ 495.327742] LR is at 0x42d20c
[ 495.327745] pc : [<000000000042cf80>] lr : [<000000000042d20c>] pstate: 20000000
[ 495.327746] sp : ffff80007335bff0
[ 495.327751] x29: 0000ffffd1f0b6e0 x28: 00000000004e0000
[ 495.327756] x27: 000000003cdf81b0 x26: 00000000004d8000
[ 495.327760] x25: 00000000004aea80 x24: 00000000004aea88
[ 495.327764] x23: 00000000004e1000 x22: 00000000004c0e10
[ 495.327768] x21: 00000000004aed98 x20: 00000000004ae868
[ 495.327772] x19: 00000000004ae868 x18: 0000000000000015
[ 495.327776] x17: 0000ffff7a24fb48 x16: 00000000004d8638
[ 495.327781] x15: 002372c270000000 x14: ffffffffffffffff
[ 495.327785] x13: 0000000000000018 x12: 0000000000000028
[ 495.327789] x11: 0000000000000038 x10: 0101010101010101
[ 495.327793] x9 : fefefefefefefeff x8 : 000000003ce19f50
[ 495.327797] x7 : 0000ffffd1f0b9e8 x6 : 0000000000000000
[ 495.327801] x5 : 00000000004e1dd0 x4 : 000000003ce19e50
[ 495.327805] x3 : 0000000000000000 x2 : 0000ffffd1f0b7f0
[ 495.327809] x1 : 0000ffffd1f0b7e0 x0 : 00000000004ae868
[ 495.327810]
[ 495.327817] Unhandled fault: synchronous external abort (0x96000210) at 0xffff800000e1ec10
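
The only EDAC line in the trace above is the Err Detect Register
value. As a rough sketch, it can be decoded as follows, assuming the
DDR_ERR_DETECT bit masks from the kernel's Freescale DDR EDAC driver
header (MME 0x80000000, MBE 0x8, SBE 0x4, MSE 0x1). These masks are an
assumption and should be checked against the kernel source in use;
decode_err_detect is a hypothetical helper name.

```shell
#!/bin/sh
# Decode an Err Detect Register value into flag names.
# Bit masks are assumed from the Freescale DDR EDAC driver header:
#   MME 0x80000000  multiple memory errors
#   MBE 0x8         multi-bit error
#   SBE 0x4         single-bit error
#   MSE 0x1         memory select error
decode_err_detect() {
    val=$(( $1 ))
    flags=""
    if [ $(( val & 0x80000000 )) -ne 0 ]; then flags="$flags MME"; fi
    if [ $(( val & 0x8 )) -ne 0 ]; then flags="$flags MBE"; fi
    if [ $(( val & 0x4 )) -ne 0 ]; then flags="$flags SBE"; fi
    if [ $(( val & 0x1 )) -ne 0 ]; then flags="$flags MSE"; fi
    # Strip the leading space left by the first appended flag.
    echo "${flags# }"
}
```

Under those assumptions, decode_err_detect 0x80000008 prints
"MME MBE", which would mean the controller did flag a multi-bit error
even though no UE count was reported.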

On Wed, Nov 28, 2018 at 1:24 PM York Sun <york.sun@xxxxxxx> wrote:
>
> Tracy,
>
> This DDR controller doesn't have the capability to inject limited
> errors. As soon as you enable the error injection, all memory
> transactions will carry the error. Since multi-bit errors are not
> correctable, I don't expect Linux to work properly with these errors.
>
> York
>
>
> On 11/28/18 1:11 PM, Tracy Smith wrote:
> > Thanks York. Why will injecting multi-bit errors crash linux? Is this
> > the case only for layerscape? Is there a way to harden against this?
> >
> > On Wed, Nov 28, 2018 at 1:06 PM York Sun <york.sun@xxxxxxx> wrote:
> >>
> >> Tracy,
> >>
> >> You can inject multiple-bit errors. You will crash the system for doing
> >> that. I can't comment on edac-util.
> >>
> >> York
> >>
> >>
> >> On 11/28/18 12:49 PM, Tracy Smith wrote:
> >>> Can I inject an uncorrected error or only corrected errors using the
> >>> layerscape edac driver injection via sysfs?
> >>>
> >>> Is this the expected output for the edac-util on layerscape when
> >>> injecting errors?
> >>>
> >>> root@ls1043ardb:~# edac-util -v
> >>> mc0: 0 Uncorrected Errors with no DIMM info
> >>> mc0: 0 Corrected Errors with no DIMM info
> >>> mc0: csrow0: 0 Uncorrected Errors
> >>> mc0: csrow0: mc#0csrow#0channel#0: 643 Corrected Errors
> >>>
> >>> root@ls1043ardb:~# edac-util -vs
> >>> edac-util: EDAC drivers are loaded. 1 MC detected:
> >>> mc0:fsl_mc_err
> >>>
> >>> root@ls1043ardb:~# edac-util
> >>> mc0: csrow0: mc#0csrow#0channel#0: 2700 Corrected Errors
> >>>
> >>> Does edac-ctl function on ARM based platforms or only on x86 and why
> >>> might it show 0MB for the memory layout for DDR4 as below?
> >>>
> >>> /run/media/nvme0n1p1/tls/neo_mcu-kernel/drivers/edac-utils# edac-ctl --layout
> >>> readline() on closed filehandle IN at /usr/sbin/edac-ctl line 514.
> >>> Use of uninitialized value $size in sprintf at /usr/sbin/edac-ctl line 533.
> >>> Use of uninitialized value $size in sprintf at /usr/sbin/edac-ctl line 533.
> >>> Use of uninitialized value $size in sprintf at /usr/sbin/edac-ctl line 533.
> >>> Use of uninitialized value $size in sprintf at /usr/sbin/edac-ctl line 533.
> >>>           +-----------------------------------------------+
> >>>           |                      mc0                      |
> >>>           |  csrow0   |  csrow1   |  csrow2   |  csrow3   |
> >>> ----------+-----------------------------------------------+
> >>> channel0: |   0 MB    |   0 MB    |   0 MB    |   0 MB    |
> >>> ----------+-----------------------------------------------+
> >>>
> >>
> >
> >
> > --
> > Confidentiality notice: This e-mail message, including any
> > attachments, may contain legally privileged and/or confidential
> > information. If you are not the intended recipient(s), please
> > immediately notify the sender and delete this e-mail message.
> >
>

