Re: [PATCH v2] EDAC: expose per-dimm error counts in sysfs

From: Borislav Petkov
Date: Fri Oct 28 2016 - 09:03:45 EST


On Thu, Oct 27, 2016 at 02:33:32PM -0700, Aaron Miller wrote:
> The old 'csrowX' sysfs directories had per-csrow error counters, but the
> new 'dimmX' directories do not currently expose error counts.
>
> EDAC already keeps these counts, add them to sysfs so per-dimm counts
> are still available when CONFIG_EDAC_LEGACY_SYSFS=n
>
> Signed-off-by: Aaron Miller <aaronmiller@xxxxxx>
> ---

Hmm, so something's still broken in the adding of the error count,
especially that dancing around with layers.

Mauro, I think you should take a look. Here's what I did:

EDAC_DFS=/sys/kernel/debug/edac/mc0/

echo 24 > $EDAC_DFS/fake_inject_channel
echo 0 > $EDAC_DFS/fake_inject_slot
echo 3 > $EDAC_DFS/fake_inject_count

echo 1 > $EDAC_DFS/fake_inject

24 is the max channels the system reports during start:

[ 4864.030901] EDAC MC: Removed device 0 for sbridge_edac.c Sandy Bridge Socket#0: DEV 0000:3f:0e.0
[ 4865.867873] EDAC MC: Ver: 3.0.0
[ 4866.081102] edac_mc_alloc: errcount layer 0 size 8
[ 4866.086081] edac_mc_alloc: errcount layer 1 size 24
[ 4866.091133] edac_mc_alloc: allocating 64 error counters
[ 4866.096529] edac_mc_alloc: allocating 3320 bytes for mci data (24 dimms, 24 csrows/channels)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[ 4866.105146] edac_mc_alloc: ce_per_layer[0]: ffff8928b7e918f8
[ 4866.110979] edac_mc_alloc: ce_per_layer[1]: ffff8928b7e91938
[ 4866.619521] slot 0 <7>[ 4866.621530] EDAC DEBUG: edac_create_dimm_object: creating rank/dimm device dimm0
[ 4866.637405] slot 0 <7>[ 4866.639420] EDAC DEBUG: edac_create_dimm_object: creating rank/dimm device dimm3
[ 4866.655282] slot 0 <7>[ 4866.657288] EDAC DEBUG: edac_create_dimm_object: creating rank/dimm device dimm6
[ 4866.673168] slot 0 <7>[ 4866.675190] EDAC DEBUG: edac_create_dimm_object: creating rank/dimm device dimm9
[ 4866.691114] EDAC MC0: Giving out device to module sbridge_edac.c controller Sandy Bridge Socket#0: DEV 0000:3f:0e.0 (INTERRUPT)

and when I do that injection, there's boom, see below.

It looks to me that indexing in edac_inc_ce_error() goes out of bounds:

mci->ce_per_layer[i][index] += count;

if (i < mci->n_layers - 1)
index *= mci->layers[i + 1].size;

and it shouldn't.

We need to be prepared to handle crap data coming from the hardware so
those error count adding functions should check array lengths and scream
if something's overflowing...

[ 5466.964814] Generating 3 CE fake errors to 24.0.0 to test core handling. NOTE: this won't test the driver-specific decoding logic.
[ 5466.977512] EDAC MC0: INTERNAL ERROR: channel value is out of range (24 >= 8)
[ 5466.984973] edac_raw_mc_handle_error: top: 24, mid: 0, low: 0
[ 5466.991048] EDAC MC0: 3 CE FAKE ERROR on unknown memory (slot:0 page:0x0 offset:0x0 grain:0 syndrome:0x0 - for EDAC testing only)
[ 5467.003059] edac_inc_ce_error: entry, count: 3
[ 5467.007835] BUG: unable to handle kernel paging request at 00000005966a7000
[ 5467.015157] IP: [<ffffffffc0574b34>] edac_raw_mc_handle_error+0x604/0x690 [edac_core]
[ 5467.023336] PGD 436e86067
[ 5467.025874] PUD 0
[ 5467.030007] Oops: 0000 [#1] PREEMPT SMP
[ 5467.034148] Modules linked in: sb_edac edac_core binfmt_misc xfs libcrc32c nls_iso8859_15 nls_cp437 btrfs x86_pkg_temp_thermal coretemp kvm_intel xor kvm raid6_pq snd_hda_codec_hdmi irqbypass crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel snd_hda_codec_realtek snd_hda_codec_generic aes_x86_64 glue_helper lrw snd_hda_intel gf128mul snd_hda_codec ablk_helper cryptd intel_cstate intel_uncore snd_hwdep iTCO_wdt efi_pstore iTCO_vendor_support dcdbas snd_hda_core efivars serio_raw intel_rapl_perf dell_smm_hwmon snd_pcm button snd_timer acpi_cpufreq snd i2c_i801 lpc_ich mfd_core soundcore i2c_smbus processor [last unloaded: edac_core]
[ 5467.092917] CPU: 0 PID: 30180 Comm: edac-fake-injec Not tainted 4.9.0-rc2+ #2
[ 5467.100487] Hardware name: Dell Inc. Precision T3600/0PTTT9, BIOS A13 05/11/2014
[ 5467.108300] task: ffff8928b13996c0 task.stack: ffff9f0802b84000
[ 5467.114636] RIP: 0010:[<ffffffffc0574b34>] edac_raw_mc_handle_error+0x604/0x690 [edac_core]
[ 5467.123403] RSP: 0018:ffff9f0802b87bd0 EFLAGS: 00010286
[ 5467.129012] RAX: 00000000bac25000 RBX: 0000000000000000 RCX: 0000000000000018
[ 5467.136521] RDX: 0000000000000000 RSI: ffffffffc057b400 RDI: ffffffffc057cd08
[ 5467.144012] RBP: ffff9f0802b87ca8 R08: 0000000000000001 R09: 0000000000000000
[ 5467.151513] R10: 0000000000ffff10 R11: 0000000000000001 R12: 0000000000000018
[ 5467.159013] R13: 0000000000000003 R14: ffff8928b7e91000 R15: 0000000000000000
[ 5467.166518] FS: 00007f14b5453700(0000) GS:ffff8928bde00000(0000) knlGS:0000000000000000
[ 5467.174988] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5467.181089] CR2: 00000005966a7000 CR3: 0000000430823000 CR4: 00000000000406f0
[ 5467.188583] Stack:
[ 5467.190905] ffff8928b7e91658 ffff9f0802b87c28 ffffffffc057c952 0000000000000000
[ 5467.198653] 0000000000000000 ffffffffc057c105 ffffffffc057c952 ffffffff00000000
[ 5467.206399] 0000000000000000 000000187fffffff 0000000000000000 3078303a65676170
[ 5467.214146] Call Trace:
[ 5467.216905] [<ffffffff8b0bee99>] ? __lock_is_held+0x49/0x70
[ 5467.222894] [<ffffffffc0574fba>] edac_mc_handle_error+0x3fa/0x7f0 [edac_core]
[ 5467.230448] [<ffffffff8b348236>] ? debugfs_use_file_start+0x46/0xa0
[ 5467.237108] [<ffffffffc0579508>] edac_fake_inject_write+0xd8/0x100 [edac_core]
[ 5467.244722] [<ffffffff8b3487a4>] full_proxy_write+0x54/0xa0
[ 5467.250657] [<ffffffff8b203268>] __vfs_write+0x28/0x120
[ 5467.256255] [<ffffffff8b0dfdcb>] ? rcu_read_lock_sched_held+0x8b/0xa0
[ 5467.263060] [<ffffffff8b0e013f>] ? rcu_sync_lockdep_assert+0x2f/0x60
[ 5467.269780] [<ffffffff8b2073fc>] ? __sb_start_write+0x19c/0x240
[ 5467.276059] [<ffffffff8b204186>] ? vfs_write+0x186/0x1a0
[ 5467.281722] [<ffffffff8b2040b8>] vfs_write+0xb8/0x1a0
[ 5467.287120] [<ffffffff8b205429>] SyS_write+0x49/0xa0
[ 5467.292428] [<ffffffff8b002f57>] do_syscall_64+0x77/0x440
[ 5467.298170] [<ffffffff8b86a38d>] entry_SYSCALL64_slow_path+0x25/0x25
[ 5467.304873] RIP: 0033:[<00007f14b4923640>] 0x7f14b4923640
[ 5467.310527] RSP: 002b:00007ffc5f945268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 5467.318368] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f14b4923640
[ 5467.325768] RDX: 0000000000000002 RSI: 000000000106c408 RDI: 0000000000000001
[ 5467.333166] RBP: 000000000106c408 R08: 000000000000000a R09: 00007f14b5453700
[ 5467.340562] R10: 00007f14b4bec6a0 R11: 0000000000000246 R12: 00007f14b4bee2a0
[ 5467.347956] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
[ 5467.355349] Code: 44 89 fa 48 c7 c6 00 b4 57 c0 48 c7 c7 08 cd 57 c0 44 89 bd 60 ff ff ff 49 89 c4 48 6b c0 0c 44 89 e1 49 03 86 b8 04 00 00 8b 00 <4c> 8b 04 c5 00 f0 57 c0 31 c0 e8 4a a8 c0 ca 45 85 e4 79 0c 4c
[ 5467.375173] RIP [<ffffffffc0574b34>] edac_raw_mc_handle_error+0x604/0x690 [edac_core]
[ 5467.383379] RSP <ffff9f0802b87bd0>
[ 5467.387133] CR2: 00000005966a7000
[ 5467.396914] ---[ end trace 7cb35d517b5d01ac ]---

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.