This is a "help wanted" message :(.
I've recently run into a strange kernel oops while testing Glibc for LoongArch. The log looks like:
[11569.195043] Kernel ade access[#1]:
[11569.198441] CPU: 1 PID: 1132296 Comm: ld-linux-loonga Not tainted 6.2.0-rc8+ #61
[11569.205792] Hardware name: Loongson Loongson-3A5000-HV-7A2000-1w-V0.1-EVB/Loongson-LS3A5000-7A2000-1w-EVB-V1.21, BIOS Loongson-UDK2018-V4.0.05383-beta10 1
[11569.219536] $ 0 : 0000000000000000 90000000005e3448 90000001113a0000 90000001113a3ab0
[11569.227505] $ 4 : 90000001113a3af8 1000000000cf16d0 5555555555555850 000000000000000c
[11569.235475] $ 8 : 90000000009caa10 0000000000000000 00000000000002ca 000000000000008b
[11569.243438] $12 : 0000000000000001 9000000000cf1258 ffffffffffffffff 00007ffffb93c000
[11569.251402] $16 : 0000000000000000 0000000000000140 0000000000000000 0000000000000020
[11569.259366] $20 : 90000001113a3ec8 9000000000a97ee0 00007ffffb93bfa0 1555555555555613
[11569.267334] $24 : 1000000000cf16d0 000000000000000c 9000000000cf1258 90000000009caa10
[11569.275303] $28 : 90000001113a3af8 0aaaaaaaaaaaab0a 00007ffffb93bde0 90000001113a3ec0
[11569.283268] era : 90000000009caa10 cmp_ex_search+0x0/0x28
[11569.288814] ra : 90000000005e3448 bsearch+0x58/0xa8
[11569.293921] CSR crmd: 000000b0
[11569.293923] CSR prmd: 00000004
[11569.297037] CSR euen: 00000000
[11569.300152] CSR ecfg: 00071c1c
[11569.303266] CSR estat: 00480000
[11569.309587] ExcCode : 8 (SubCode 1)
[11569.313049] BadVA : 1000000000cf16d0
[11569.316596] PrId : 0014c011 (Loongson-64bit)
[11569.320923] Modules linked in: amdgpu nls_cp936 vfat fat input_leds drm_ttm_helper ttm video gpu_sched drm_buddy snd_hda_codec_generic drm_display_helper ledtrig_audio drm_kms_helper led_class snd_hda_intel sha256_generic snd_intel_dspcfg cfbfillrect libsha256 snd_hda_codec syscopyarea snd_hda_core hid_generic cfbimgblt cfg80211 snd_pcm sysfillrect usbhid sysimgblt snd_timer cfbcopyarea hid snd igb soundcore efivarfs
[11569.357709] Process ld-linux-loonga (pid: 1132296, threadinfo=000000003cbd0caa, task=000000005bcd27a6)
[11569.366977] Stack : 00007ffffb93bd60 0000000000000000 9000000180a36a40 0000000000000001
[11569.374940] 90000001113a3bb0 00007ffffb93c000 9000000000224c94 90000000009cab2c
[11569.382899] 0000000000000001 9000000000224c94 00007ffff3258000 900000000025a1b4
[11569.390866] 90000001113a3bb0 900000000022f4cc 00007ffffb93c000 900000000022f74c
[11569.398834] 9000000180a36a40 0000000000000001 0000000000000000 00007ffffb93c000
[11569.406800] 90000001113a3bb0 900000000022f8f8 90000001113a3ec0 00007ffffb93bde0
[11569.414768] 00007ffffb93bd60 0000000000000000 0000000000000000 00007fffff7c4600
[11569.422734] 9000000182ebab70 9000000000d08000 0000000046505501 900000000022ee6c
[11569.430698] 0000000000000000 9000000000224b84 90000001113a0000 90000001113a3cf0
[11569.438661] 0000000000000000 00007ffffb93c0d0 0000000000000000 0000000000000040
[11569.446627] ...
[11569.449058] Call Trace:
[11569.449062] [<90000000009caa10>] cmp_ex_search+0x0/0x28
[11569.456681] [<90000000005e3448>] bsearch+0x58/0xa8
[11569.461443] [<90000000009cab2c>] search_extable+0x28/0x34
[11569.466807] [<900000000025a1b4>] search_exception_tables+0x48/0x7c
[11569.472953] [<900000000022f4cc>] fixup_exception+0x18/0xcc
[11569.478410] [<900000000022f74c>] do_sigsegv+0x174/0x1b0
[11569.483605] [<900000000022f8f8>] do_page_fault+0x170/0x344
[11569.489058] [<900000000022ee6c>] tlb_do_page_fault_1+0x128/0x1c4
[11569.495029] [<9000000000224b84>] handle_signal+0x634/0x884
[11569.500487] [<9000000000225704>] arch_do_signal_or_restart+0xb4/0xe0
[11569.506808] [<90000000002b5b30>] exit_to_user_mode_prepare+0xbc/0x100
[11569.513214] [<9000000000a02628>] syscall_exit_to_user_mode+0x30/0x4c
[11569.519533] [<90000000002214a4>] handle_syscall+0xc4/0x160
[11569.526472] Code: 4c000020 02800404 4c000020 <240000ac> 26000084 0010b0a5 680014a4 00129484 00111004
[11569.537704] ---[ end trace 0000000000000000 ]---
"BadVA : 1000000000cf16d0" may suggest the highest bit of an address is
somehow cleared.
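
The BadVA reading above can be checked with plain integer arithmetic.
What follows is only a sketch in Python, not part of the original
analysis. It confirms that BadVA differs by bit 63 alone from an address
in the 0x9000... window, where every other kernel pointer in the dump
lives. It also tries one guess about the registers: that $26 is the
bsearch base, $29 an element index and $7 the element size. That register
interpretation is an assumption drawn only from the dump; it has not been
verified against the disassembly.

```python
# Values copied from the oops log above.
BAD_VA = 0x1000000000CF16D0  # BadVA, also $5 and $24
REG_26 = 0x9000000000CF1258  # $26 (same value as $13); ASSUMED to be the bsearch base
REG_29 = 0x0AAAAAAAAAAAAB0A  # $29; ASSUMED to be an element index
REG_7 = 0xC                  # $7; ASSUMED to be the element size in bytes

MASK64 = (1 << 64) - 1

# Check 1: set bit 63 of BadVA and see which bits differ.
candidate = BAD_VA | (1 << 63)
diff = BAD_VA ^ candidate
print("candidate address  :", hex(candidate))
print("differing bits     :", hex(diff))
print("offset from $26    :", hex(candidate - REG_26))

# Check 2: does base + index * size, truncated to 64 bits, give BadVA?
wrapped = (REG_26 + REG_29 * REG_7) & MASK64
print("base + index * size:", hex(wrapped))
print("matches BadVA      :", wrapped == BAD_VA)
```

If the second check holds, the same BadVA could also come from a 64-bit
wrap-around caused by a very large index rather than from a single
cleared bit. Both readings produce the observed address, so this
arithmetic alone does not settle the cause.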
The issue is not deterministic, but it seems easy to reproduce:
1. Compile Glibc:
    ../glibc/configure --prefix=/usr \
        --disable-werror \
        --enable-kernel=5.19 \
        --enable-stack-protector=strong \
        --with-headers=/usr/include \
        libc_cv_slibdir=/usr/lib
    make -j4
2. Check Glibc:
    make check -j4
3. If the oops did not happen during the previous step, run a specific
   test in an infinite loop:
    while true; do make test t=malloc/tst-mallocfork3-malloc-check; done
   An oops will then likely show up within several minutes.
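
The loop in step 3 can also be made to stop by itself once a new oops is
logged. The following is a rough sketch, assuming it is run from the
Glibc build directory and that the user may read the kernel log with
dmesg (this may need root). The function names and the max_runs limit
are made up for illustration; the marker string is taken from the first
line of the log above.

```python
import subprocess

OOPS_MARKER = "Kernel ade access"  # first line of the oops in the log above
TEST_CMD = ["make", "test", "t=malloc/tst-mallocfork3-malloc-check"]


def count_oops(log_text):
    """Count kernel log lines that contain the oops marker."""
    return sum(1 for line in log_text.splitlines() if OOPS_MARKER in line)


def read_kernel_log():
    """Return the current kernel ring buffer as text (assumes dmesg is readable)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout


def reproduce(max_runs=1000):
    """Run the test repeatedly; return the run number at which a new oops appeared."""
    baseline = count_oops(read_kernel_log())
    for run in range(1, max_runs + 1):
        # The test itself may fail; only the kernel log matters here.
        subprocess.run(TEST_CMD, check=False)
        if count_oops(read_kernel_log()) > baseline:
            return run
    return None  # no new oops within max_runs iterations
```

Calling reproduce() returns the iteration at which the oops count in
dmesg first grew, or None if nothing showed up, which makes it easier to
compare how quickly different kernel builds hit the problem.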
Though the oops is nondeterministic, I'm almost sure it's not a hardware
stability issue because I get exactly the same stack trace for every
oops. I cannot easily rule out the possibility that the compiler
miscompiles kernel code, though.
I'm running 6.2-rc8 with the following patches from loongarch-next:
ACPI: Define ACPI_MACHINE_WIDTH to 64 for LoongArch
PCI: loongson: Improve the MRRS quirk for LS7A
PCI: Add quirk for LS7A to avoid reboot failure
irqchip/loongson-liointc: Save/restore int_edge/int_pol registers during S3/S4
LoongArch: Add vector extensions support
tools: Add LoongArch build infrastructure
libbpf: Add LoongArch support to bpf_tracing.h
selftests/seccomp: Add LoongArch selftesting support
SH: cpuinfo: Fix a warning for CONFIG_CPUMASK_OFFSTACK
LoongArch: Add CPU HWMon platform driver
Any ideas on how to fix the issue, or suggestions for debugging it further?