[sparc64] running stress-ng and a sparc64 hardware / kernel woes

From: Anatoly Pugachev
Date: Sun Jan 03 2021 - 07:57:46 EST


Hello!

Running a simple stress-ng as a non-privileged (non root) user :

stress-ng --opcode 1 --timeout 60 --metrics-brief

will almost always bring the linux kernel to an unusable state,
starting from "Unable to handle kernel NULL pointer dereference",
"Bogus kernel PC [0000000000000000] in fault handler", "Kernel
unaligned access at TPC", "Unable to handle kernel paging request at
virtual address" and "rcu: INFO: rcu_sched detected stalls on
CPUs/tasks"...

Can you please suggest a basic linux kernel config "debug" option(s)
to start with to try to debug this issue(s) and report kernel logs ?

Thanks.

PS: kernel logs with the last run:

[ 144.386901] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 144.386954] rcu: 0-...!: (3 GPs behind) idle=574/0/0x0
softirq=427/427 fqs=1 (false positive?)
[ 144.387029] (detected by 13, t=5252 jiffies, g=3145, q=224)
[ 144.387161] CPU[ 0]: TSTATE[0000004480001600]
TPC[000000000042c324] TNPC[000000000042c328] TASK[swapper/0:0]
[ 144.387202] TPC[arch_cpu_idle+0xa4/0xc0]
O7[arch_cpu_idle+0x90/0xc0] I7[default_idle_call+0xd8/0x1a0]
RPC[do_idle+0x110/0x1a0]
[ 144.387278] rcu: rcu_sched kthread starved for 5248 jiffies! g3145
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=23
[ 144.387317] rcu: Unless rcu_sched kthread gets sufficient CPU
time, OOM is now expected behavior.
[ 144.387349] rcu: RCU grace-period kthread stack dump:
[ 144.387371] task:rcu_sched state:I stack: 0 pid: 10
ppid: 2 flags:0x07000000
[ 144.387410] Call Trace:
[ 144.387425] [<0000000000bf9c4c>] schedule+0x8c/0x100
[ 144.387455] [<0000000000bfef2c>] schedule_timeout+0x16c/0x1a0
[ 144.387481] [<00000000004f2290>] rcu_gp_fqs_loop+0x1d0/0x5e0
[ 144.387508] [<00000000004f5318>] rcu_gp_kthread+0x258/0x2a0
[ 144.387534] [<000000000048c0e0>] kthread+0x100/0x120
[ 144.387558] [<00000000004060c8>] ret_from_fork+0x1c/0x2c
[ 144.387591] [<0000000000000000>] 0x0
[ 165.393918] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 165.393975] rcu: 23-...0: (1 GPs behind)
idle=16a/1/0x4000000000000000 softirq=1365/1366 fqs=1750
[ 165.394041] (detected by 13, t=5252 jiffies, g=3149, q=182)
[ 165.394185] CPU[ 23]: TSTATE[0000000000000000]
TPC[0000000000000000] TNPC[0000000000000000] TASK[NULL:-1]
[ 165.394223] TPC[0] O7[0] I7[0] RPC[0]
[ 205.290596] Unable to handle kernel paging request at virtual
address 0000000001900000
[ 205.290657] tsk->{mm,active_mm}->context = 0000000000000b3c
[ 205.290683] tsk->{mm,active_mm}->pgd = fff8000045abc000
[ 205.290707] \|/ ____ \|/
[ 205.290707] "@'/ .. \`@"
[ 205.290707] /_| \__/ |_\
[ 205.290707] \__U_/
[ 205.290760] kworker/u512:3(394): Oops [#1]
[ 205.290786] CPU: 20 PID: 394 Comm: kworker/u512:3 Tainted: G
E 5.11.0-rc1-00073-g3516bd729358 #172
[ 205.290831] Workqueue: events_unbound flush_to_ldisc
[ 205.290868] TSTATE: 0000000080e01605 TPC: 000000000049c0d4 TNPC:
000000000049c0d8 Y: 00000000 Tainted: G E
[ 205.290911] TPC: <task_curr+0x34/0x60>
[ 205.290943] g0: 0000000000000000 g1: 0000000001900100 g2:
00000000010fbd00 g3: 0000000001100100
[ 205.290977] g4: fff800003c27d680 g5: fff80021a9c20000 g6:
fff800003df98000 g7: fff8000035e58880
[ 205.291013] o0: fff800003041ca80 o1: 00000000004743d8 o2:
0000000000000a20 o3: 0000000000000001
[ 205.291048] o4: fff800003df9b0d8 o5: dfccfb3737fbd15f sp:
fff800003df9a841 ret_pc: 000000000068db94
[ 205.291083] RPC: <kmem_cache_alloc+0x3f4/0x4c0>
[ 205.291114] l0: 00000000010e11c0 l1: 000000000000000e l2:
0000000000474300 l3: 0000000000de1250
[ 205.291149] l4: 0000000000e20ad0 l5: 0000000000edcc00 l6:
fff8000037482a98 l7: 000000000114a888
[ 205.291184] i0: 0000000000000000 i1: 0000000000000a20 i2:
fff8000040443078 i3: 0000000000edc800
[ 205.291219] i4: 0000000000edbc78 i5: fff800003041ca80 i6:
fff800003df9a8f1 i7: 0000000000474c3c
[ 205.291253] I7: <complete_signal+0x5c/0x2e0>
[ 205.291279] Call Trace:
[ 205.291295] [<0000000000474c3c>] complete_signal+0x5c/0x2e0
[ 205.291322] [<00000000004754c4>] __send_signal+0x2c4/0x4a0
[ 205.291348] [<0000000000476cd0>] send_signal+0x170/0x180
[ 205.291375] [<0000000000478a20>] group_send_sig_info+0xa0/0xe0
[ 205.291404] [<0000000000478a98>] __kill_pgrp_info+0x38/0x80
[ 205.291430] [<0000000000478b00>] kill_pgrp+0x20/0x40
[ 205.291455] [<00000000009a1490>] isig+0x70/0x140
[ 205.291485] [<00000000009a1770>] n_tty_receive_signal_char+0x10/0x80
[ 205.291514] [<00000000009a2cac>] n_tty_receive_char_special+0xcc/0x6a0
[ 205.291544] [<00000000009a3408>] n_tty_receive_buf_fast+0x188/0x220
[ 205.291573] [<00000000009a396c>] __receive_buf+0x1cc/0x300
[ 205.291600] [<00000000009a3bfc>] n_tty_receive_buf_common+0x15c/0x2e0
[ 205.291629] [<00000000009a3d9c>] n_tty_receive_buf2+0x1c/0x40
[ 205.291656] [<00000000009a6864>] tty_ldisc_receive_buf+0x24/0x80
[ 205.291686] [<00000000009a74b8>] tty_port_default_receive_buf+0x38/0x60
[ 205.291716] [<00000000009a6e30>] flush_to_ldisc+0xb0/0x120
[ 205.291741] Disabling lock debugging due to kernel taint
[ 205.291750] Caller[0000000000474c3c]: complete_signal+0x5c/0x2e0
[ 205.291760] Caller[00000000004754c4]: __send_signal+0x2c4/0x4a0
[ 205.291770] Caller[0000000000476cd0]: send_signal+0x170/0x180
[ 205.291780] Caller[0000000000478a20]: group_send_sig_info+0xa0/0xe0
[ 205.291790] Caller[0000000000478a98]: __kill_pgrp_info+0x38/0x80
[ 205.291800] Caller[0000000000478b00]: kill_pgrp+0x20/0x40
[ 205.291809] Caller[00000000009a1490]: isig+0x70/0x140
[ 205.291819] Caller[00000000009a1770]: n_tty_receive_signal_char+0x10/0x80
[ 205.291829] Caller[00000000009a2cac]: n_tty_receive_char_special+0xcc/0x6a0
[ 205.291840] Caller[00000000009a3408]: n_tty_receive_buf_fast+0x188/0x220
[ 205.291850] Caller[00000000009a396c]: __receive_buf+0x1cc/0x300
[ 205.291860] Caller[00000000009a3bfc]: n_tty_receive_buf_common+0x15c/0x2e0
[ 205.291871] Caller[00000000009a3d9c]: n_tty_receive_buf2+0x1c/0x40
[ 205.291881] Caller[00000000009a6864]: tty_ldisc_receive_buf+0x24/0x80
[ 205.291891] Caller[00000000009a74b8]: tty_port_default_receive_buf+0x38/0x60
[ 205.291901] Caller[00000000009a6e30]: flush_to_ldisc+0xb0/0x120
[ 205.291910] Caller[0000000000484eb0]: process_one_work+0x390/0x620
[ 205.291925] Caller[00000000004853f0]: worker_thread+0x2b0/0x4a0
[ 205.291935] Caller[000000000048c0e0]: kthread+0x100/0x120
[ 205.291946] Caller[00000000004060c8]: ret_from_fork+0x1c/0x2c
[ 205.291962] Caller[0000000000000000]: 0x0
[ 205.291971] Instruction DUMP:
[ 205.291974] 83306010
[ 205.291980] 83287008
[ 205.291987] 8200c001
[ 205.291992] <c25860f8>
[ 205.291999] 84004002
[ 205.292005] c258aa28
[ 205.292011] 8e19c001
[ 205.292017] 81cfe008
[ 205.292023] 9179e401
[ 205.292029]
[ 228.411239] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 228.411273] rcu: 23-...0: (1 GPs behind)
idle=16a/1/0x4000000000000000 softirq=1365/1366 fqs=7002
[ 228.411312] (detected by 9, t=21007 jiffies, g=3149, q=224)
[ 228.411439] CPU[ 23]: TSTATE[0000000000000000]
TPC[0000000000000000] TNPC[0000000000000000] TASK[NULL:-1]
[ 228.411451] TPC[0] O7[0] I7[0] RPC[0]
[ 291.429047] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 291.429084] rcu: 23-...0: (1 GPs behind)
idle=16a/1/0x4000000000000000 softirq=1365/1366 fqs=12254
[ 291.429129] (detected by 9, t=36762 jiffies, g=3149, q=265)
[ 291.429256] CPU[ 23]: TSTATE[0000000000000000]
TPC[0000000000000000] TNPC[0000000000000000] TASK[NULL:-1]
[ 291.429269] TPC[0] O7[0] I7[0] RPC[0]
[ 312.540280] watchdog: BUG: soft lockup - CPU#16 stuck for 23s!
[systemd-logind:652]
[ 312.540318] Modules linked in: tun(E) binfmt_misc(E)
camellia_sparc64(E) des_sparc64(E) aes_sparc64(E) n2_rng(E)
md5_sparc64(E) rng_core(E) flash(E) sha512_sparc64(E)
sha256_sparc64(E) sha1_sparc64(E) nft_ct(E) nf_conntrack(E)
nf_defrag_ipv6(E) nf_defrag_ipv4(E) nft_counter(E) nf_tables(E)
nfnetlink(E) sunrpc(E) fuse(E) configfs(E) ip_tables(E) x_tables(E)
autofs4(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E)
async_pq(E) raid6_pq(E) async_xor(E) xor(E) async_tx(E) raid1(E)
raid0(E) multipath(E) linear(E) md_mod(E) dm_mod(E) crc32c_sparc64(E)
sunvnet(E) sunvnet_common(E)
[ 312.540485] CPU: 16 PID: 652 Comm: systemd-logind Tainted: G D
W E 5.11.0-rc1-00073-g3516bd729358 #172
[ 312.540501] TSTATE: 0000000011001601 TPC: 00000000005183ec TNPC:
00000000005183f0 Y: 00000006 Tainted: G D W E
[ 312.540514] TPC: <smp_call_function_many_cond+0x38c/0x3e0>
[ 312.540529] g0: 000000000069952c g1: 0000000000000011 g2:
fff80021aade3420 g3: 0000000000000017
[ 312.540540] g4: fff8000044eb6180 g5: fff80021a9b20000 g6:
fff8000044f7c000 g7: 0000000000c00000
[ 312.540551] o0: 0000000000000017 o1: fff80021aac1ccc8 o2:
fff80021aaddcc80 o3: fff80021aac1cce8
[ 312.540562] o4: 0000000000000000 o5: fff80021aac1cce8 sp:
fff8000044f7e931 ret_pc: 0000000000518408
[ 312.540572] RPC: <smp_call_function_many_cond+0x3a8/0x3e0>
[ 312.540582] l0: 0000000000000100 l1: fff80021aac1ccc8 l2:
0000000000edbe64 l3: 0000000001100100
[ 312.540592] l4: 0000000001103420 l5: fff80021aac1cce8 l6:
0000000000000017 l7: 0000000000000100
[ 312.540602] i0: fff80021aac1ccc0 i1: 000000000043f000 i2:
fff8000044f7f358 i3: 0000000000000001
[ 312.540612] i4: 0000000000000000 i5: 0000000001100100 i6:
fff8000044f7e9f1 i7: 000000000051845c
[ 312.540622] I7: <smp_call_function_many+0x1c/0x40>
[ 312.540632] Call Trace:
[ 312.540639] [<000000000051845c>] smp_call_function_many+0x1c/0x40
[ 312.540649] [<00000000004402d4>] smp_flush_tlb_pending+0x34/0x60
[ 312.540662] [<000000000044ed14>] flush_tlb_pending+0x74/0xa0
[ 312.540676] [<000000000044eea4>] arch_leave_lazy_mmu_mode+0x24/0x40
[ 312.540687] [<000000000064a81c>] zap_pte_range+0x59c/0x6c0
[ 312.540698] [<000000000064aac8>] unmap_page_range+0x188/0x240
[ 312.540708] [<000000000064ac40>] unmap_single_vma+0xc0/0xe0
[ 312.540717] [<000000000064ad7c>] unmap_vmas+0x1c/0x60
[ 312.540726] [<00000000006543cc>] exit_mmap+0x14c/0x1e0
[ 312.540742] [<000000000045fcf8>] mmput+0x58/0x120
[ 312.540754] [<0000000000469cf8>] exit_mm+0x1b8/0x200
[ 312.540766] [<000000000046a1f4>] do_exit+0x4b4/0x500
[ 312.540776] [<000000000046a364>] do_group_exit+0xa4/0xc0
[ 312.540786] [<00000000004786c0>] get_signal+0x600/0x680
[ 312.540800] [<000000000042d52c>] do_signal+0x4c/0x1c0
[ 312.540817] [<000000000042de70>] do_notify_resume+0x30/0x80
[ 354.446709] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 354.446747] rcu: 23-...0: (1 GPs behind)
idle=16a/1/0x4000000000000000 softirq=1365/1366 fqs=17506
[ 354.446792] (detected by 13, t=52517 jiffies, g=3149, q=288)
[ 354.446920] CPU[ 23]: TSTATE[0000000000000000]
TPC[0000000000000000] TNPC[0000000000000000] TASK[NULL:-1]
[ 354.446933] TPC[0] O7[0] I7[0] RPC[0]
[ 417.464277] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 417.464314] rcu: 23-...0: (1 GPs behind)
idle=16a/1/0x4000000000000000 softirq=1365/1366 fqs=22758
[ 417.464359] (detected by 13, t=68272 jiffies, g=3149, q=288)
[ 417.464487] CPU[ 23]: TSTATE[0000000000000000]
TPC[0000000000000000] TNPC[0000000000000000] TASK[NULL:-1]
[ 417.464500] TPC[0] O7[0] I7[0] RPC[0]
[ 480.481809] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 480.481847] rcu: 23-...0: (1 GPs behind)
idle=16a/1/0x4000000000000000 softirq=1365/1366 fqs=28010
[ 480.481890] (detected by 13, t=84027 jiffies, g=3149, q=288)
[ 480.482018] CPU[ 23]: TSTATE[0000000000000000]
TPC[0000000000000000] TNPC[0000000000000000] TASK[NULL:-1]
[ 480.482031] TPC[0] O7[0] I7[0] RPC[0]
[ 543.499328] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 543.499366] rcu: 23-...0: (1 GPs behind)
idle=16a/1/0x4000000000000000 softirq=1365/1366 fqs=33262
[ 543.499411] (detected by 13, t=99782 jiffies, g=3149, q=288)
[ 543.499538] CPU[ 23]: TSTATE[0000000000000000]
TPC[0000000000000000] TNPC[0000000000000000] TASK[NULL:-1]
[ 543.499550] TPC[0] O7[0] I7[0] RPC[0]
[ 606.516843] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 606.516881] rcu: 23-...0: (1 GPs behind)
idle=16a/1/0x4000000000000000 softirq=1365/1366 fqs=38514
[ 606.516924] (detected by 13, t=115537 jiffies, g=3149, q=288)
[ 606.517052] CPU[ 23]: TSTATE[0000000000000000]
TPC[0000000000000000] TNPC[0000000000000000] TASK[NULL:-1]
[ 606.517065] TPC[0] O7[0] I7[0] RPC[0]
[ 624.568155] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0
nice=0 stuck for 306s!
[ 624.568322] Showing busy workqueues and worker pools:
[ 624.568374] workqueue events: flags=0x0
[ 624.568639] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 624.568663] pending: psi_avgs_work
[ 624.568685] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 624.568707] pending: vmstat_shepherd
[ 624.568813] workqueue events_unbound: flags=0x2
[ 624.568822] pwq 512: cpus=0-255 flags=0x4 nice=0 active=2/1024 refcnt=4
[ 624.568850] in-flight: 394:flush_to_ldisc flush_to_ldisc
[ 624.569131] workqueue mm_percpu_wq: flags=0x8
[ 624.569356] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 624.569377] pending: vmstat_update
[ 624.573421] pool 512: cpus=0-255 flags=0x4 nice=0 hung=302s
workers=5 idle: 395 166 7 281
[ 669.534354] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 669.534393] rcu: 23-...0: (1 GPs behind)
idle=16a/1/0x4000000000000000 softirq=1365/1366 fqs=43766
[ 669.534439] (detected by 16, t=131292 jiffies, g=3149, q=292)
[ 669.534568] CPU[ 23]: TSTATE[0000000000000000]
TPC[0000000000000000] TNPC[0000000000000000] TASK[NULL:-1]
[ 669.534581] TPC[0] O7[0] I7[0] RPC[0]
[ 696.525288] watchdog: BUG: soft lockup - CPU#16 stuck for 22s!
[systemd-logind:652]
[ 696.525330] Modules linked in: tun(E) binfmt_misc(E)
camellia_sparc64(E) des_sparc64(E) aes_sparc64(E) n2_rng(E)
md5_sparc64(E) rng_core(E) flash(E) sha512_sparc64(E)
sha256_sparc64(E) sha1_sparc64(E) nft_ct(E) nf_conntrack(E)
nf_defrag_ipv6(E) nf_defrag_ipv4(E) nft_counter(E) nf_tables(E)
nfnetlink(E) sunrpc(E) fuse(E) configfs(E) ip_tables(E) x_tables(E)
autofs4(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E)
async_pq(E) raid6_pq(E) async_xor(E) xor(E) async_tx(E) raid1(E)
raid0(E) multipath(E) linear(E) md_mod(E) dm_mod(E) crc32c_sparc64(E)
sunvnet(E) sunvnet_common(E)
[ 696.525498] CPU: 16 PID: 652 Comm: systemd-logind Tainted: G D
W EL 5.11.0-rc1-00073-g3516bd729358 #172
[ 696.525514] TSTATE: 0000000011001601 TPC: 00000000005183dc TNPC:
00000000005183e0 Y: 00000006 Tainted: G D W EL
[ 696.525526] TPC: <smp_call_function_many_cond+0x37c/0x3e0>
[ 696.525542] g0: 000000000069952c g1: 0000000000000011 g2:
fff80021aade3420 g3: 0000000000000017
[ 696.525553] g4: fff8000044eb6180 g5: fff80021a9b20000 g6:
fff8000044f7c000 g7: 0000000000c00000
[ 696.525563] o0: 0000000000000017 o1: fff80021aac1ccc8 o2:
fff80021aaddcc80 o3: fff80021aac1cce8
[ 696.525574] o4: 0000000000000000 o5: fff80021aac1cce8 sp:
fff8000044f7e931 ret_pc: 0000000000518408
[ 696.525584] RPC: <smp_call_function_many_cond+0x3a8/0x3e0>
[ 696.525594] l0: 0000000000000100 l1: fff80021aac1ccc8 l2:
0000000000edbe64 l3: 0000000001100100
[ 696.525604] l4: 0000000001103420 l5: fff80021aac1cce8 l6:
0000000000000017 l7: 0000000000000100
[ 696.525615] i0: fff80021aac1ccc0 i1: 000000000043f000 i2:
fff8000044f7f358 i3: 0000000000000001
[ 696.525625] i4: 0000000000000000 i5: 0000000001100100 i6:
fff8000044f7e9f1 i7: 000000000051845c
[ 696.525635] I7: <smp_call_function_many+0x1c/0x40>
[ 696.525645] Call Trace:
[ 696.525652] [<000000000051845c>] smp_call_function_many+0x1c/0x40
[ 696.525663] [<00000000004402d4>] smp_flush_tlb_pending+0x34/0x60
[ 696.525675] [<000000000044ed14>] flush_tlb_pending+0x74/0xa0
[ 696.525689] [<000000000044eea4>] arch_leave_lazy_mmu_mode+0x24/0x40
[ 696.525700] [<000000000064a81c>] zap_pte_range+0x59c/0x6c0
[ 696.525712] [<000000000064aac8>] unmap_page_range+0x188/0x240
[ 696.525722] [<000000000064ac40>] unmap_single_vma+0xc0/0xe0
[ 696.525731] [<000000000064ad7c>] unmap_vmas+0x1c/0x60
[ 696.525741] [<00000000006543cc>] exit_mmap+0x14c/0x1e0
[ 696.525756] [<000000000045fcf8>] mmput+0x58/0x120
[ 696.525769] [<0000000000469cf8>] exit_mm+0x1b8/0x200
[ 696.525781] [<000000000046a1f4>] do_exit+0x4b4/0x500
[ 696.525791] [<000000000046a364>] do_group_exit+0xa4/0xc0
[ 696.525801] [<00000000004786c0>] get_signal+0x600/0x680
[ 696.525815] [<000000000042d52c>] do_signal+0x4c/0x1c0
[ 696.525832] [<000000000042de70>] do_notify_resume+0x30/0x80