Problem with "rcu_sched self-detected stall on CPU" on arm64 platform

From: majun (F)
Date: Sun Sep 20 2015 - 23:31:29 EST


Hi all,
I have a CPU stall problem and need your help.

On my arm64 board, when I
[1] set maxcpus=17, or any other value greater than 16 and less than 32 (the SoC has 32 CPUs in total, spread over 2 CPU dies with 16 CPUs each), and
[2] enable CONFIG_NUMA or CONFIG_SCHED_MC, or both,
the system stalls on a CPU (log listed below).

When I set maxcpus=32, the problem is gone and the system boots fine. The exact combinations I tested are summarized right below.
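
For reference, only the relevant parts of the kernel command line and .config are shown here; everything else is unchanged between the two cases.

//-------repro setup-----------------

# kernel command line -> system stalls (any maxcpus value > 16 and < 32 behaves the same)
... maxcpus=17 ...

# kernel command line -> system boots fine
... maxcpus=32 ...

# .config fragment (either option alone, or both together, triggers it)
CONFIG_NUMA=y
CONFIG_SCHED_MC=y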

If you have ever run into this problem or know anything about it, please give me some suggestions.
Thanks
Ma Jun

//-------log-----------------

[ OK ] Reached target Swap.
[ OK ] Mounted Debug File System.
[ OK ] Mounted Huge Pages File System.
[ OK ] Mounted POSIX Message Queue File System.
[ OK ] Started Create static device nodes in /dev.
[ OK ] Started udev Coldplug all Devices.
INFO: rcu_sched self-detected stall on CPU
16: (5250 ticks this GP) idle=407/140000000000001/0 softirq=527/527 fqs=5242
INFO: rcu_sched detected stalls on CPUs/tasks:
16: (5250 ticks this GP) idle=407/140000000000001/0 softirq=527/527 fqs=5242
(detected by 0, t=5252 jiffies, g=229, c=228, q=3574)
Task dump for CPU 16:
systemd-journal R running task 0 978 1 0x00000002
Call trace:
[<ffffffc000086c5c>] __switch_to+0x74/0x8c
(t=5260 jiffies g=229 c=228 q=3574)
Task dump for CPU 16:
systemd-journal R running task 0 978 1 0x00000002
Call trace:
[<ffffffc000089904>] dump_backtrace+0x0/0x124
[<ffffffc000089a38>] show_stack+0x10/0x1c
[<ffffffc0000d65f4>] sched_show_task+0x94/0xdc
[<ffffffc0000d99b0>] dump_cpu_task+0x3c/0x4c
[<ffffffc0000f947c>] rcu_dump_cpu_stacks+0x98/0xe8
[<ffffffc0000fca34>] rcu_check_callbacks+0x47c/0x788
[<ffffffc0000ffddc>] update_process_times+0x38/0x6c
[<ffffffc00010ec80>] tick_sched_handle.isra.16+0x1c/0x68
[<ffffffc00010ed0c>] tick_sched_timer+0x40/0x88
[<ffffffc00010088c>] __run_hrtimer.isra.34+0x4c/0x10c
[<ffffffc000100b88>] hrtimer_interrupt+0xd0/0x258
[<ffffffc0004f0acc>] arch_timer_handler_phys+0x28/0x38
[<ffffffc0000f3760>] handle_percpu_devid_irq+0x74/0x9c
[<ffffffc0000ef524>] generic_handle_irq+0x30/0x4c
[<ffffffc0000ef83c>] __handle_domain_irq+0x5c/0xac
[<ffffffc000082524>] gic_handle_irq+0xb8/0x1c8
Exception stack(0xffffffef5f343af0 to 0xffffffef5f343c10)
3ae0: 7fb069c0 ffffffef 7fb069c8 ffffffef
3b00: 5f343c70 ffffffef 00113990 ffffffc0 80000145 00000000 00000001 00000000
3b20: 001130c4 ffffffc0 00000000 00000000 00856718 ffffffc0 00000040 00000000
3b40: 00000210 00000000 00856000 ffffffc0 7fa19ff8 ffffffef 7fa19fe0 ffffffef
3b60: 00000001 00000000 008568f0 ffffffc0 00000001 00000000 fffffffe ffffffff
3b80: 00000000 00000000 00000000 00000000 00000900 00000000 65747379 6a2f646d
3ba0: 6c616e72 636f732f 6e72756f 732f6c61 65747379 6a2f646d ffffffff ffffffff
3bc0: 95716a94 0000007f 00005749 00000000 00511f54 ffffffc0 957f09d0 0000007f
3be0: e42b0110 0000007f 7fb069c0 ffffffef 7fb069c8 ffffffef 00856000 ffffffc0
3c00: 00856718 ffffffc0 00846980 ffffffc0
[<ffffffc0000855a4>] el1_irq+0x64/0xc0
[<ffffffc000113ab0>] kick_all_cpus_sync+0x24/0x30
[<ffffffc00008c4ac>] aarch64_insn_patch_text+0x84/0x90
[<ffffffc000093860>] arch_jump_label_transform+0x58/0x64
[<ffffffc00013970c>] __jump_label_update+0x68/0x84
[<ffffffc0001397ac>] jump_label_update+0x84/0xa8
[<ffffffc0001398c4>] static_key_slow_inc+0xf4/0xfc
[<ffffffc000524bd8>] net_enable_timestamp+0x6c/0x7c
[<ffffffc000516050>] sock_enable_timestamp+0x70/0x7c
[<ffffffc000516290>] sock_setsockopt+0x234/0x838
[<ffffffc000511fe8>] SyS_setsockopt+0x94/0xa8
NMI watchdog: BUG: soft lockup - CPU#16 stuck for 22s! [systemd-journal:978]
Modules linked in:

CPU: 16 PID: 978 Comm: systemd-journal Not tainted 4.1.6+ #9
Hardware name: Hisilicon PhosphorV660 2P1S Development Board (DT)
task: ffffffef5f38b700 ti: ffffffef5f340000 task.ti: ffffffef5f340000
PC is at smp_call_function_many+0x284/0x2f0
LR is at smp_call_function_many+0x250/0x2f0
pc : [<ffffffc000113990>] lr : [<ffffffc00011395c>] pstate: 80000145
sp : ffffffef5f343c70
x29: ffffffef5f343c70 x28: 0000000000000040
x27: ffffffc000856718 x26: 0000000000000000
x25: ffffffc0001130c4 x24: 0000000000000001
x23: ffffffc000846980 x22: ffffffc000856718
x21: ffffffc000856000 x20: ffffffef7fb069c8
x19: ffffffef7fb069c0 x18: 0000007fe42b0110
x17: 0000007f957f09d0 x16: ffffffc000511f54
x15: 0000000000005749 x14: 0000007f95716a94
x13: ffffffffffffffff x12: 6a2f646d65747379
x11: 732f6c616e72756f x10: 636f732f6c616e72
x9 : 6a2f646d65747379 x8 : 0000000000000900
x7 : 0000000000000000 x6 : 0000000000000000
x5 : fffffffffffffffe x4 : 0000000000000001
x3 : ffffffc0008568f0 x2 : 0000000000000001
x1 : ffffffef7fa19fe0 x0 : ffffffef7fa19ff8
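
As far as I understand (I have not checked this in the source, so please correct me if I am wrong), the PC in the soft-lockup dump is in smp_call_function_many(), reached from kick_all_cpus_sync() during jump-label patching, and that path sends an IPI to the other online CPUs and then busy-waits until every target acknowledges. The small userspace program below is only an illustration of that wait pattern, not kernel code: if one "CPU" (thread) never acknowledges, the caller spins forever, which looks very much like what CPU 16 is doing in the log above.

//-------demo (userspace illustration only, NOT kernel code)-----------------

/*
 * Mimics the "broadcast a request, then spin until every target
 * acknowledges" pattern that smp_call_function_many(..., wait=1)
 * appears to use per the backtrace above.  Target 3 deliberately
 * never acknowledges, so the caller spins forever -- roughly what
 * an RCU stall / soft lockup on the calling CPU looks like.
 *
 * Build: cc -O2 -pthread demo.c -o demo
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define NR_TARGETS 4

/* One pending-call flag per simulated target CPU. */
static atomic_int pending[NR_TARGETS];

static void *target_cpu(void *arg)
{
        int id = (int)(long)arg;

        for (;;) {
                if (atomic_load(&pending[id])) {
                        if (id == 3)
                                pause();                /* this "CPU" never responds */
                        atomic_store(&pending[id], 0);  /* acknowledge */
                }
                usleep(1000);
        }
        return NULL;
}

int main(void)
{
        pthread_t tids[NR_TARGETS];

        for (long i = 0; i < NR_TARGETS; i++)
                pthread_create(&tids[i], NULL, target_cpu, (void *)i);

        /* "Broadcast the IPI": mark the call pending on every target. */
        for (int i = 0; i < NR_TARGETS; i++)
                atomic_store(&pending[i], 1);

        /*
         * Wait for every target to acknowledge -- this is the spin the
         * stalled CPU is stuck in.  With target 3 unresponsive we never
         * get past i == 3.
         */
        for (int i = 0; i < NR_TARGETS; i++) {
                while (atomic_load(&pending[i]))
                        ;       /* busy-wait */
                printf("target %d acknowledged\n", i);
        }

        printf("all targets acknowledged (never reached)\n");
        return 0;
}

If one of the target CPUs never receives or never finishes handling that IPI, the caller would spin exactly like this; why the value of maxcpus makes the difference, I do not know.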
