Re: [PATCH] timer_list: avoid other cpu soft lockup when printing timer list

From: Yang Yingliang
Date: Mon Mar 09 2020 - 04:21:20 EST


Hi,

Sorry for the late reply.

On 2020/2/21 9:41, Stephen Boyd wrote:
Quoting Yang Yingliang (2020-02-19 19:42:32)
If the system has many CPUs (e.g. 128), it takes a long time to print
the messages to the console when executing echo q > /proc/sysrq-trigger.

When /proc/sys/kernel/numa_balancing is enabled and the migration threads
are woken up, the migration thread on the CPU that is printing can't run
until the printing finishes, so another migration thread may trigger a
soft lockup.

PID: 619 TASK: ffffa02fdd8bec80 CPU: 121 COMMAND: "migration/121"
#0 [ffff00000a103b10] __crash_kexec at ffff0000081bf200
#1 [ffff00000a103ca0] panic at ffff0000080ec93c
#2 [ffff00000a103d80] watchdog_timer_fn at ffff0000081f8a14
#3 [ffff00000a103e00] __run_hrtimer at ffff00000819701c
#4 [ffff00000a103e40] __hrtimer_run_queues at ffff000008197420
#5 [ffff00000a103ea0] hrtimer_interrupt at ffff00000819831c
#6 [ffff00000a103f10] arch_timer_dying_cpu at ffff000008b53144
#7 [ffff00000a103f30] handle_percpu_devid_irq at ffff000008174e34
#8 [ffff00000a103f70] generic_handle_irq at ffff00000816c5e8
#9 [ffff00000a103f90] __handle_domain_irq at ffff00000816d1f4
#10 [ffff00000a103fd0] gic_handle_irq at ffff000008081860
--- <IRQ stack> ---
#11 [ffff00000d6e3d50] el1_irq at ffff0000080834c8
#12 [ffff00000d6e3d60] multi_cpu_stop at ffff0000081d9964
#13 [ffff00000d6e3db0] cpu_stopper_thread at ffff0000081d9cfc
#14 [ffff00000d6e3e10] smpboot_thread_fn at ffff00000811e0a8
#15 [ffff00000d6e3e70] kthread at ffff000008118988

To avoid this soft lockup, add touch_all_softlockup_watchdogs()
in sysrq_timer_list_show().

Signed-off-by: Yang Yingliang <yangyingliang@xxxxxxxxxx>
---
kernel/time/timer_list.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index acb326f..4cb0e6f 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -289,13 +289,17 @@ void sysrq_timer_list_show(void)
timer_list_header(NULL, now);
- for_each_online_cpu(cpu)
+ for_each_online_cpu(cpu) {
+ touch_all_softlockup_watchdogs();
Usage of touch_all_softlockup_watchdogs() deserves a comment. Otherwise
the reader is left to git archaeology to understand why watchdogs are
being touched. Of course, we failed at that with commit 010704276865
("sysrq: Reset the watchdog timers while displaying high-resolution
timers") which looks awfully similar to this.
OK, I will add a comment later.

print_cpu(NULL, cpu, now);
+ }
#ifdef CONFIG_GENERIC_CLOCKEVENTS
timer_list_show_tickdevices_header(NULL);
- for_each_online_cpu(cpu)
+ for_each_online_cpu(cpu) {
+ touch_all_softlockup_watchdogs();
print_tickdevice(NULL, tick_get_device(cpu), cpu);
print_tickdevice() already has touch_nmi_watchdog() which eventually
touches the softlockup watchdog. Is the problem that it isn't enough to
do that when the migration thread is also running?
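No, it's not enough.
touch_nmi_watchdog() only touches the soft lockup watchdog of the CPU that
is printing. The soft lockup occurs on the other CPUs, so the other CPUs'
soft lockup watchdogs need to be touched as well.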
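
To illustrate the difference, here is a minimal userspace sketch. It is not
kernel code: every model_* name and both constants are hypothetical, and it
only assumes that CPU 0 prints for a number of one-second steps while the
other CPUs wait for it without being able to touch their own watchdogs.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS	4	/* hypothetical small machine */
#define MODEL_THRESH	20	/* model soft lockup threshold, in seconds */

/* Per-CPU "last touched" timestamps, standing in for the watchdog state. */
static long model_touch_ts[MODEL_NR_CPUS];

/* Models touching only the watchdog of the given (printing) CPU. */
static void model_touch_local(int cpu, long now)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	model_touch_ts[cpu] = now;
}

/* Models touching the watchdog of every CPU. */
static void model_touch_all(long now)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		model_touch_ts[cpu] = now;
}

/*
 * CPU 0 prints for 'steps' seconds while all other CPUs are stuck waiting.
 * Returns how many CPUs would have reported a soft lockup in this model.
 */
static int model_run(int steps, bool touch_all)
{
	bool fired[MODEL_NR_CPUS] = { false };
	int count = 0;

	model_touch_all(0);	/* every CPU starts with a fresh timestamp */

	for (long now = 1; now <= steps; now++) {
		if (touch_all)
			model_touch_all(now);
		else
			model_touch_local(0, now);

		/* Check each CPU's timestamp against the threshold. */
		for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
			if (now - model_touch_ts[cpu] > MODEL_THRESH)
				fired[cpu] = true;
	}

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (fired[cpu])
			count++;

	return count;
}
```

In this model, model_run(30, false) reports a lockup on every CPU except the
printing one, while model_run(30, true) reports none.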


+ }
#endif
return;